
Technical Documentation

Apartment Domotics & IoT Sensors — End-to-end data pipeline

version: 1.0
date: April 2, 2026
team: Group 14, HES-SO Valais
project: 64-61 Data Cycle

01

System Overview

The Data Cycle project (Use Case 2) processes IoT sensor data from two smart apartments in Valais, Switzerland. The system ingests raw sensor readings (temperature, humidity, CO2, motion, door/window, energy consumption) and weather forecasts, transforms them through a medallion architecture (Bronze, Silver, Gold), and serves them to BI dashboards and ML models.

Project: Data Cycle -- UC2 Apartment Domotics & IoT Sensors
Team: Group 14 -- HES-SO Valais, Haute Ecole de Gestion, 2026
Data period: August 18 -- October 27, 2023
Apartments: 2 (Jimmy, Jeremie) -- ~245,000 JSON sensor files
Data volume: 15M+ rows in Silver, 6 sensor types, 4 weather measurements
VM: Windows 11, on-premise (school network)
Repository: github.com/dehlya/data-cycle-domotic
02

Architecture

The system follows the medallion architecture pattern with three data layers, each serving a distinct purpose. Data flows from external sources through Bronze (raw), Silver (cleaned), and Gold (aggregated) before reaching BI tools and ML models.

Sources → Bronze → Silver → Gold → BI / ML

| layer | purpose | storage | grain |
|---|---|---|---|
| Bronze | Raw, immutable copy of source data | File system (NTFS) | Per-file (JSON, CSV) |
| Silver | Cleaned, deduplicated, full-resolution data | PostgreSQL (silver schema) | Per-reading (1 row per sensor per minute) |
| Gold | Aggregated, business-ready star schema | PostgreSQL (gold schema) | Per-minute (facts) and per-day (health) |

Key principle: Every ETL script is idempotent and resume-capable. Scripts can be interrupted and re-run safely without creating duplicates, thanks to watermark tracking and ON CONFLICT upserts.
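The upsert pattern behind this guarantee can be sketched in a few lines. This is an illustration only: it uses Python's built-in sqlite3 (whose ON CONFLICT DO UPDATE clause mirrors PostgreSQL's) and a simplified sensor_events table, not the project's actual schema.

```python
import sqlite3

# SQLite (3.24+) as a stand-in for PostgreSQL: both support
# INSERT ... ON CONFLICT DO UPDATE. Columns are simplified.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_events (
        apartment TEXT, room TEXT, sensor_type TEXT,
        field TEXT, ts TEXT, value REAL,
        UNIQUE (apartment, room, sensor_type, field, ts)
    )
""")

upsert = """
    INSERT INTO sensor_events (apartment, room, sensor_type, field, ts, value)
    VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT (apartment, room, sensor_type, field, ts)
    DO UPDATE SET value = excluded.value
"""

row = ("jimmy", "Kitchen", "plug", "power", "2023-08-18T10:00:00Z", 42.0)
conn.execute(upsert, row)
conn.execute(upsert, row)  # re-running the same insert is safe: no duplicate

count = conn.execute("SELECT COUNT(*) FROM sensor_events").fetchone()[0]
print(count)  # 1
```

Because the natural key is enforced by a UNIQUE constraint, replaying any batch of Bronze files converges to the same Silver state.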

03

Technology Stack

| layer | technology | version | purpose |
|---|---|---|---|
| Ingestion | Python | 3.11+ | All ETL scripts, async I/O, parallel processing |
| Orchestration | watcher.py | custom | 60s loop, prediction-based, nightly scan, daily weather |
| Bronze storage | NTFS file system | Windows 11 | Timestamped directory structure, immutable raw files |
| Silver / Gold | PostgreSQL | 15+ | Relational DB with schema separation (silver, gold) |
| ETL | pandas + SQLAlchemy | 2.1 / 2.0 | DataFrame transforms, parameterized SQL, connection pooling |
| sFTP | paramiko | 3.4 | SSH transport for weather CSV download |
| BI | Power BI + SAP SAC | latest | Energy/environment dashboards + presence analytics |
| ML | KNIME | latest | Energy forecasting, presence prediction workflows |
| Website | Next.js + Tailwind | 14.1 / 3 | Project documentation and team hub |
04

Data Sources

source 1 — sensor JSON (SMB share)

Two apartments (Jimmy, Jeremie) generate one JSON file per minute per apartment, stored on an SMB network share (Z:\). Each file contains nested sensor readings: plugs, doors/windows, motions, meteo stations, humidity sensors, and consumption meters.

Filename format: DD.MM.YYYY HHMM_ApartmentName_received.json — e.g. 18.08.2023 1000_JimmyLoup_received.json

| apartment (filename) | mapped to | bronze path |
|---|---|---|
| JimmyLoup | jimmy | bronze/jimmy/YYYY/MM/DD/HH/ |
| JeremieVianin | jeremie | bronze/jeremie/YYYY/MM/DD/HH/ |
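The filename-to-path mapping can be sketched as follows. Only the filename format and folder layout come from this document; the helper name `bronze_path` and the parsing strategy are illustrative, not taken from the actual scripts.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Documented mapping from raw filename owner to apartment slug.
APARTMENT_MAP = {"JimmyLoup": "jimmy", "JeremieVianin": "jeremie"}

def bronze_path(filename: str) -> str:
    """Derive bronze/<apartment>/YYYY/MM/DD/HH from a raw sensor filename."""
    # "18.08.2023 1000_JimmyLoup_received.json"
    # -> stamp="18.08.2023 1000", owner="JimmyLoup"
    stamp, owner, _ = filename.split("_", 2)
    ts = datetime.strptime(stamp, "%d.%m.%Y %H%M")
    apartment = APARTMENT_MAP[owner]
    return str(PurePosixPath("bronze", apartment, ts.strftime("%Y"),
                             ts.strftime("%m"), ts.strftime("%d"),
                             ts.strftime("%H")))

print(bronze_path("18.08.2023 1000_JimmyLoup_received.json"))
# bronze/jimmy/2023/08/18/10
```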

JSON structure (top-level fields)

| field | type | description |
|---|---|---|
| user | string | Apartment owner identifier |
| api_token | string | Auth token (masked/removed in Silver) |
| datetime | string | Collection timestamp — DD.MM.YYYY HH:MM |
| plugs | object | Smart plugs keyed by room name |
| doorsWindows | object | Door/window sensors keyed by room name |
| motions | object | Motion sensors keyed by room name |
| meteos | object | Environmental sensors nested under meteo key |
| humidities | object | Humidity sensors keyed by room name |
| consumptions | object | Whole-house energy consumption keyed by House |

sensor types — fields extracted per category

| sensor type | JSON key | fields extracted | units |
|---|---|---|---|
| plug | plugs | power, total, temperature | W, Wh, °C |
| door | doorsWindows (type=door) | open, battery | bool, % |
| window | doorsWindows (type=window) | open, battery | bool, % |
| motion | motions | motion, light, temperature | bool, lux, °C |
| meteo | meteos.meteo | temperature_c, co2_ppm, humidity_pct, noise_db, pressure_hpa, battery | °C, ppm, %, dB, hPa, % |
| humidity | humidities | temperature, humidity, battery | °C, %, % |
| consumption | consumptions | total_power, power1-3, current1-3, voltage1-3 | W, A, V |
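As an illustration of the extraction rules above, here is a minimal flattening sketch for the plug category only. The function name and row shape are hypothetical; the real flatten() in flatten_sensors.py covers all six categories plus room normalization and outlier flags.

```python
# Hypothetical sketch: turn one nested plug reading into flat event rows,
# one row per (room, field), matching the documented extraction table.
def flatten_plugs(apartment: str, ts: str, plugs: dict) -> list[dict]:
    units = {"power": "W", "total": "Wh", "temperature": "°C"}
    rows = []
    for room, reading in plugs.items():
        for field, unit in units.items():
            if field in reading:
                rows.append({
                    "apartment": apartment, "room": room,
                    "sensor_type": "plug", "field": field,
                    "value": float(reading[field]), "unit": unit,
                    "timestamp": ts,
                })
    return rows

sample = {"Kitchen": {"power": 42.0, "total": 10500.0, "temperature": 23.1}}
rows = flatten_plugs("jimmy", "2023-08-18T10:00:00Z", sample)
print(len(rows))  # 3
```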

source 2 — MySQL database (school network)

Static reference tables from the school's MySQL database (pidb at 10.130.25.152:3306). Contains building metadata, room details, sensor mappings, device inventories, and energy profiles.

| MySQL table | Silver table | description |
|---|---|---|
| buildings | dim_buildings | Apartment metadata, location, building year |
| buildingtype | dim_building_types | Maison / Appartement lookup |
| rooms | dim_rooms | Room details, sensor counts, orientation, m² |
| sensors | dim_sensors | Sensor IPs mapped to rooms |
| devices | dim_devices | Appliances per room (fridge, washer, etc.) |
| profilereference | ref_energy_profiles | Reference energy consumption kWh/yr |
| profile | ref_power_snapshots | Power consumption snapshots over time |
| parameters | ref_parameters | Threshold configs per building |
| parameterstype | ref_parameters_type | Parameter type lookup |
| dierrors | log_sensor_errors | Sensor error logs — null values, failures |

GDPR: Tables skipped: users (names, emails, passwords, phones), actions, achievements, badges, events, eventsgeneric, categories, userrelationships — personal data or irrelevant for analytics.

source 3 — weather CSV (sFTP)

Daily weather forecast files from an sFTP server (/Meteo2). File format: Pred_YYYY-MM-DD.csv. Each CSV has columns: Time, Value, Prediction, Site, Measurement, Unit.

| measurement code | mapped column | unit |
|---|---|---|
| PRED_T_2M_ctrl | temperature_c | °C |
| PRED_RELHUM_2M_ctrl | humidity_pct | % |
| PRED_TOT_PREC_ctrl | precipitation_mm | mm |
| PRED_GLOB_ctrl | radiation_wm2 | W/m² |

Sentinel value: -99999.0 in the Value column means missing data. Converted to NULL during cleaning.

05

Pipeline: Sources to Bronze

watcher.py — orchestrator

Continuous loop running every 60 seconds. Instead of scanning 245k+ files on SMB (~72s), it predicts the next expected filenames from the last known timestamp and checks with .exists() calls (~0.01s). A nightly full scan at midnight catches any missed files. Daily at 07:30, it triggers the weather pipeline as a subprocess.
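The prediction trick can be sketched as follows; function and variable names are illustrative, not taken from watcher.py.

```python
from datetime import datetime, timedelta

# Sketch: instead of scanning the whole SMB share, generate the filenames
# expected since the last known timestamp (one file per minute per
# apartment) and test each with a cheap .exists() call.
def predict_filenames(last_seen: datetime, now: datetime, owner: str) -> list[str]:
    names, t = [], last_seen + timedelta(minutes=1)
    while t <= now:
        names.append(t.strftime("%d.%m.%Y %H%M") + f"_{owner}_received.json")
        t += timedelta(minutes=1)
    return names

expected = predict_filenames(datetime(2023, 8, 18, 10, 0),
                             datetime(2023, 8, 18, 10, 3), "JimmyLoup")
print(len(expected))  # 3 expected files for the 3 elapsed minutes
```

Checking a handful of predicted names costs milliseconds, which is why the loop comfortably fits in its 60-second budget; the nightly full scan backstops any gap in the prediction.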

bulk_to_bronze.py — sensor file copy

Copies new JSON files from SMB to local Bronze storage. Two modes: prediction (default, fast) and full scan (--full flag). Uses 16 parallel threads. Files are stored in timestamped folders: bronze/jimmy/YYYY/MM/DD/HH/filename.json. Skips files that already exist locally.

weather_download.py — sFTP download

Connects to the sFTP server via paramiko, lists CSV files in the remote directory, downloads new ones to bronze/weather/YYYY/MM/DD/. Configurable retry logic (default: 3 attempts, 600s delay). Logs to logs/weather_download.log.
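A generic retry wrapper matching the documented policy (3 attempts, 600 s delay) might look like the sketch below; the actual implementation inside weather_download.py may differ, and the helper names are assumptions.

```python
import time

# Sketch of retry logic around a connection-level operation, mirroring
# the documented defaults (max_retries=3, delay of 600 s between tries).
def with_retries(fn, max_retries: int = 3, delay_s: float = 600.0):
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except OSError as err:  # connection refused, timeouts, etc.
            last_err = err
            if attempt < max_retries:
                time.sleep(delay_s)
    raise last_err

# Demo with a fake download that fails once, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 2:
        raise OSError("connection refused")
    return "downloaded"

result = with_retries(flaky, max_retries=3, delay_s=0)
print(result)  # downloaded
```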

| metric | prediction mode | full scan mode |
|---|---|---|
| Files on SMB | 246,000 | 246,000 |
| New files = 2 | 0.4s | ~80s |
| New files = 1,245 | 20s | ~85s |
06

Pipeline: Bronze to Silver

flatten_sensors.py — JSON to sensor_events

Parses Bronze JSON files, extracts all sensor readings, normalizes room names (Bhroom to Bathroom, Bdroom to Bedroom, etc.), flags outliers, and upserts into silver.sensor_events. Uses 8 parallel workers with batches of 2,000 files. Watermark tracking via silver.etl_watermark prevents reprocessing.

1. Load watermark -- reads processed filenames from silver.etl_watermark
2. Find new files (newest first) -- walks Bronze folders backwards, stops after 50 consecutive already-processed files
3. Parallel processing -- splits into batches of 2,000, processed by 8 workers (ProcessPoolExecutor)
4. Flatten -- each worker opens the JSON, parses the datetime, calls flatten() to extract all sensor readings
5. Upsert -- inserts into silver.sensor_events with ON CONFLICT DO UPDATE
6. Mark done -- adds processed filenames to the watermark table

room normalization

Raw JSON room names are inconsistent. The flatten function maps them to standardized names:

| raw name | normalized |
|---|---|
| Bhroom | Bathroom |
| Bdroom | Bedroom |
| Livingroom | Living Room |
| Office | Office |
| Kitchen | Kitchen |
| Laundry | Laundry |
| Outdoor | Outdoor |
| House | House |
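The mapping amounts to a plain dict lookup; the fallback for unknown room names shown here is an assumption, not documented behaviour.

```python
# Room-name normalization table, taken from the documentation above.
ROOM_MAP = {
    "Bhroom": "Bathroom", "Bdroom": "Bedroom", "Livingroom": "Living Room",
    "Office": "Office", "Kitchen": "Kitchen", "Laundry": "Laundry",
    "Outdoor": "Outdoor", "House": "House",
}

def normalize_room(raw: str) -> str:
    # Assumed fallback: title-case anything not in the table.
    return ROOM_MAP.get(raw, raw.title())

print(normalize_room("Bhroom"))      # Bathroom
print(normalize_room("Livingroom"))  # Living Room
```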

outlier detection bounds

Values outside these physical bounds are flagged with is_outlier = TRUE but not removed. This preserves data for audit while allowing BI/Gold to filter.

| field | min | max | unit |
|---|---|---|---|
| temperature_c | -20 | 60 | °C |
| humidity_pct | 0 | 100 | % |
| co2_ppm | 300 | 5,000 | ppm |
| noise_db | 0 | 140 | dB |
| pressure_hpa | 870 | 1,085 | hPa |
| power | 0 | 10,000 | W |
| battery | 0 | 100 | % |
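The flagging logic implied by the table is a simple range check; this sketch keeps the documented bounds but the function name is illustrative.

```python
# Physical bounds from the table above; readings outside are flagged
# with is_outlier = TRUE but kept for audit, never dropped.
BOUNDS = {
    "temperature_c": (-20, 60), "humidity_pct": (0, 100),
    "co2_ppm": (300, 5000), "noise_db": (0, 140),
    "pressure_hpa": (870, 1085), "power": (0, 10000), "battery": (0, 100),
}

def is_outlier(field: str, value: float) -> bool:
    lo, hi = BOUNDS.get(field, (float("-inf"), float("inf")))
    return not (lo <= value <= hi)

print(is_outlier("co2_ppm", 6200))      # True  -- above the 5,000 ppm cap
print(is_outlier("temperature_c", 21))  # False -- plausible indoor reading
```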

import_mysql_to_silver.py — dimension tables

Snapshots 10 reference tables from the school MySQL database into PostgreSQL Silver. All columns imported as TEXT (type casting done in Gold). Idempotent: DROP + CREATE on each run.

clean_weather.py — weather CSV processing

Reads raw weather CSVs from Bronze, validates required columns (Time, Value, Prediction, Site, Measurement, Unit), cleans timestamps as UTC, filters by WEATHER_MIN_YEAR, deduplicates by timestamp/site/measurement, maps 4 measurement types to standardized columns, replaces sentinel values (-99999.0) with NULL, pivots from long to wide format, flags outliers, and upserts into silver.weather_clean. Watermark via silver.weather_watermark. Logs row-drop counts at each step to logs/clean_weather.log.
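The sentinel replacement and long-to-wide pivot can be sketched with pandas; the sample data below is invented, and the exact pivot details inside clean_weather.py may differ.

```python
import pandas as pd

# Invented two-row sample in the documented long format.
raw = pd.DataFrame({
    "Time": ["2023-08-18 10:00"] * 2,
    "Site": ["Sion"] * 2,
    "Measurement": ["PRED_T_2M_ctrl", "PRED_RELHUM_2M_ctrl"],
    "Value": [24.5, -99999.0],  # -99999.0 is the missing-data sentinel
})
rename = {"PRED_T_2M_ctrl": "temperature_c",
          "PRED_RELHUM_2M_ctrl": "humidity_pct"}

# Sentinel -> NULL, then pivot one row per (Time, Site) with one
# standardized column per measurement code.
raw["Value"] = raw["Value"].replace(-99999.0, float("nan"))
wide = (raw.pivot(index=["Time", "Site"], columns="Measurement", values="Value")
           .rename(columns=rename)
           .reset_index())
print(wide)
```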

weather outlier bounds

| field | min | max |
|---|---|---|
| temperature_c | -50 | 60 |
| humidity_pct | 0 | 100 |
| precipitation_mm | 0 | 500 |
| radiation_wm2 | 0 | 1,500 |

performance

| scenario | files | rows | time |
|---|---|---|---|
| First run (full backfill) | 245,000 | 15M+ | ~3.5 hours |
| Incremental (after watcher) | 2-20 | 100-1,500 | 2-4 seconds |
| Nothing new | 0 | 0 | 1-2 seconds |
07

Pipeline: Silver to Gold

create_gold.py — schema creation

Creates the gold schema with admin privileges, then creates all dimension and fact tables as the app user (so the app user owns them). Idempotent: uses CREATE TABLE IF NOT EXISTS. Creates indexes on date_key, apartment_key, room_key for fast BI queries.

populate_gold.py — 9-step process

1. dim_date -- extract unique dates from sensor_events, compute day_of_week, week, month, year, is_weekend
2. dim_datetime -- extract minute-level timestamps, truncate to minute grain, compute time attributes
3. dim_apartment -- distinct apartments, enriched from silver.dim_buildings (building_name, building_id)
4. dim_room -- distinct rooms per apartment, linked via apartment_key
5. dim_device -- synthetic device_id (apartment_room_sensortype), linked to room_key
6. fact_energy_minute -- GROUP BY minute/device: MAX(power), MAX(total energy), Wh to kWh conversion
7. fact_environment_minute -- GROUP BY minute/room: MAX per metric, BOOL_OR for window/door flags
8. fact_presence_minute -- GROUP BY minute/room: SUM(motion), BOOL_OR(motion OR door open) = presence
9. fact_device_health_day -- GROUP BY day/device: MIN/AVG battery percentage

Pattern: All fact tables use GROUP BY date_trunc('minute', timestamp) with CASE WHEN pivots. Upserts via ON CONFLICT DO UPDATE. Currently a full reload each run -- future: watermark-based incremental.
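The same minute-grain rollup can be sketched in pandas rather than SQL (device naming here is illustrative): flooring each timestamp to the minute is the equivalent of date_trunc('minute', ...), and the groupby MAX mirrors the fact-table aggregation.

```python
import pandas as pd

# Three raw readings for one device spanning two minutes.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-08-18 10:00:12",
                                 "2023-08-18 10:00:48",
                                 "2023-08-18 10:01:05"]),
    "device": ["jimmy_Kitchen_plug"] * 3,
    "power_w": [40.0, 45.0, 41.0],
})

# date_trunc('minute', timestamp) equivalent, then MAX per minute/device.
events["minute"] = events["timestamp"].dt.floor("min")
fact = events.groupby(["minute", "device"], as_index=False)["power_w"].max()
print(fact)  # two rows: 10:00 -> 45.0 W, 10:01 -> 41.0 W
```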

08

Pipeline: Gold to BI / ML

BI dashboards (planned)

| dashboard | tool | fact table | key metrics |
|---|---|---|---|
| Energy | Power BI | fact_energy_minute | Power (W), energy (kWh), cost (CHF) |
| Environment | Power BI | fact_environment_minute | Temperature, humidity, CO2, noise, pressure |
| Presence | SAP Analytics Cloud | fact_presence_minute | Motion count, presence flag, door activity |
| Sensor Health | Power BI | fact_device_health_day | Battery levels, uptime, error counts |

row-level security (RLS)

Each apartment owner only sees their own data. Implemented via Power BI RLS roles filtering on apartment_key. Three roles: role_jimmy, role_jeremie (filtered), and role_admin (unfiltered, all apartments).

ML models (in progress)

Built with KNIME Analytics Platform. Two models: energy consumption prediction (considering weather forecasts and historical patterns) and room presence prediction (from motion/door sensor patterns). Results will be loaded back into gold.fact_ml_predictions (future table).

09

Database Schemas

silver.sensor_events (main fact table)

| column | type | description |
|---|---|---|
| id | BIGSERIAL | Primary key |
| apartment | VARCHAR(20) | 'jimmy' or 'jeremie' |
| room | VARCHAR(50) | Normalized: 'Bathroom', 'Kitchen', 'Living Room', etc. |
| sensor_type | VARCHAR(20) | 'plug', 'motion', 'meteo', 'door', 'window', 'humidity', 'consumption' |
| field | VARCHAR(50) | 'power', 'temperature_c', 'co2_ppm', 'open', 'motion', etc. |
| value | FLOAT | The sensor reading |
| unit | VARCHAR(10) | 'W', '°C', 'ppm', '%', 'bool', etc. |
| timestamp | TIMESTAMPTZ | When the reading was taken (UTC) |
| is_outlier | BOOLEAN | TRUE if value outside physical bounds |

Constraints: UNIQUE (apartment, room, sensor_type, field, timestamp). Indexes on timestamp, apartment, sensor_type.

silver.weather_clean

| column | type | description |
|---|---|---|
| id | BIGSERIAL | Primary key |
| timestamp | TIMESTAMPTZ | Forecast timestamp (UTC) |
| site | VARCHAR(50) | Weather station name |
| temperature_c | FLOAT | Predicted temperature |
| humidity_pct | FLOAT | Predicted humidity |
| precipitation_mm | FLOAT | Predicted precipitation |
| radiation_wm2 | FLOAT | Predicted solar radiation |
| is_outlier | BOOLEAN | TRUE if any value outside bounds |

Constraints: UNIQUE (timestamp, site). Index on timestamp.

silver — other tables

| table | purpose | source |
|---|---|---|
| etl_watermark | Tracks processed sensor JSON filenames | flatten_sensors.py |
| weather_watermark | Tracks processed weather CSV filenames | clean_weather.py |
| dim_buildings | Building metadata | MySQL |
| dim_building_types | Maison / Appartement lookup | MySQL |
| dim_rooms | Room details, sensor counts, m² | MySQL |
| dim_sensors | Sensor IPs mapped to rooms | MySQL |
| dim_devices | Appliances per room | MySQL |
| ref_energy_profiles | Reference kWh/yr by type | MySQL |
| ref_power_snapshots | Power consumption snapshots | MySQL |
| ref_parameters | Threshold configs per building | MySQL |
| ref_parameters_type | Parameter type lookup | MySQL |
| log_sensor_errors | Sensor error logs | MySQL |

gold schema — dimension tables

| table | grain | PK | unique constraint | key columns |
|---|---|---|---|---|
| dim_datetime | 1 minute | BIGINT (YYYYMMDDHHMM) | timestamp_utc | date_key, hour, minute, day_of_week, week, month, year, is_weekend |
| dim_date | 1 day | INT (YYYYMMDD) | date | day_of_week, week, month, year, is_weekend, is_holiday |
| dim_apartment | apartment | SERIAL | apartment_id | name, building_id, building_name, floor |
| dim_room | room | SERIAL | (room_name, apartment_key) | room_type, apartment_key (FK) |
| dim_device | device | SERIAL | (device_id, sensor_type) | room_key (FK), device_type, is_active |
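The smart-key convention from the table above can be sketched in two helpers (names are hypothetical): dim_datetime keys are BIGINTs of the form YYYYMMDDHHMM, dim_date keys are INTs of the form YYYYMMDD.

```python
from datetime import datetime

# Hypothetical helpers illustrating the documented key formats.
def datetime_key(ts: datetime) -> int:
    """YYYYMMDDHHMM surrogate key for dim_datetime."""
    return int(ts.strftime("%Y%m%d%H%M"))

def date_key(ts: datetime) -> int:
    """YYYYMMDD surrogate key for dim_date."""
    return int(ts.strftime("%Y%m%d"))

ts = datetime(2023, 8, 18, 10, 7)
print(datetime_key(ts))  # 202308181007
print(date_key(ts))      # 20230818
```

Keys like these sort chronologically and let fact rows join to both time dimensions without a lookup.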

gold schema — fact tables

| table | grain | PK | FK references | measures |
|---|---|---|---|---|
| fact_energy_minute | 1 min / device | (datetime_key, device_key) | dim_datetime, dim_date, dim_device, dim_room, dim_apartment | power_w, energy_wh, energy_kwh, cost_chf, counter_total, is_valid |
| fact_environment_minute | 1 min / room | (datetime_key, room_key) | dim_datetime, dim_date, dim_room, dim_apartment | temperature_c, humidity_pct, co2_ppm, noise_db, pressure_hpa, window_open_flag, door_open_flag, is_anomaly |
| fact_presence_minute | 1 min / room | (datetime_key, room_key) | dim_datetime, dim_date, dim_room, dim_apartment | motion_count, door_open_flag, presence_flag, presence_prob (NULL until ML) |
| fact_device_health_day | 1 day / device | (date_key, device_key) | dim_date, dim_device, dim_room, dim_apartment | error_count, missing_readings, uptime_pct, battery_min_pct, battery_avg_pct |

Indexes: All fact tables have indexes on date_key and apartment_key for fast BI filtering. Energy also indexed on room_key.

10

Scripts Reference

| script | path | description | flags |
|---|---|---|---|
| watcher.py | ingestion/fast_flow/ | Main loop: sensor ingestion (60s) + weather (daily) | --scan \| --weather |
| bulk_to_bronze.py | ingestion/fast_flow/ | Copy sensor JSON from SMB to Bronze | --full |
| flatten_sensors.py | etl/bronze_to_silver/ | Bronze JSON to silver.sensor_events | |
| create_silver.py | etl/bronze_to_silver/ | Create Silver schema + tables | |
| import_mysql_to_silver.py | etl/bronze_to_silver/ | MySQL dimensions to Silver | |
| weather_download.py | ingestion/slow_flow/ | Download weather CSV from sFTP | |
| clean_weather.py | etl/bronze_to_silver/ | Clean weather CSV to silver.weather_clean | |
| create_gold.py | etl/silver_to_gold/ | Create Gold star schema | |
| populate_gold.py | etl/silver_to_gold/ | Populate Gold dimensions + facts | |
11

Configuration

Security: The .env file contains credentials and must never be committed to git. It is listed in .gitignore. Share securely (e.g. via Teams, encrypted channel).

| variable | required | description |
|---|---|---|
| DB_URL | yes | PostgreSQL app user connection string |
| DB_ADMIN_URL | optional | Admin connection for schema/DB creation (first run only) |
| MYSQL_URL | yes | MySQL source for dimension tables (school network) |
| SMB_PATH | yes | Mounted SMB share path (sensor JSON files) |
| BRONZE_ROOT | yes | Local Bronze storage folder |
| SFTP_HOST | optional | sFTP server for weather data |
| SFTP_PORT | optional | sFTP port (default: 22) |
| SFTP_USER | optional | sFTP username |
| SFTP_PASSWORD | optional | sFTP password |
| SFTP_PATH | optional | Remote sFTP directory (e.g. /Meteo2) |
| WEATHER_MIN_YEAR | optional | Ignore weather data before this year (default: 2023) |
| WEATHER_HOUR | optional | Hour to trigger daily weather pipeline (default: 7) |
| WEATHER_MIN | optional | Minute to trigger daily weather pipeline (default: 30) |
| LOG_DIR | optional | Directory for ETL log files (default: logs/) |
| SFTP_MAX_RETRIES | optional | Max sFTP connection retries (default: 3) |
| SFTP_RETRY_DELAY | optional | Seconds between retries (default: 600) |
12

Deployment

Development

Database: domotic_dev

Config: .env

Purpose: testing changes, schema experiments

Production

Database: domotic_prod

Config: .env.prod

Purpose: BI dashboards connect here, stable data

deployment steps (dev to prod)

1. Pull latest code on VM: git pull origin main
2. Run schema scripts against prod: python create_silver.py / create_gold.py
3. Run import_mysql_to_silver.py against prod (refreshes dimension tables)
4. Run populate_gold.py against prod (rebuilds fact tables from Silver)
5. Verify row counts: SELECT COUNT(*) FROM gold.fact_energy_minute;
6. Restart watcher if running: stop old process, start new one with .env.prod

13

Services & Processes

| service | description | start | stop |
|---|---|---|---|
| watcher.py | Main pipeline orchestrator | python ingestion/fast_flow/watcher.py | Ctrl+C |
| PostgreSQL | Silver + Gold storage | Windows service (auto) | services.msc |
| Next.js site | Documentation website | Auto-deploys on git push (Vercel) | n/a |

Normal operation: Only the watcher needs to be running. PostgreSQL starts automatically as a Windows service. The website is hosted on Vercel and deploys automatically on git push to main.

14

Monitoring & Logs

| log file | source | what to check |
|---|---|---|
| logs/weather_download.log | weather_download.py | sFTP connection errors, download failures, file counts |
| logs/clean_weather.log | clean_weather.py | Row drop counts per step, outlier counts, processing errors |
| stdout (terminal) | watcher.py | Pipeline runs, skip counts, timing, weather triggers |
| stdout (terminal) | flatten_sensors.py | Batch progress, row counts, errors per file |

health checks

1. Is the watcher running? Check the terminal or process list for python watcher.py
2. Is data flowing? SELECT MAX(timestamp) FROM silver.sensor_events -- should be recent
3. Is Gold up to date? SELECT COUNT(*) FROM gold.fact_energy_minute -- should match expectations
4. Any weather errors? Check logs/weather_download.log for recent entries
5. Disk space? Check storage\bronze size -- grows ~2GB/month

15

Troubleshooting

| problem | cause | fix |
|---|---|---|
| Watcher says 'SMB path not found' | Z: drive not mounted | Map network drive in File Explorer or check VPN |
| flatten_sensors.py is slow | First run processes ~245k files | Normal -- takes ~3.5h. Subsequent runs are seconds. |
| Schema creation fails with permission error | App user can't create schema | Set DB_ADMIN_URL in .env with superuser credentials |
| weather_download.py connection refused | sFTP unreachable or wrong credentials | Check SFTP_HOST, SFTP_USER, SFTP_PASSWORD. Check VPN. |
| populate_gold.py returns 0 rows | Silver is empty | Run flatten_sensors.py first |
| Duplicate key errors in upserts | Normal behavior | ON CONFLICT handles this. Not an error. |
| Log files not appearing | logs/ directory issue | Scripts auto-create it. Check LOG_DIR and write permissions. |
16

Maintenance

| task | frequency | steps |
|---|---|---|
| Rotate database password | Quarterly | Update PostgreSQL password, update DB_URL in .env, restart watcher |
| Rotate sFTP credentials | On change | Update SFTP_USER / SFTP_PASSWORD in .env, restart watcher |
| Check disk space (Bronze) | Monthly | Bronze grows ~2GB/month. Archive old years if needed. |
| PostgreSQL VACUUM | Weekly (auto) | autovacuum enabled by default. Manual: VACUUM ANALYZE silver.sensor_events; |
| Review log files | Weekly | Check logs/weather_download.log and logs/clean_weather.log |
| Backup database | Daily | pg_dump -Fc domotic_prod > backup_YYYYMMDD.dump |
17

Security

Credentials

All credentials stored in .env (gitignored). No hardcoded passwords in source code. Shared via secure channels only (Teams, not email).

SQL injection

All database queries use SQLAlchemy parameterized statements. No string concatenation or f-strings in SQL.

Data privacy (GDPR)

MySQL tables containing personal data (users, relationships, emails, passwords) are explicitly skipped during import. Apartment names are used as pseudonymized identifiers.

Network

VM runs on school internal network. SMB and MySQL are internal-only. sFTP weather uses encrypted SSH transport (paramiko). No public endpoints.

18

Architecture Decisions

ADR-001: Python 3.11 + asyncio for ingestion

Context: Need to process 245k+ files from SMB with I/O-bound operations (file copy, DB writes).

Decision: Use Python with ThreadPoolExecutor for parallel file copy and ProcessPoolExecutor for CPU-bound JSON parsing. asyncio reserved for future real-time ingestion.

Consequence: Team familiarity, rich ecosystem (pandas, SQLAlchemy, paramiko). Acceptable performance for batch processing.

ADR-002: PostgreSQL for Silver and Gold

Context: Need a relational DB that supports schema separation, OLAP queries, and Power BI connector.

Decision: PostgreSQL 15+ with separate silver and gold schemas in the same database. Two databases for dev/prod isolation.

Consequence: Free, multi-user, Power BI native connector, excellent OLAP performance. Hosted on the VM.

ADR-003: Custom watcher over Airflow

Context: Evaluated Apache Airflow for orchestration, but single-VM deployment made it overkill.

Decision: Lightweight Python watcher script with 60s loop, prediction-based file discovery, and daily weather trigger.

Consequence: Zero infrastructure overhead, easy to understand and maintain. Trade-off: no built-in alerting or DAG visualization.

ADR-004: File system for Bronze

Context: Need immutable, auditable raw data storage that is easy to inspect and replay.

Decision: Local NTFS file system with timestamped directory structure (YYYY/MM/DD/HH/).

Consequence: Zero overhead, easy to browse in File Explorer, cloud-agnostic. Trade-off: no built-in versioning.