data-cycle / uc2
Technical Documentation
Apartment Domotics & IoT Sensors — End-to-end data pipeline
version
1.0
date
April 2, 2026
team
Group 14, HES-SO Valais
project
64-61 Data Cycle
System Overview
The Data Cycle project (Use Case 2) processes IoT sensor data from two smart apartments in Valais, Switzerland. The system ingests raw sensor readings (temperature, humidity, CO2, motion, door/window, energy consumption) and weather forecasts, transforms them through a medallion architecture (Bronze, Silver, Gold), and serves them to BI dashboards and ML models.
| Project | Data Cycle -- UC2 Apartment Domotics & IoT Sensors |
| Team | Group 14 -- HES-SO Valais, Haute Ecole de Gestion, 2026 |
| Data period | August 18 -- October 27, 2023 |
| Apartments | 2 (Jimmy, Jeremie) -- ~245,000 JSON sensor files |
| Data volume | 15M+ rows in Silver, 6 sensor types, 4 weather measurements |
| VM | Windows 11, on-premise (school network) |
| Repository | github.com/dehlya/data-cycle-domotic |
Architecture
The system follows the medallion architecture pattern with three data layers, each serving a distinct purpose. Data flows from external sources through Bronze (raw), Silver (cleaned), and Gold (aggregated) before reaching BI tools and ML models.
| layer | purpose | storage | grain |
|---|---|---|---|
| Bronze | Raw, immutable copy of source data | File system (NTFS) | Per-file (JSON, CSV) |
| Silver | Cleaned, deduplicated, full-resolution data | PostgreSQL (silver schema) | Per-reading (1 row per sensor per minute) |
| Gold | Aggregated, business-ready star schema | PostgreSQL (gold schema) | Per-minute (facts) and per-day (health) |
Key principle—Every ETL script is idempotent and resume-capable. Scripts can be interrupted and re-run safely without creating duplicates, thanks to watermark tracking and ON CONFLICT upserts.
Technology Stack
| layer | technology | version | purpose |
|---|---|---|---|
| Ingestion | Python | 3.11+ | All ETL scripts, async I/O, parallel processing |
| Orchestration | watcher.py | custom | 60s loop, prediction-based, nightly scan, daily weather |
| Bronze storage | NTFS file system | Windows 11 | Timestamped directory structure, immutable raw files |
| Silver / Gold | PostgreSQL | 15+ | Relational DB with schema separation (silver, gold) |
| ETL | pandas + SQLAlchemy | 2.1 / 2.0 | DataFrame transforms, parameterized SQL, connection pooling |
| sFTP | paramiko | 3.4 | SSH transport for weather CSV download |
| BI | Power BI + SAP SAC | latest | Energy/environment dashboards + presence analytics |
| ML | KNIME | latest | Energy forecasting, presence prediction workflows |
| Website | Next.js + Tailwind | 14.1 / 3 | Project documentation and team hub |
Data Sources
source 1 — sensor JSON (SMB share)
Two apartments (Jimmy, Jeremie) generate one JSON file per minute per apartment, stored on an SMB network share (Z:\). Each file contains nested sensor readings: plugs, doors/windows, motions, meteo stations, humidity sensors, and consumption meters.
Filename format—DD.MM.YYYY HHMM_ApartmentName_received.json — e.g. 18.08.2023 1000_JimmyLoup_received.json
| apartment (filename) | mapped to | bronze path |
|---|---|---|
| JimmyLoup | jimmy | bronze/jimmy/YYYY/MM/DD/HH/ |
| JeremieVianin | jeremie | bronze/jeremie/YYYY/MM/DD/HH/ |
JSON structure (top-level fields)
| field | type | description |
|---|---|---|
| user | string | Apartment owner identifier |
| api_token | string | Auth token (masked/removed in Silver) |
| datetime | string | Collection timestamp — DD.MM.YYYY HH:MM |
| plugs | object | Smart plugs keyed by room name |
| doorsWindows | object | Door/window sensors keyed by room name |
| motions | object | Motion sensors keyed by room name |
| meteos | object | Environmental sensors nested under meteo key |
| humidities | object | Humidity sensors keyed by room name |
| consumptions | object | Whole-house energy consumption keyed by House |
sensor types — fields extracted per category
| sensor type | JSON key | fields extracted | units |
|---|---|---|---|
| plug | plugs | power, total, temperature | W, Wh, °C |
| door | doorsWindows (type=door) | open, battery | bool, % |
| window | doorsWindows (type=window) | open, battery | bool, % |
| motion | motions | motion, light, temperature | bool, lux, °C |
| meteo | meteos.meteo | temperature_c, co2_ppm, humidity_pct, noise_db, pressure_hpa, battery | °C, ppm, %, dB, hPa, % |
| humidity | humidities | temperature, humidity, battery | °C, %, % |
| consumption | consumptions | total_power, power1-3, current1-3, voltage1-3 | W, A, V |
source 2 — MySQL database (school network)
Static reference tables from the school’s MySQL database (pidb at 10.130.25.152:3306). Contains building metadata, room details, sensor mappings, device inventories, and energy profiles.
| MySQL table | Silver table | description |
|---|---|---|
| buildings | dim_buildings | Apartment metadata, location, building year |
| buildingtype | dim_building_types | Maison / Appartement lookup |
| rooms | dim_rooms | Room details, sensor counts, orientation, m² |
| sensors | dim_sensors | Sensor IPs mapped to rooms |
| devices | dim_devices | Appliances per room (fridge, washer, etc.) |
| profilereference | ref_energy_profiles | Reference energy consumption kWh/yr |
| profile | ref_power_snapshots | Power consumption snapshots over time |
| parameters | ref_parameters | Threshold configs per building |
| parameterstype | ref_parameters_type | Parameter type lookup |
| dierrors | log_sensor_errors | Sensor error logs — null values, failures |
GDPR—Tables skipped: users (names, emails, passwords, phones), actions, achievements, badges, events, eventsgeneric, categories, userrelationships — personal data or irrelevant for analytics.
source 3 — weather CSV (sFTP)
Daily weather forecast files from an sFTP server (/Meteo2). File format: Pred_YYYY-MM-DD.csv. Each CSV has columns: Time, Value, Prediction, Site, Measurement, Unit.
| measurement code | mapped column | unit |
|---|---|---|
| PRED_T_2M_ctrl | temperature_c | °C |
| PRED_RELHUM_2M_ctrl | humidity_pct | % |
| PRED_TOT_PREC_ctrl | precipitation_mm | mm |
| PRED_GLOB_ctrl | radiation_wm2 | W/m² |
Sentinel value—-99999.0 in the Value column means missing data. Converted to NULL during cleaning.
Pipeline: Sources to Bronze
watcher.py — orchestrator
Continuous loop running every 60 seconds. Instead of scanning 245k+ files on SMB (~72s), it predicts the next expected filenames from the last known timestamp and checks with .exists() calls (~0.01s). A nightly full scan at midnight catches any missed files. Daily at 07:30, it triggers the weather pipeline as a subprocess.
bulk_to_bronze.py — sensor file copy
Copies new JSON files from SMB to local Bronze storage. Two modes: prediction (default, fast) and full scan (--full flag). Uses 16 parallel threads. Files are stored in timestamped folders: bronze/jimmy/YYYY/MM/DD/HH/filename.json. Skips files that already exist locally.
weather_download.py — sFTP download
Connects to the sFTP server via paramiko, lists CSV files in the remote directory, downloads new ones to bronze/weather/YYYY/MM/DD/. Configurable retry logic (default: 3 attempts, 600s delay). Logs to logs/weather_download.log.
| metric | prediction mode | full scan mode |
|---|---|---|
| Files on SMB | 246,000 | 246,000 |
| New files = 2 | 0.4s | ~80s |
| New files = 1,245 | 20s | ~85s |
Pipeline: Bronze to Silver
flatten_sensors.py — JSON to sensor_events
Parses Bronze JSON files, extracts all sensor readings, normalizes room names (Bhroom to Bathroom, Bdroom to Bedroom, etc.), flags outliers, and upserts into silver.sensor_events. Uses 8 parallel workers with batches of 2,000 files. Watermark tracking via silver.etl_watermark prevents reprocessing.
Load watermark -- reads processed filenames from silver.etl_watermark
Find new files (newest first) -- walks Bronze folders backwards, stops after 50 consecutive already-processed files
Parallel processing -- splits into batches of 2,000, processed by 8 workers (ProcessPoolExecutor)
Each worker: opens JSON, parses datetime, calls flatten() to extract all sensor readings
Upsert -- inserts into silver.sensor_events with ON CONFLICT DO UPDATE
Mark done -- adds processed filenames to the watermark table
room normalization
Raw JSON room names are inconsistent. The flatten function maps them to standardized names:
| raw name | normalized |
|---|---|
| Bhroom | Bathroom |
| Bdroom | Bedroom |
| Livingroom | Living Room |
| Office | Office |
| Kitchen | Kitchen |
| Laundry | Laundry |
| Outdoor | Outdoor |
| House | House |
outlier detection bounds
Values outside these physical bounds are flagged with is_outlier = TRUE but not removed. This preserves data for audit while allowing BI/Gold to filter.
| field | min | max | unit |
|---|---|---|---|
| temperature_c | -20 | 60 | °C |
| humidity_pct | 0 | 100 | % |
| co2_ppm | 300 | 5,000 | ppm |
| noise_db | 0 | 140 | dB |
| pressure_hpa | 870 | 1,085 | hPa |
| power | 0 | 10,000 | W |
| battery | 0 | 100 | % |
import_mysql_to_silver.py — dimension tables
Snapshots 10 reference tables from the school MySQL database into PostgreSQL Silver. All columns imported as TEXT (type casting done in Gold). Idempotent: DROP + CREATE on each run.
clean_weather.py — weather CSV processing
Reads raw weather CSVs from Bronze, validates required columns (Time, Value, Prediction, Site, Measurement, Unit), cleans timestamps as UTC, filters by WEATHER_MIN_YEAR, deduplicates by timestamp/site/measurement, maps 4 measurement types to standardized columns, replaces sentinel values (-99999.0) with NULL, pivots from long to wide format, flags outliers, and upserts into silver.weather_clean. Watermark via silver.weather_watermark. Logs row-drop counts at each step to logs/clean_weather.log.
weather outlier bounds
| field | min | max |
|---|---|---|
| temperature_c | -50 | 60 |
| humidity_pct | 0 | 100 |
| precipitation_mm | 0 | 500 |
| radiation_wm2 | 0 | 1,500 |
| scenario | files | rows | time |
|---|---|---|---|
| First run (full backfill) | 245,000 | 15M+ | ~3.5 hours |
| Incremental (after watcher) | 2-20 | 100-1,500 | 2-4 seconds |
| Nothing new | 0 | 0 | 1-2 seconds |
Pipeline: Silver to Gold
create_gold.py — schema creation
Creates the gold schema with admin privileges, then creates all dimension and fact tables as the app user (so the app user owns them). Idempotent: uses CREATE TABLE IF NOT EXISTS. Creates indexes on date_key, apartment_key, room_key for fast BI queries.
populate_gold.py — 9-step process
dim_date -- extract unique dates from sensor_events, compute day_of_week, week, month, year, is_weekend
dim_datetime -- extract minute-level timestamps, truncate to minute grain, compute time attributes
dim_apartment -- distinct apartments, enriched from silver.dim_buildings (building_name, building_id)
dim_room -- distinct rooms per apartment, linked via apartment_key
dim_device -- synthetic device_id (apartment_room_sensortype), linked to room_key
fact_energy_minute -- GROUP BY minute/device: MAX(power), MAX(total energy), Wh to kWh conversion
fact_environment_minute -- GROUP BY minute/room: MAX per metric, BOOL_OR for window/door flags
fact_presence_minute -- GROUP BY minute/room: SUM(motion), BOOL_OR(motion OR door open) = presence
fact_device_health_day -- GROUP BY day/device: MIN/AVG battery percentage
Pattern—All fact tables use GROUP BY date_trunc('minute', timestamp) with CASE WHEN pivots. Upserts via ON CONFLICT DO UPDATE. Currently a full reload each run -- future: watermark-based incremental.
Pipeline: Gold to BI / ML
BI dashboards (planned)
| dashboard | tool | fact table | key metrics |
|---|---|---|---|
| Energy | Power BI | fact_energy_minute | Power (W), energy (kWh), cost (CHF) |
| Environment | Power BI | fact_environment_minute | Temperature, humidity, CO2, noise, pressure |
| Presence | SAP Analytics Cloud | fact_presence_minute | Motion count, presence flag, door activity |
| Sensor Health | Power BI | fact_device_health_day | Battery levels, uptime, error counts |
row-level security (RLS)
Each apartment owner only sees their own data. Implemented via Power BI RLS roles filtering on apartment_key. Three roles: role_jimmy, role_jeremie (filtered), and role_admin (unfiltered, all apartments).
ML models
Built with KNIME Analytics Platform. Two use cases at the 15-minute room grain: room occupancy prediction and room energy consumption prediction. Each use case is implemented as a benchmark workflow (model comparison & selection) plus a prediction workflow (operational scoring & database writing). Results land in gold.fact_prediction. See Section 09 for the full methodology.
ML Workflows (KNIME)
Context and scope
This section presents the KNIME workflows developed for the Data Cycle Project in order to address two analytical use cases at room level:
Occupancy prediction, based on motion-related sensor events.
Energy consumption prediction, based on room-level power measurements enriched with weather forecast variables.
For each use case, the implementation follows a two-workflow logic:
A benchmark workflow, whose role is to prepare the dataset, compare several candidate models, and select the most suitable one.
A prediction workflow, whose role is to reuse the validated preparation logic, apply the selected model, reconstruct a business-readable output, and write the prediction results into PostgreSQL.
The four workflows documented in this report are therefore the following: Motion_Benchmark, Motion_Prediction, Consumption_Benchmark, Consumption_Prediction. This separation between benchmarking and prediction reflects a deliberate methodological choice: the benchmark workflow is used for model comparison and selection, whereas the prediction workflow is intended for operational scoring and database writing.
Part I — Occupancy prediction workflows
1. Business objective
The objective of the first use case is to predict whether a room is occupied during a 15-minute time window. Since the source system does not provide a direct occupancy indicator, occupancy is inferred from motion sensor events.
The target variable is defined as follows:
occupied = 1 if at least one motion = 1 event is observed within the 15-minute window.
occupied = 0 otherwise.
Consequently, each final observation represents one room, in one apartment, at one specific 15-minute timestamp, with an associated binary occupancy status.
1.1 Selection of relevant rooms
Following exploratory analysis, only the following rooms were retained for the occupancy model: Kitchen, Living Room, Office. The remaining rooms were excluded because they did not provide a sufficiently consistent motion signal to support a reliable target construction.
2. Source data and selected variables
The main source table for the occupancy use case is sensor_events. The useful fields selected from this source include apartment, room, sensor_type, field, value, unit, timestamp, and is_outlier. Initial filtering removes outliers, external zones, and all records not relevant to occupancy modeling.
3. Shared preparation logic for Motion workflows
The benchmark and prediction workflows for occupancy share the same preparation logic. This common part ensures that the selected model is trained and later applied on data built according to the same transformation rules.
3.1 Database access
The workflow begins with the standard database access chain: PostgreSQL Connector → DB Table Selector → DB Row Filter → DB Column Filter → DB Reader. This stage isolates the relevant subset of sensor event data required for subsequent local KNIME transformations.
3.2 Time standardisation
A Time Rounder node is used to round the raw event timestamp to a 15-minute grain. This operation defines the future modeling grain as the combination of apartment, room, rounded timestamp.
3.3 Construction of the base grid
A GroupBy node is used to build a base grid containing all distinct combinations of apartment, room, 15-minute window. This base grid acts as the backbone of the final dataset. It ensures that all subsequent aggregated signals are aligned on the same temporal structure.
3.4 Construction of the occupancy label
A dedicated motion branch is created from the filtered source data: Row Filter on sensor_type = motion and field = motion → Rule Engine → GroupBy → Column Renamer. The logic is the following:
A motion event is converted into motion_flag = 1 when value = 1, otherwise 0.
occupied is obtained through MAX(motion_flag) over the 15-minute window.
motion_count is obtained through SUM(motion_flag).
This procedure yields a binary and traceable occupancy target derived from observable sensor behaviour.
3.5 Integration of complementary behavioural signals
Two additional branches are prepared to enrich the occupancy dataset.
Door branch. This branch filters the source data on sensor_type = door and field = open, creates a binary open flag, and aggregates it into door_open_count and door_open_any.
Window branch. This branch follows the same logic for windows and produces window_open_count and window_open_any.
Each aggregated branch is reintegrated into the base grid through a Left Join on apartment, room, timestamp (Rounded). Missing values are replaced with 0 usingMissing Value, so that the absence of an event is interpreted as absence of activity rather than missing data.
3.6 Time-derived explanatory variables
Time-derived features are created using Date&Time Part Extractorand Rule Engine. The resulting variables are: Hour, Day of Week (Number), Day of Week (Name), Month (Number), Month (Name), is_weekend. These variables allow the model to capture temporal regularities in room occupancy.
3.7 Room filtering for modeling
A Row Filter is used to retain only the rooms previously identified as usable for occupancy prediction: Kitchen, Living Room, Office.
3.8 Lag-based temporal memory
The following nodes are used to generate lag features: Sorter → Lag Column → Column Renamer → Missing Value. The following lag variables are created:occupied_lag_1, motion_count_lag_1,door_open_count_lag_1, window_open_count_lag_1. These features encode the immediate past state of the room and provide short-term temporal memory to the model.
3.9 Column selection for machine learning
A Column Filter is then used to retain only the useful modeling variables: occupied, apartment, room, Hour, Day of Week (Number), is_weekend, occupied_lag_1, motion_count_lag_1, door_open_count_lag_1, window_open_count_lag_1.
3.10 Encoding of categorical variables
A One to Many node is applied to apartment and room. This transformation converts categories into binary indicators and avoids introducing an artificial ordinal relationship between rooms or apartments.
3.11 Target formatting
Because the selected learners are nominal classification models, the targetoccupied is converted from numeric to string using aNumber to String node.
4. Workflow 1 — Motion_Benchmark
The purpose of the benchmark workflow is to compare several classification models and select the most suitable one for room occupancy prediction.
Train/test split. A Table Partitioner node is used to split the dataset into 80% training data and 20% test data. The selected strategy is First rows, which is consistent with the temporal nature of the use case because the dataset is chronologically ordered before partitioning, thereby preserving a sequential split between past observations and more recent observations.
Candidate models. Three classification branches are prepared: Logistic Regression, Decision Tree, Random Forest.
Evaluation procedure. Each branch follows the same evaluation logic: Learner → Predictor → Scorer → ROC Curve (JavaScript). The benchmark results are consolidated into a comparison table.
Model selection criterion. The final model is selected on the basis of AUC.
Selected model—logistic_regression — identified by the benchmark as the best-performing model for the occupancy use case.
Methodological role of the workflow. The benchmark workflow is used exclusively to prepare the training and test data, compare alternative learners, evaluate model performance, and justify the final model selection. It is not used for database writing.
5. Workflow 2 — Motion_Prediction
The purpose of the prediction workflow is to apply the selected occupancy model to prepared data, reconstruct a readable business output, and write the final prediction table into PostgreSQL.
Shared logic with the benchmark workflow. This workflow reuses the same preparation chain as the benchmark workflow (data access, time rounding, base grid construction, target and feature engineering, room filtering, categorical encoding). This ensures strict consistency between model training and operational scoring.
Model application. The selected learner and predictor are Logistic Regression Learner and Logistic Regression Predictor. The predictor generates the column Prediction (occupied).
Reconstruction of a business-readable output. AJoiner is used to reattach readable business columns to the prediction result, including apartment, room, timestamp (Rounded), observed occupied value. The join is performed on RowID, assuming row identity has been preserved throughout the preparation pipeline.
Final table prepared for PostgreSQL. After renaming and enrichment, the final table contains the following columns. The constantsmodel_name = logistic_regression and target = Presenceare added.
| column | description |
|---|---|
| predicted_occupied | Model output for the 15-min window (0/1) |
| apartment | Readable apartment ID |
| room | Kitchen / Living Room / Office |
| timestamp_rounded | Start of the 15-minute window (UTC) |
| actual_occupied | Observed value from motion signal |
| model_name | Constant: logistic_regression |
| target | Constant: Presence |
Database writing. The final prediction table is written into PostgreSQL using PostgreSQL Connector → DB Writer. A possible target table isfact_prediction.
Business meaning of the result. Each written row represents one room, in one apartment, at one specific 15-minute timestamp, with the predicted occupancy, the observed occupancy, and the model identifier. This makes it possible to answer the operational question: “Will this room be occupied at this precise timestamp?” The resulting table can subsequently be aggregated in Power BI to answer broader analytical questions such as room occupancy patterns by day, hour, or apartment.
Part II — Energy consumption prediction workflows
6. Business objective
The objective of the second use case is to predict the average room energy consumption over a 15-minute time window. The target is defined astarget_power_w = average power_w observed during the 15-minute window. Each observation therefore represents one room, in one apartment, at one precise 15-minute timestamp, with an observed mean power value. The final model aims to answer the following question: “What is the expected average power consumption of this room at this precise timestamp?”
7. Source data and selected variables
The main source table for the energy use case is fact_energy_minute. The workflow also uses the following supporting tables: dim_datetime,dim_room, dim_apartment, fact_weather_hour.
7.1 Selected variables from fact_energy_minute
The useful columns are datetime_key, room_key, apartment_key, power_w. A first filtering step keeps only valid measurements: is_valid = TRUE andpower_w IS NOT NULL.
7.2 Role of the dimensions
The dimension tables are used as follows: dim_datetime provides the actual timestamp via timestamp_utc; dim_room provides the readable room label room_name; dim_apartmentprovides the readable apartment label apartment_id.
7.3 Weather source
The weather-enrichment branch uses fact_weather_hour. Selected weather variables are prediction_date, temperature_c, humidity_pct, precipitation_mm, radiation_wm2. These variables are not treated as observed weather only, but as forecast-related explanatory variables used to improve future-oriented consumption prediction.
8. Shared preparation logic for Consumption workflows
The benchmark and prediction workflows for energy consumption share the same preparation logic.
8.1 Data access and first joins
The workflow begins with the following chain: PostgreSQL Connector → DB Table Selector → DB Row Filter → DB Column Filter → DB Joiner with dim_datetime → DB Joiner with dim_room → DB Joiner with dim_apartment → DB Reader. The objective is to extract a minimal but readable energy dataset before local time transformations.
8.2 Construction of the 15-minute energy target
The following nodes are used: Time Rounder → GroupBy → Column Renamer. The minute-level timestamp is first rounded to the 15-minute grain. AGroupBy then aggregates all measurements per room_key, apartment_key, timestamp_rounded. The target is created astarget_power_w = AVG(power_w). The workflow also keeps room_name, apartment_id, reading_count. The reading_count variable allows the workflow to monitor the number of minute-level values that contributed to each 15-minute aggregated observation.
8.3 Weather forecast preparation and anti-leakage logic
A dedicated weather branch is built in parallel.
8.3.1 Weather branch preparation. The branch begins with DB Table Selector on fact_weather_hour → DB Column Filter → DB Joiner with dim_datetime → DB Reader → Column Renamer. This createsweather_target_hour_ts, which represents the hour for which the weather forecast is valid.
8.3.2 Forecast-date control. Because several forecasts exist for the same target weather hour, a filtering rule is required to prevent information leakage. The following nodes are used: Date&Time to String → String to Date&Time → Rule Engine → Row Filter. The implemented rule keeps only weather rows such that prediction_date < target weather date. This implies that forecasts produced on the same day as the target hour are excluded, and only forecasts produced on previous days are accepted. This choice was made in order to reduce the risk of using information that would be unrealistically close to the predicted time.
8.3.3 Selection of the most recent valid forecast. Among the remaining admissible forecast rows, the workflow retains the most recent one for each target weather hour. This is achieved through Sorter → GroupBy → Column Renamer. The sorting order is (1) weather_target_hour_ts ascending, (2) prediction_date descending. The subsequent GroupBy keeps the first valid weather record for each target hour, thereby selecting the most recent admissible forecast.
8.4 Join between weather and energy data
On the energy branch, a second hourly rounding is created fromtimestamp_rounded using a Time Rounder. This producesweather_hour_ts. A Joiner is then used to attach the selected weather forecast to the energy dataset onweather_hour_ts = weather_target_hour_ts. Rows without weather information are removed after the join.
8.5 Time-derived explanatory variables
The enriched energy dataset is further processed usingDate&Time Part Extractor and Rule Engine. The resulting time-based variables are: hour, day_of_week_number, month_number, is_weekend.
8.6 Lag-based temporal memory
The temporal structure of the energy dataset is handled through Sorter → Group Loop Start → multiple Lag Column nodes → Loop End → Column Renamer → Row Filter. Lag computation is done inside grouped loops in order to prevent contamination across rooms and apartments.
The selected lag variables are power_lag_1,power_lag_4, power_lag_96, power_lag_672. These lags capture respectively the previous 15-minute observation, the previous hour, the same time one day earlier, and the same time one week earlier. Rows for which the required lag history is not available are removed before modeling.
8.7 Column selection for machine learning
A Column Filter is used to retain the explanatory variables required for the learners. The retained set includes: target_power_w, hour, day_of_week_number, month_number, is_weekend, power_lag_1, power_lag_4, power_lag_96, power_lag_672, temperature_c, humidity_pct, precipitation_mm, radiation_wm2, room_name, apartment_id. Technical columns used only for joining or control are removed before the learners.
8.8 Encoding of categorical variables
A One to Many node is applied to room_name and apartment_id. This transformation enables the regression models to exploit room-specific and apartment-specific effects.
9. Workflow 3 — Consumption_Benchmark
The benchmark workflow is used to compare several regression models and select the best one for room-level energy consumption prediction.
Train/test split. A Table Partitioner node is used with the following configuration: 80% training data, 20% test data, strategyFirst rows. This preserves a temporally coherent split between past observations used for training and more recent observations used for evaluation.
Candidate models. Three regression branches are prepared: Linear Regression, Simple Regression Tree, Tree Ensemble Regression.
Evaluation procedure. Each branch follows the same evaluation pattern: Learner → Predictor → Numeric Scorer. Two evaluation perspectives are considered.
9.4.1 Mean Signed Difference on the full test set. The full test set is used to compute the Mean Signed Difference (MSD). This metric indicates the direction of the average bias: a positive MSD means the model tends to underestimate actual consumption; a negative MSD means the model tends to overestimate actual consumption; a value close to zero indicates a limited global bias.
9.4.2 Mean Absolute Percentage Error on positive targets only.Because the Mean Absolute Percentage Error (MAPE) cannot be computed when the actual target is equal to zero, an additional filtered evaluation branch is built. A Row Filter first retains only rows for which the real target is strictly positive, after which a second Numeric Scorercomputes the MAPE on this subset. This approach does not modify the model itself. It only restricts the scope of the MAPE calculation to the rows where percentage error is mathematically meaningful.
Model selection criterion. The benchmark results are consolidated into a final comparison table. The selected model is chosen using the following two-step rule:
Retain only models for which MAPE < 10.
Among the remaining candidates, select the model with MSD closest to 0.
Selected model—linear_regression — best compromise between low percentage error and limited signed bias.
Methodological role of the workflow. This workflow is dedicated to model comparison, quality assessment, and final learner selection. It is not intended for database writing.
10. Workflow 4 — Consumption_Prediction
The objective of the prediction workflow is to apply the selected energy model, reconstruct a readable room-level prediction table, and write the results into PostgreSQL.
Shared logic with the benchmark workflow. This workflow reuses the same preparation logic as the benchmark workflow, including database extraction, 15-minute target construction, weather forecast enrichment, time-based features, lag engineering, and categorical encoding. This ensures that operational scoring is consistent with the benchmark preparation logic.
Model application. The selected modeling nodes are Linear Regression Learner and Regression Predictor. The predictor returns the predicted room-level power value.
Reconstruction of a business-readable output. AJoiner is used to reattach readable fields to the prediction result, such as apartment, room, timestamp_rounded, actual observed power value. This makes it possible to move from a purely technical scoring dataset to a table that can be directly interpreted by business or BI users.
Final table prepared for PostgreSQL. After renaming and enrichment, the final table contains the following columns. The constantsmodel_name = linear_regression and target = Consumptionare added.
| column | description |
|---|---|
| predicted_power_w | Model output for the 15-min window (W) |
| apartment | Readable apartment ID |
| room | Readable room name |
| timestamp_rounded | Start of the 15-minute window (UTC) |
| actual_power_w | Observed average power |
| model_name | Constant: linear_regression |
| target | Constant: Consumption |
Database writing. The final result is written into PostgreSQL using PostgreSQL Connector → DB Writer. A possible target table isfact_prediction.
Business meaning of the result. Each output row represents one room, in one apartment, at one specific 15-minute timestamp, with the predicted average power consumption, the observed average power consumption, and the model identifier. This allows the workflow to answer the operational question: “What is the predicted energy consumption of this room at this precise timestamp?” The resulting prediction table can later be aggregated in Power BI to analyse patterns such as room-level load, apartment-level usage, peak time windows, or the influence of weather on consumption.
11. Methodological synthesis
The four workflows follow a coherent methodological framework. First, each use case relies on a strict separation between benchmarking, which supports model comparison and justified selection, and prediction, which supports operational scoring and database integration. Second, the chosen grain of 15 minutes provides a balance between stability and temporal precision — it reduces minute-level noise while preserving a meaningful operational time resolution. Third, both use cases rely on explicit feature engineering: time-derived explanatory variables, lag features, categorical encoding, and, for energy consumption, weather forecast enrichment under an anti-leakage rule. Finally, the prediction workflows are designed not only to score data, but also to reconstruct business-readable outputs suitable for downstream reporting and dashboarding.
12. Executive summary
| workflow | goal |
|---|---|
| Motion_Benchmark | Compare several classification models, evaluate their quality, select the best one using AUC. |
| Motion_Prediction | Apply the selected occupancy model, rebuild a readable prediction table, write the results into PostgreSQL. |
| Consumption_Benchmark | Compare several regression models, evaluate them using MAPE and MSD, select the best one according to the defined decision rule. |
| Consumption_Prediction | Apply the selected energy model, rebuild a readable room-level prediction table, write the results into PostgreSQL. |
13. Final project result
At this stage, the project includes a complete benchmark workflow for room occupancy prediction, a complete prediction workflow for room occupancy scoring, a complete benchmark workflow for room energy consumption prediction, a complete prediction workflow for room energy scoring, PostgreSQL-ready prediction tables, and a solid foundation for downstream Power BI analysis.
Database Schemas
silver.sensor_events (main fact table)
| column | type | description |
|---|---|---|
| id | BIGSERIAL | Primary key |
| apartment | VARCHAR(20) | 'jimmy' or 'jeremie' |
| room | VARCHAR(50) | Normalized: 'Bathroom', 'Kitchen', 'Living Room', etc. |
| sensor_type | VARCHAR(20) | 'plug', 'motion', 'meteo', 'door', 'window', 'humidity', 'consumption' |
| field | VARCHAR(50) | 'power', 'temperature_c', 'co2_ppm', 'open', 'motion', etc. |
| value | FLOAT | The sensor reading |
| unit | VARCHAR(10) | 'W', '°C', 'ppm', '%', 'bool', etc. |
| timestamp | TIMESTAMPTZ | When the reading was taken (UTC) |
| is_outlier | BOOLEAN | TRUE if value outside physical bounds |
Constraints—UNIQUE (apartment, room, sensor_type, field, timestamp). Indexes on timestamp, apartment, sensor_type.
silver.weather_clean
| column | type | description |
|---|---|---|
| id | BIGSERIAL | Primary key |
| timestamp | TIMESTAMPTZ | Forecast timestamp (UTC) |
| site | VARCHAR(50) | Weather station name |
| temperature_c | FLOAT | Predicted temperature |
| humidity_pct | FLOAT | Predicted humidity |
| precipitation_mm | FLOAT | Predicted precipitation |
| radiation_wm2 | FLOAT | Predicted solar radiation |
| is_outlier | BOOLEAN | TRUE if any value outside bounds |
Constraints—UNIQUE (timestamp, site). Index on timestamp.
silver — other tables
| table | purpose | source |
|---|---|---|
| etl_watermark | Tracks processed sensor JSON filenames | flatten_sensors.py |
| weather_watermark | Tracks processed weather CSV filenames | clean_weather.py |
| dim_buildings | Building metadata | MySQL |
| dim_building_types | Maison / Appartement lookup | MySQL |
| dim_rooms | Room details, sensor counts, m² | MySQL |
| dim_sensors | Sensor IPs mapped to rooms | MySQL |
| dim_devices | Appliances per room | MySQL |
| ref_energy_profiles | Reference kWh/yr by type | MySQL |
| ref_power_snapshots | Power consumption snapshots | MySQL |
| ref_parameters | Threshold configs per building | MySQL |
| ref_parameters_type | Parameter type lookup | MySQL |
| log_sensor_errors | Sensor error logs | MySQL |
gold schema — dimension tables
| table | grain | PK | unique constraint | key columns |
|---|---|---|---|---|
| dim_datetime | 1 minute | BIGINT (YYYYMMDDHHMM) | timestamp_utc | date_key, hour, minute, day_of_week, week, month, year, is_weekend |
| dim_date | 1 day | INT (YYYYMMDD) | date | day_of_week, week, month, year, is_weekend, is_holiday |
| dim_apartment | apartment | SERIAL | apartment_id | name, building_id, building_name, floor |
| dim_room | room | SERIAL | (room_name, apartment_key) | room_type, apartment_key (FK) |
| dim_device | device | SERIAL | (device_id, sensor_type) | room_key (FK), device_type, is_active |
gold schema — fact tables
| table | grain | PK | FK references | measures |
|---|---|---|---|---|
| fact_energy_minute | 1 min / device | (datetime_key, device_key) | dim_datetime, dim_date, dim_device, dim_room, dim_apartment | power_w, energy_wh, energy_kwh, cost_chf, counter_total, is_valid |
| fact_environment_minute | 1 min / room | (datetime_key, room_key) | dim_datetime, dim_date, dim_room, dim_apartment | temperature_c, humidity_pct, co2_ppm, noise_db, pressure_hpa, window_open_flag, door_open_flag, is_anomaly |
| fact_presence_minute | 1 min / room | (datetime_key, room_key) | dim_datetime, dim_date, dim_room, dim_apartment | motion_count, door_open_flag, presence_flag, presence_prob (NULL until ML) |
| fact_device_health_day | 1 day / device | (date_key, device_key) | dim_date, dim_device, dim_room, dim_apartment | error_count, missing_readings, uptime_pct, battery_min_pct, battery_avg_pct |
Indexes—All fact tables have indexes on date_key and apartment_key for fast BI filtering. Energy also indexed on room_key.
Scripts Reference
| script | path | description | flags |
|---|---|---|---|
| watcher.py | ingestion/fast_flow/ | Main loop: sensor ingestion (60s) + weather (daily) | --scan | --weather |
| bulk_to_bronze.py | ingestion/fast_flow/ | Copy sensor JSON from SMB to Bronze | --full |
| flatten_sensors.py | etl/bronze_to_silver/ | Bronze JSON to silver.sensor_events | |
| create_silver.py | etl/bronze_to_silver/ | Create Silver schema + tables | |
| import_mysql_to_silver.py | etl/bronze_to_silver/ | MySQL dimensions to Silver | |
| weather_download.py | ingestion/slow_flow/ | Download weather CSV from sFTP | |
| clean_weather.py | etl/bronze_to_silver/ | Clean weather CSV to silver.weather_clean | |
| create_gold.py | etl/silver_to_gold/ | Create Gold star schema | |
| populate_gold.py | etl/silver_to_gold/ | Populate Gold dimensions + facts |
Configuration
Security—The .env file contains credentials and must never be committed to git. It is listed in .gitignore. Share securely (e.g. via Teams, encrypted channel).
| variable | required | description |
|---|---|---|
| DB_URL | yes | PostgreSQL app user connection string |
| DB_ADMIN_URL | optional | Admin connection for schema/DB creation (first run only) |
| MYSQL_URL | yes | MySQL source for dimension tables (school network) |
| SMB_PATH | yes | Mounted SMB share path (sensor JSON files) |
| BRONZE_ROOT | yes | Local Bronze storage folder |
| SFTP_HOST | optional | sFTP server for weather data |
| SFTP_PORT | optional | sFTP port (default: 22) |
| SFTP_USER | optional | sFTP username |
| SFTP_PASSWORD | optional | sFTP password |
| SFTP_PATH | optional | Remote sFTP directory (e.g. /Meteo2) |
| WEATHER_MIN_YEAR | optional | Ignore weather data before this year (default: 2023) |
| WEATHER_HOUR | optional | Hour to trigger daily weather pipeline (default: 7) |
| WEATHER_MIN | optional | Minute to trigger daily weather pipeline (default: 30) |
| LOG_DIR | optional | Directory for ETL log files (default: logs/) |
| SFTP_MAX_RETRIES | optional | Max sFTP connection retries (default: 3) |
| SFTP_RETRY_DELAY | optional | Seconds between retries (default: 600) |
Deployment
Development
Database: domotic_dev
Config: .env
Purpose: testing changes, schema experiments
Production
Database: domotic_prod
Config: .env.prod
Purpose: BI dashboards connect here, stable data
deployment steps (dev to prod)
Pull latest code on VM: git pull origin main
Run schema scripts against prod: python create_silver.py / create_gold.py
Run import_mysql_to_silver.py against prod (refreshes dimension tables)
Run populate_gold.py against prod (rebuilds fact tables from Silver)
Verify row counts: SELECT COUNT(*) FROM gold.fact_energy_minute;
Restart watcher if running: stop old process, start new one with .env.prod
Services & Processes
| service | description | start | stop |
|---|---|---|---|
| watcher.py | Main pipeline orchestrator | python ingestion/fast_flow/watcher.py | Ctrl+C |
| PostgreSQL | Silver + Gold storage | Windows service (auto) | services.msc |
| Next.js site | Documentation website | Auto-deploys on git push (Vercel) | n/a |
Normal operation—Only the watcher needs to be running. PostgreSQL starts automatically as a Windows service. The website is hosted on Vercel and deploys automatically on git push to main.
Monitoring & Logs
| log file | source | what to check |
|---|---|---|
| logs/weather_download.log | weather_download.py | sFTP connection errors, download failures, file counts |
| logs/clean_weather.log | clean_weather.py | Row drop counts per step, outlier counts, processing errors |
| stdout (terminal) | watcher.py | Pipeline runs, skip counts, timing, weather triggers |
| stdout (terminal) | flatten_sensors.py | Batch progress, row counts, errors per file |
health checks
Is the watcher running? Check the terminal or process list for python watcher.py
Is data flowing? SELECT MAX(timestamp) FROM silver.sensor_events -- should be recent
Is Gold up to date? SELECT COUNT(*) FROM gold.fact_energy_minute -- should match expectations
Any weather errors? Check logs/weather_download.log for recent entries
Disk space? Check storage\bronze size -- grows ~2GB/month
Troubleshooting
| problem | cause | fix |
|---|---|---|
| Watcher says 'SMB path not found' | Z: drive not mounted | Map network drive in File Explorer or check VPN |
| flatten_sensors.py is slow | First run processes ~245k files | Normal -- takes ~3.5h. Subsequent runs are seconds. |
| Schema creation fails with permission error | App user can't create schema | Set DB_ADMIN_URL in .env with superuser credentials |
| weather_download.py connection refused | sFTP unreachable or wrong credentials | Check SFTP_HOST, SFTP_USER, SFTP_PASSWORD. Check VPN. |
| populate_gold.py returns 0 rows | Silver is empty | Run flatten_sensors.py first |
| Duplicate key errors in upserts | Normal behavior | ON CONFLICT handles this. Not an error. |
| Log files not appearing | logs/ directory issue | Scripts auto-create it. Check LOG_DIR and write permissions. |
Maintenance
| task | frequency | steps |
|---|---|---|
| Rotate database password | Quarterly | Update PostgreSQL password, update DB_URL in .env, restart watcher |
| Rotate sFTP credentials | On change | Update SFTP_USER / SFTP_PASSWORD in .env, restart watcher |
| Check disk space (Bronze) | Monthly | Bronze grows ~2GB/month. Archive old years if needed. |
| PostgreSQL VACUUM | Weekly (auto) | autovacuum enabled by default. Manual: VACUUM ANALYZE silver.sensor_events; |
| Review log files | Weekly | Check logs/weather_download.log and logs/clean_weather.log |
| Backup database | Daily | pg_dump -Fc domotic_prod > backup_YYYYMMDD.dump |
Security
Credentials
All credentials stored in .env (gitignored). No hardcoded passwords in source code. Shared via secure channels only (Teams, not email).
SQL injection
All database queries use SQLAlchemy parameterized statements. No string concatenation or f-strings in SQL.
Data privacy (GDPR)
MySQL tables containing personal data (users, relationships, emails, passwords) are explicitly skipped during import. Apartment names used as pseudonymized identifiers.
Network
VM runs on school internal network. SMB and MySQL are internal-only. sFTP weather uses encrypted SSH transport (paramiko). No public endpoints.
GDPR & Ethics
The pipeline collects sensor data from real apartments, which means it processes personal information about identifiable households. This section documents how the project handles GDPR obligations (Regulation EU 2016/679) and the wider ethical considerations of continuous in-home sensing.
19.1 Data inventory
| Data category | Where stored | Why it's collected | Identifiability |
|---|---|---|---|
| Apartment address (street, postal) | silver.apartment_metadata, dim_buildings (MySQL import) | Tie sensor data to physical location for analytics | Direct identifier |
| Apartment name + owner_user_id | dim_apartment | Internal join key + display label | Direct identifier |
| Sensor events (motion, door, CO2, power, temp, humidity) | silver.sensor_events, gold.fact_*_minute | Core analytics + ML predictions | Indirect — household behaviour patterns |
| Weather forecasts | silver.weather_forecasts | Energy consumption modelling | Not personal |
| MySQL users / passwords / emails | NEVER imported (skipped at silver layer) | — | Excluded |
19.2 Legal basis (Art. 6)
As an academic research project, the most appropriate legal basis is consent (Art. 6(1)(a)) — residents are informed of the sensors and agree to participate in the study. This consent is obtained at apartment onboarding and may be withdrawn at any time, which triggers data erasure (see 19.3).
19.3 Data subjects' rights
Right of access (Art. 15) — residents may request a full copy of their apartment's data via the project lead. Output: CSV export of all rows where apartment_key matches.
Right to rectification (Art. 16) — corrections to room labels or apartment metadata go through the data controller; no resident-facing edit surface.
Right to erasure (Art. 17) — deletion request triggers DELETE CASCADE from gold facts, silver events, and bronze archival folders for that apartment.
Right to data portability (Art. 20) — same export as access right, in CSV (machine-readable, structured).
Right to object (Art. 21) — residents may opt out of specific analyses (e.g. presence prediction) without leaving the study.
19.4 Data minimisation & purpose limitation (Art. 5)
The project explicitly excludes the MySQL users, relationships, actions tables (containing names, emails, phone numbers, gamification data) from the silver layer (see etl/bronze_to_silver/import_mysql_to_silver.pycomment block). Stated purpose is energy efficiency research and occupancy analytics — secondary uses (e.g. commercial profiling) are explicitly out of scope.
19.5 Storage limitation (Art. 5(1)(e))
Current state: bronze JSON archives are kept indefinitely; silver and gold tables grow without retention. Recommended retention policy (to be implemented before any production deployment):
| Layer | Retention | Rationale |
|---|---|---|
| Bronze | 1 year then archived to cold storage | Audit trail / reprocessing without keeping live |
| Silver | 2 years rolling | Sufficient for trend analysis |
| Gold | 5 years rolling, aggregated | Long-term reporting |
| Predictions | 1 year rolling | Model outputs lose relevance fast |
19.6 Integrity & confidentiality (Art. 32)
Database access control: split admin / app user. Only the app user (limited privileges) is in .env. Admin credentials never persisted.
Power BI row-level security: dim_apartment.owner_user_id used to scope what each user can see. Three roles: role_jimmy, role_jeremie, role_admin.
Network segregation: VM and source DB are on the school internal network. sFTP weather uses encrypted SSH transport (paramiko). No public endpoints.
Credentials in .env, gitignored. Never committed to source.
Future hardening (not yet implemented): TLS for the PostgreSQL listener, encryption-at-rest for the data volume.
19.7 Risk assessment (DPIA-lite)
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Bathroom motion sensor reveals intimate routines | Medium | High | Aggregate to coarser time grain in published dashboards |
| Door + window pattern reveals sleep/leave-home schedule | High | Medium | Restrict raw access to data controller; only patterns visible in BI |
| Energy consumption pattern reveals occupancy | High | Low | Aggregate at hourly minimum in any external sharing |
| Combination of apartment data + outdoor weather is reidentifiable | Low | High | Never publish exact apartment location in conjunction with sensor data |
| Bronze JSON kept forever leaks raw events on breach | Low | Medium | Implement retention policy (see 19.5) |
| Reidentification through cross-referencing with public datasets | Low | High | Anonymisation step (see 19.8) + restrict apartment metadata exports |
19.8 Anonymisation strategy + threat model
The pipeline is a toolmaker, not the data controller. Each deployment of the project (via the install wizard) creates a new instance under the deployer's control: their VM, their DB credentials, their decisions about who gets dashboard access. So the realistic threat surface for personal data is the Power BI dashboard accessed by non-technical users(e.g. residents, evaluators, teachers) — not the underlying tables, which are only reachable by the deployer's technical staff.
| Actor | Access path | Sees |
|---|---|---|
| Source data owner (school IT) | MySQL source DB | Plaintext PII (out of our control) |
| Deployer / tech staff | VM filesystem + Postgres locally | Plaintext bronze + silver + gold |
| Non-tech end users | Power BI dashboard only | Whatever the dashboard exposes |
Anonymisation effort is therefore concentrated where it actually matters: at the gold dimension layer, since Power BI consumes from gold. Bronze and silver retain identifiable values for legitimate research use — protected by VM access control, RLS credentials, and (recommended) disk-level encryption configured by the deployer.
19.9 What gets masked in gold (implemented)
On every populate_dimensions run (idempotent), the following columns of gold.dim_apartment are sanitised:
| Column | Original | After masking | Why |
|---|---|---|---|
| apartment_id (key) | jimmy / jeremie | unchanged | Internal join key, used by RLS |
| name | jimmy / jeremie | unchanged | Already a pseudonym (first name only, with consent); used by RLS |
| building_name | real house name from MySQL | Building <id> or NULL | Would leak physical location |
| owner_user_id | real user account identifier | NULL | Direct identifier — never needed for analytics |
We deliberately keep the name column (a common first name like “jimmy”). The legal and practical reasoning:
Under GDPR Art. 4(1), a common first name in isolation is not personal data — it does not identify a specific natural person without additional context (no surname, no email, no address, no photo).
It becomes personal data only when combined with other identifiers, which we explicitly remove (owner_user_id, building_name, address columns are NULL'd or masked).
Power BI row-level security uses apartment_id (and indirectly name) as a stable filter key — masking it would force a per-deployment .pbix update, which doesn't scale.
Residents have explicitly consented to the use of their first name as a label in the study.
Backlog #18 closed—Implemented at the gold layer in etl/silver_to_gold/populate_dimensions.py. Always runs (no opt-in / opt-out) since it only removes identifiers that no analytic query needs. Power BI and KNIME consume the masked values transparently. Real source values remain in silver / bronze for authorised technical use only.
19.10 Deployer obligations
What we provide vs what each deployer must configure on their installed instance:
| Aspect | Provided by the project | Deployer's responsibility |
|---|---|---|
| BI display anonymisation | Available (gold layer, opt-in) | Set BI_PSEUDONYMISE=true if needed |
| Row-level security (Power BI) | 3 roles defined (jimmy / jeremie / admin) | Assign users to roles |
| DB admin / app user split | Installer creates both | Don't share admin creds |
| VM access control | — | OS-level user accounts |
| Disk encryption | — | BitLocker / LUKS on the VM volume |
| Bronze retention | — | Periodic cleanup or cold-storage move |
| DSAR (data subject requests) | Documented procedure | Execute deletes/exports for their residents |
| Records of processing (Art. 30) | Template in this doc | Maintain their own register |
19.11 Ethics beyond GDPR
Informed consent: residents must understand both what data is collected and what conclusions can be drawn from it (presence patterns, lifestyle inference).
Beneficence: the project should produce tangible value to the residents (e.g. energy bill reduction insights), not only academic output.
Privacy by design: pseudonymisation, RLS, admin/app split are all in place from day one rather than retrofitted.
Transparency: residents should be able to view their own data through a personal dashboard (planned, not yet implemented).
Proportionality: the granularity of motion / door / window sensing is high. Any extension (e.g. cameras) would need a separate ethical review.
19.12 Compliance status
| Requirement | Status |
|---|---|
| Data inventory documented | Done (this section) |
| Lawful basis identified (consent) | Documented; consent process held by project lead |
| Skip personal MySQL tables (users etc.) | Done (import_mysql_to_silver.py) |
| Database admin / app user separation | Done (installer creates both) |
| Power BI row-level security | Done (3 roles) |
| BI-layer pseudonymisation | Done (#18 — gold dim_apartment, opt-in via BI_PSEUDONYMISE=true in .env) |
| Encrypted credentials in transit | Partial — sFTP yes, PG TLS to-do |
| Encryption at rest | Deployer responsibility (BitLocker / LUKS) |
| Bronze retention policy | Deployer responsibility — recommended in 19.5 |
| Records of processing activities (Art. 30) | Template in this doc; deployer maintains |
| Data Protection Impact Assessment (full) | Lite version done (19.7); full DPIA pending per deployment |
| Data Processing Agreement with school | Per deployment, not centralised |
| Resident-facing personal dashboard | Planned (future work) |
Status—The project is on a strong privacy-by-design track but is not production-ready from a GDPR standpoint. Items in the “to-do” rows above must be addressed before any data leaves the academic context (e.g. publishing results, sharing the dataset, scaling to more apartments).
Scalability
The current pipeline is sized for two apartments and was developed on a single Windows VM with one PostgreSQL instance. This section projects what happens as the number of apartments, the time horizon, and the number of concurrent dashboard users grow.
20.1 Current data rates (per apartment)
| Source | Rate | Daily volume |
|---|---|---|
| Sensor JSON (SMB) | 1 file/min × 24h | ~1,440 files/day, ~10–20 MB/day raw |
| MySQL dim sync | Manual or daily | ~50 KB total (static dimensions) |
| Weather forecasts | 1 CSV/day, 24 hourly forecasts | ~50 KB/day (compressed in PG) |
| silver.sensor_events | ~30 events/min × 1440 min × ~10 sensors | ~430,000 rows/day |
| gold.fact_*_minute | 1 row/min/room × ~6 rooms | ~8,640 rows/day |
20.2 1-year projections
| Layer | 1 apt × 1 year | 10 apt × 1 year | 100 apt × 1 year |
|---|---|---|---|
| Bronze (raw JSON) | ~5 GB | ~50 GB | ~500 GB |
| silver.sensor_events rows | ~157 M | ~1.57 B | ~15.7 B |
| silver size (compressed) | ~10 GB | ~100 GB | ~1 TB |
| gold.fact_*_minute rows | ~3.2 M | ~32 M | ~320 M |
| gold size (with indexes) | ~500 MB | ~5 GB | ~50 GB |
20.3 Bottlenecks at scale
| Component | Threshold | Symptom | Mitigation |
|---|---|---|---|
| Single PostgreSQL node | ~50–100 apartments | INSERT throughput drops, query plans degrade | Native partitioning by date_key + tune work_mem / shared_buffers / maintenance_work_mem |
| Single bronze disk | Months of accumulation on a single VM disk | Disk fills, slow scans | Cheap option: scripts/cleanup_bronze.py (delete already-ingested files older than N days). Long-term: move bronze to S3-compatible blob storage. |
| Watcher (single Python process) | ~500 files/min | Falls behind on the 1-minute fast loop | Run separate watcher per apartment cluster, or move to a real queue (Kafka/Redis Streams) |
| Power BI direct query | Many concurrent users | Slow refreshes, dashboard timeouts | Switch from Direct Query to Import + scheduled refresh; consider Power BI Premium / read replica |
| MySQL source DB load | Polling many tables | Read load on the school's source DB | CDC (Debezium) or batch sync at off-peak |
20.3.1 Bronze lifecycle (single-VM deployments)
On a single-VM deployment without object storage, bronze accumulates roughly 5 GB per apartment per year. Once it crosses ~50 GB, scans (filesystem walks, full-mode bulk_to_bronze) get noticeably slower.
The project ships with scripts/cleanup_bronze.py as a pragmatic mitigation:
Reads silver watermarks (silver.etl_watermark, silver.weather_watermark) — the same tables the ETL uses to skip already-processed files.
For each filename in the watermarks where processed_at is older than BRONZE_RETENTION_DAYS (default 30, configurable in .env), deletes the bronze file.
Cleans up empty folders left behind.
Set BRONZE_RETENTION_DAYS=-1 to disable cleanup entirely (keep bronze forever).
Run manually (cron) or hook into the watcher's nightly tick. Example: python scripts/cleanup_bronze.py --dry-run to preview, then without --dry-run to delete.
Trade-off—Bronze becomes a bounded buffer (last N days) instead of an immutable archive. You lose the ability to re-derive silver from bronze for files older than the retention window. The original source data (SMB share, sFTP server, MySQL) is still available for full reprocessing if absolutely needed. For most academic deployments this is the right trade-off; production systems should also configure cold storage.
20.4 Recommended evolution path
Up to ~10 apartments: current architecture works without changes.
10–50 apartments: add native PostgreSQL partitioning by month on fact_*_minute tables; add btree indexes on (apartment_key, date_key) for analytical queries; tune work_mem to ~64MB.
50–100 apartments: move bronze to S3/MinIO; consider TimescaleDB extension for time-series compression on fact_*_minute (typically 10× compression on these workloads).
100+ apartments: shard PostgreSQL by apartment_key; introduce Kafka or Redis Streams for ingestion; deploy watcher per shard.
1000+ apartments: split read and write workloads, use read replicas for Power BI; consider migrating to a managed time-series DB (InfluxDB Cloud, Timescale Cloud).
20.5 Power BI RLS at scale
The current dashboard defines one Power BI role per apartment (role_jimmy, role_jeremie, role_admin). This works for 2–5 apartments but becomes unmanageable at scale: each new apartment requires a .pbix edit + republish.
Recommended evolution for > 5 apartments: replace per-apartment roles with a single dynamic role driven by a user-to-apartment mapping table.
Add to gold (one-time setup):
| Step | Action |
|---|---|
| 1. | Create gold.dim_user_apartment_map (user_email TEXT, apartment_key INTEGER) |
| 2. | Seed it: 1 row per (user, apartment they can see) pair. Admin gets all apartments |
| 3. | In Power BI: replace all role_* with a single role_user using DAX with USERPRINCIPALNAME() |
| 4. | Filter expression: 'gold dim_apartment'[apartment_key] IN CALCULATETABLE(VALUES('gold dim_user_apartment_map'[apartment_key]), 'gold dim_user_apartment_map'[user_email] = USERPRINCIPALNAME()) |
| 5. | Adding a new apartment = INSERT 1 row. No .pbix change needed |
| 6. | Adding a new user = INSERT 1 row per apartment they can see |
This pattern also strengthens the GDPR posture: each user's access is explicitly granted via the mapping table (auditable via SQL), rather than being implicit in a role assignment.
20.6 Compute & cost projections (rough)
For 50 apartments, 5 years of retention, on Swiss-hosted cloud infrastructure (rough order-of-magnitude figures, not vendor quotes):
| Resource | Spec | Approx. monthly cost (CHF) |
|---|---|---|
| Postgres (managed, 8 vCPU / 32 GB / 500 GB SSD) | Single node | ~250 |
| Object storage (bronze, ~3 TB) | S3-compatible | ~70 |
| VM (watcher + ETL workers, 4 vCPU / 16 GB) | Always-on | ~80 |
| Power BI Pro (per user × 10) | License | ~110 |
| Total (50 apartments, 5-year horizon) | — | ~510 CHF/month |
Bottom line—The architecture scales linearly to ~50 apartments without structural changes. Beyond that, partitioning + object storage are straightforward extensions. Beyond ~500, the design needs a rethink (sharding, real ingestion queue). The biggest hidden cost at scale is operational, not infrastructure: monitoring, alerting, and on-call coverage. See Section 14 (Services & Processes) for the current monitoring story.
Architecture Decisions
Context: Need to process 245k+ files from SMB with I/O-bound operations (file copy, DB writes).
Decision: Use Python with ThreadPoolExecutor for parallel file copy and ProcessPoolExecutor for CPU-bound JSON parsing. asyncio reserved for future real-time ingestion.
Consequence: Team familiarity, rich ecosystem (pandas, SQLAlchemy, paramiko). Acceptable performance for batch processing.
Context: Need a relational DB that supports schema separation, OLAP queries, and Power BI connector.
Decision: PostgreSQL 15+ with separate silver and gold schemas in the same database. Two databases for dev/prod isolation.
Consequence: Free, multi-user, Power BI native connector, excellent OLAP performance. Hosted on the VM.
Context: Evaluated Apache Airflow for orchestration, but single-VM deployment made it overkill.
Decision: Lightweight Python watcher script with 60s loop, prediction-based file discovery, and daily weather trigger.
Consequence: Zero infrastructure overhead, easy to understand and maintain. Trade-off: no built-in alerting or DAG visualization.
Context: Need immutable, auditable raw data storage that is easy to inspect and replay.
Decision: Local NTFS file system with timestamped directory structure (YYYY/MM/DD/HH/).
Consequence: Zero overhead, easy to browse in File Explorer, cloud-agnostic. Trade-off: no built-in versioning.
Context: Deploying the project required ~10 manual steps (clone, .env, venv, deps, DB create, schemas, ETL, Power BI). Each step had failure modes a non-technical user couldn't recover from.
Decision: Build a client-side install wizard at /install. The wizard collects connection details, generates a single self-contained Python installer with the .env values baked in, and the user runs one command. The installer auto-mounts SMB on Windows, creates the DB + app user as admin, runs the full ETL, verifies row counts, and offers to open Power BI.
Consequence: Deploy-for-dummies story works in ~30-60 min start to finish. No credentials uploaded to any server (form is fully client-side). Installer is regenerated per-deployment so connection details aren't shared. Trade-off: more code to maintain (installer template + wizard), and the .pbix still needs manual data-source repointing on first open (documented in next steps).
Context: First install attempts failed with 'permission denied' because the app user couldn't CREATE SCHEMA, and silver tables ended up owned by postgres so clean_weather couldn't ALTER them.
Decision: The installer asks for both admin and app credentials. Admin is used only at install time (creates the DB, creates the app user, grants all privileges, transfers ownership of silver tables to the app user). Only the app user is written to .env. Admin credentials live exclusively in the downloaded installer script and are discarded after the run.
Consequence: Pipeline runs as a least-privilege user during normal operation. Admin secret never persists in .env. Operations that legitimately need admin (DB / user creation, schema ownership) work cleanly. Trade-off: deployer needs both credentials, and both must be entered in the wizard.
Context: Power BI dashboards are read by non-technical users (residents, evaluators). The dim_apartment table imported from MySQL contains owner identifiers and building names that would leak personal data and physical location.
Decision: After populate_dimensions, sanitise dim_apartment.owner_user_id (NULL) and dim_apartment.building_name ('Building <id>' or NULL). Keep dim_apartment.name as-is (a common first name like 'jimmy'), because (a) under GDPR Art. 4(1) a common first name in isolation is not personal data, (b) Power BI RLS uses it as a stable filter key, (c) residents have given informed consent. The sanitisation always runs (no opt-in flag) since it only removes identifiers no analytic query needs.
Consequence: Default deployment is GDPR-friendly. Existing per-apartment Power BI roles keep working without .pbix edits. Trade-off: scaling to many apartments still requires per-apartment role definitions; section 20.5 documents the USERPRINCIPALNAME-based mapping pattern for > 5 apartments.
Context: Reaffirming the orchestration choice now that the deploy story is in place. Adding Airflow at this stage would have meant: another DB (Airflow metastore), web UI, executor configuration, deployment complexity, dependency conflicts (Airflow pins old SQLAlchemy versions).
Decision: Keep the Python watcher as the single orchestrator. It is started once (manually or via the installer prompt) and runs forever, polling SMB every 60s and triggering the daily weather subprocess. Status is observable via install.log, scripts/status.py, and direct DB row counts.
Consequence: Zero additional infrastructure. Single failure point that is also trivially restartable. Trade-off: no built-in DAG visualization, retries, or scheduling beyond the daily weather hook. For this scale (1 VM, < 10 apartments), this is the right size.