data-cycle / uc2

Technical Documentation

Apartment Domotics & IoT Sensors — End-to-end data pipeline

version

1.0

date

April 2, 2026

team

Group 14, HES-SO Valais

project

64-61 Data Cycle

System Overview

The Data Cycle project (Use Case 2) processes IoT sensor data from two smart apartments in Valais, Switzerland. The system ingests raw sensor readings (temperature, humidity, CO2, motion, door/window, energy consumption) and weather forecasts, transforms them through a medallion architecture (Bronze, Silver, Gold), and serves them to BI dashboards and ML models.


Project	Data Cycle -- UC2 Apartment Domotics & IoT Sensors
Team	Group 14 -- HES-SO Valais, Haute Ecole de Gestion, 2026
Data period	August 18 -- October 27, 2023
Apartments	2 (Jimmy, Jeremie) -- ~245,000 JSON sensor files
Data volume	15M+ rows in Silver, 6 sensor types, 4 weather measurements
VM	Windows 11, on-premise (school network)
Repository	github.com/dehlya/data-cycle-domotic

Architecture

The system follows the medallion architecture pattern with three data layers, each serving a distinct purpose. Data flows from external sources through Bronze (raw), Silver (cleaned), and Gold (aggregated) before reaching BI tools and ML models.

Sources→Bronze→Silver→Gold→BI / ML

layer	purpose	storage	grain
Bronze	Raw, immutable copy of source data	File system (NTFS)	Per-file (JSON, CSV)
Silver	Cleaned, deduplicated, full-resolution data	PostgreSQL (silver schema)	Per-reading (1 row per sensor per minute)
Gold	Aggregated, business-ready star schema	PostgreSQL (gold schema)	Per-minute (facts) and per-day (health)

Key principle—Every ETL script is idempotent and resume-capable. Scripts can be interrupted and re-run safely without creating duplicates, thanks to watermark tracking and ON CONFLICT upserts.

Technology Stack

layer	technology	version	purpose
Ingestion	Python	3.11+	All ETL scripts, async I/O, parallel processing
Orchestration	watcher.py	custom	60s loop, prediction-based, nightly scan, daily weather
Bronze storage	NTFS file system	Windows 11	Timestamped directory structure, immutable raw files
Silver / Gold	PostgreSQL	15+	Relational DB with schema separation (silver, gold)
ETL	pandas + SQLAlchemy	2.1 / 2.0	DataFrame transforms, parameterized SQL, connection pooling
sFTP	paramiko	3.4	SSH transport for weather CSV download
BI	Power BI + SAP SAC	latest	Energy/environment dashboards + presence analytics
ML	KNIME	latest	Energy forecasting, presence prediction workflows
Website	Next.js + Tailwind	14.1 / 3	Project documentation and team hub

Data Sources

source 1 — sensor JSON (SMB share)

Two apartments (Jimmy, Jeremie) generate one JSON file per minute per apartment, stored on an SMB network share (Z:\). Each file contains nested sensor readings: plugs, doors/windows, motions, meteo stations, humidity sensors, and consumption meters.

Filename format—DD.MM.YYYY HHMM_ApartmentName_received.json — e.g. 18.08.2023 1000_JimmyLoup_received.json

apartment (filename)	mapped to	bronze path
JimmyLoup	jimmy	bronze/jimmy/YYYY/MM/DD/HH/
JeremieVianin	jeremie	bronze/jeremie/YYYY/MM/DD/HH/

JSON structure (top-level fields)

field	type	description
user	string	Apartment owner identifier
api_token	string	Auth token (masked/removed in Silver)
datetime	string	Collection timestamp — DD.MM.YYYY HH:MM
plugs	object	Smart plugs keyed by room name
doorsWindows	object	Door/window sensors keyed by room name
motions	object	Motion sensors keyed by room name
meteos	object	Environmental sensors nested under meteo key
humidities	object	Humidity sensors keyed by room name
consumptions	object	Whole-house energy consumption keyed by House

sensor types — fields extracted per category

sensor type	JSON key	fields extracted	units
plug	plugs	power, total, temperature	W, Wh, °C
door	doorsWindows (type=door)	open, battery	bool, %
window	doorsWindows (type=window)	open, battery	bool, %
motion	motions	motion, light, temperature	bool, lux, °C
meteo	meteos.meteo	temperature_c, co2_ppm, humidity_pct, noise_db, pressure_hpa, battery	°C, ppm, %, dB, hPa, %
humidity	humidities	temperature, humidity, battery	°C, %, %
consumption	consumptions	total_power, power1-3, current1-3, voltage1-3	W, A, V

source 2 — MySQL database (school network)

Static reference tables from the school’s MySQL database (pidb at 10.130.25.152:3306). Contains building metadata, room details, sensor mappings, device inventories, and energy profiles.

MySQL table	Silver table	description
buildings	dim_buildings	Apartment metadata, location, building year
buildingtype	dim_building_types	Maison / Appartement lookup
rooms	dim_rooms	Room details, sensor counts, orientation, m²
sensors	dim_sensors	Sensor IPs mapped to rooms
devices	dim_devices	Appliances per room (fridge, washer, etc.)
profilereference	ref_energy_profiles	Reference energy consumption kWh/yr
profile	ref_power_snapshots	Power consumption snapshots over time
parameters	ref_parameters	Threshold configs per building
parameterstype	ref_parameters_type	Parameter type lookup
dierrors	log_sensor_errors	Sensor error logs — null values, failures

GDPR—Tables skipped: users (names, emails, passwords, phones), actions, achievements, badges, events, eventsgeneric, categories, userrelationships — personal data or irrelevant for analytics.

source 3 — weather CSV (sFTP)

Daily weather forecast files from an sFTP server (/Meteo2). File format: Pred_YYYY-MM-DD.csv. Each CSV has columns: Time, Value, Prediction, Site, Measurement, Unit.

measurement code	mapped column	unit
PRED_T_2M_ctrl	temperature_c	°C
PRED_RELHUM_2M_ctrl	humidity_pct	%
PRED_TOT_PREC_ctrl	precipitation_mm	mm
PRED_GLOB_ctrl	radiation_wm2	W/m²

Sentinel value—-99999.0 in the Value column means missing data. Converted to NULL during cleaning.

Pipeline: Sources to Bronze

watcher.py — orchestrator

Continuous loop running every 60 seconds. Instead of scanning 245k+ files on SMB (~72s), it predicts the next expected filenames from the last known timestamp and checks with .exists() calls (~0.01s). A nightly full scan at midnight catches any missed files. Daily at 07:30, it triggers the weather pipeline as a subprocess.

bulk_to_bronze.py — sensor file copy

Copies new JSON files from SMB to local Bronze storage. Two modes: prediction (default, fast) and full scan (--full flag). Uses 16 parallel threads. Files are stored in timestamped folders: bronze/jimmy/YYYY/MM/DD/HH/filename.json. Skips files that already exist locally.

weather_download.py — sFTP download

Connects to the sFTP server via paramiko, lists CSV files in the remote directory, downloads new ones to bronze/weather/YYYY/MM/DD/. Configurable retry logic (default: 3 attempts, 600s delay). Logs to logs/weather_download.log.

metric	prediction mode	full scan mode
Files on SMB	246,000	246,000
New files = 2	0.4s	~80s
New files = 1,245	20s	~85s

Pipeline: Bronze to Silver

flatten_sensors.py — JSON to sensor_events

Parses Bronze JSON files, extracts all sensor readings, normalizes room names (Bhroom to Bathroom, Bdroom to Bedroom, etc.), flags outliers, and upserts into silver.sensor_events. Uses 8 parallel workers with batches of 2,000 files. Watermark tracking via silver.etl_watermark prevents reprocessing.

Load watermark -- reads processed filenames from silver.etl_watermark

Find new files (newest first) -- walks Bronze folders backwards, stops after 50 consecutive already-processed files

Parallel processing -- splits into batches of 2,000, processed by 8 workers (ProcessPoolExecutor)

Each worker: opens JSON, parses datetime, calls flatten() to extract all sensor readings

Upsert -- inserts into silver.sensor_events with ON CONFLICT DO UPDATE

Mark done -- adds processed filenames to the watermark table

room normalization

Raw JSON room names are inconsistent. The flatten function maps them to standardized names:

raw name	normalized
Bhroom	Bathroom
Bdroom	Bedroom
Livingroom	Living Room
Office	Office
Kitchen	Kitchen
Laundry	Laundry
Outdoor	Outdoor
House	House

outlier detection bounds

Values outside these physical bounds are flagged with is_outlier = TRUE but not removed. This preserves data for audit while allowing BI/Gold to filter.

field	min	max	unit
temperature_c	-20	60	°C
humidity_pct	0	100	%
co2_ppm	300	5,000	ppm
noise_db	0	140	dB
pressure_hpa	870	1,085	hPa
power	0	10,000	W
battery	0	100	%

import_mysql_to_silver.py — dimension tables

Snapshots 10 reference tables from the school MySQL database into PostgreSQL Silver. All columns imported as TEXT (type casting done in Gold). Idempotent: DROP + CREATE on each run.

clean_weather.py — weather CSV processing

Reads raw weather CSVs from Bronze, validates required columns (Time, Value, Prediction, Site, Measurement, Unit), cleans timestamps as UTC, filters by WEATHER_MIN_YEAR, deduplicates by timestamp/site/measurement, maps 4 measurement types to standardized columns, replaces sentinel values (-99999.0) with NULL, pivots from long to wide format, flags outliers, and upserts into silver.weather_clean. Watermark via silver.weather_watermark. Logs row-drop counts at each step to logs/clean_weather.log.

weather outlier bounds

field	min	max
temperature_c	-50	60
humidity_pct	0	100
precipitation_mm	0	500
radiation_wm2	0	1,500

scenario	files	rows	time
First run (full backfill)	245,000	15M+	~3.5 hours
Incremental (after watcher)	2-20	100-1,500	2-4 seconds
Nothing new	0	0	1-2 seconds

Pipeline: Silver to Gold

create_gold.py — schema creation

Creates the gold schema with admin privileges, then creates all dimension and fact tables as the app user (so the app user owns them). Idempotent: uses CREATE TABLE IF NOT EXISTS. Creates indexes on date_key, apartment_key, room_key for fast BI queries.

populate_gold.py — 9-step process

dim_date -- extract unique dates from sensor_events, compute day_of_week, week, month, year, is_weekend

dim_datetime -- extract minute-level timestamps, truncate to minute grain, compute time attributes

dim_apartment -- distinct apartments, enriched from silver.dim_buildings (building_name, building_id)

dim_room -- distinct rooms per apartment, linked via apartment_key

dim_device -- synthetic device_id (apartment_room_sensortype), linked to room_key

fact_energy_minute -- GROUP BY minute/device: MAX(power), MAX(total energy), Wh to kWh conversion

fact_environment_minute -- GROUP BY minute/room: MAX per metric, BOOL_OR for window/door flags

fact_presence_minute -- GROUP BY minute/room: SUM(motion), BOOL_OR(motion OR door open) = presence

fact_device_health_day -- GROUP BY day/device: MIN/AVG battery percentage

Pattern—All fact tables use GROUP BY date_trunc('minute', timestamp) with CASE WHEN pivots. Upserts via ON CONFLICT DO UPDATE. Currently a full reload each run -- future: watermark-based incremental.

Pipeline: Gold to BI / ML

BI dashboards (planned)

dashboard	tool	fact table	key metrics
Energy	Power BI	fact_energy_minute	Power (W), energy (kWh), cost (CHF)
Environment	Power BI	fact_environment_minute	Temperature, humidity, CO2, noise, pressure
Presence	SAP Analytics Cloud	fact_presence_minute	Motion count, presence flag, door activity
Sensor Health	Power BI	fact_device_health_day	Battery levels, uptime, error counts

row-level security (RLS)

Each apartment owner only sees their own data. Implemented via Power BI RLS roles filtering on apartment_key. Three roles: role_jimmy, role_jeremie (filtered), and role_admin (unfiltered, all apartments).

ML models

Built with KNIME Analytics Platform. Two use cases at the 15-minute room grain: room occupancy prediction and room energy consumption prediction. Each use case is implemented as a benchmark workflow (model comparison & selection) plus a prediction workflow (operational scoring & database writing). Results land in gold.fact_prediction. See Section 09 for the full methodology.

ML Workflows (KNIME)

Context and scope

This section presents the KNIME workflows developed for the Data Cycle Project in order to address two analytical use cases at room level:

Occupancy prediction, based on motion-related sensor events.

Energy consumption prediction, based on room-level power measurements enriched with weather forecast variables.

For each use case, the implementation follows a two-workflow logic:

A benchmark workflow, whose role is to prepare the dataset, compare several candidate models, and select the most suitable one.

A prediction workflow, whose role is to reuse the validated preparation logic, apply the selected model, reconstruct a business-readable output, and write the prediction results into PostgreSQL.

The four workflows documented in this report are therefore the following: Motion_Benchmark, Motion_Prediction, Consumption_Benchmark, Consumption_Prediction. This separation between benchmarking and prediction reflects a deliberate methodological choice: the benchmark workflow is used for model comparison and selection, whereas the prediction workflow is intended for operational scoring and database writing.

Part I — Occupancy prediction workflows

1. Business objective

The objective of the first use case is to predict whether a room is occupied during a 15-minute time window. Since the source system does not provide a direct occupancy indicator, occupancy is inferred from motion sensor events.

The target variable is defined as follows:

occupied = 1 if at least one motion = 1 event is observed within the 15-minute window.

occupied = 0 otherwise.

Consequently, each final observation represents one room, in one apartment, at one specific 15-minute timestamp, with an associated binary occupancy status.

1.1 Selection of relevant rooms

Following exploratory analysis, only the following rooms were retained for the occupancy model: Kitchen, Living Room, Office. The remaining rooms were excluded because they did not provide a sufficiently consistent motion signal to support a reliable target construction.

2. Source data and selected variables

The main source table for the occupancy use case is sensor_events. The useful fields selected from this source include apartment, room, sensor_type, field, value, unit, timestamp, and is_outlier. Initial filtering removes outliers, external zones, and all records not relevant to occupancy modeling.

3. Shared preparation logic for Motion workflows

The benchmark and prediction workflows for occupancy share the same preparation logic. This common part ensures that the selected model is trained and later applied on data built according to the same transformation rules.

3.1 Database access

The workflow begins with the standard database access chain: PostgreSQL Connector → DB Table Selector → DB Row Filter → DB Column Filter → DB Reader. This stage isolates the relevant subset of sensor event data required for subsequent local KNIME transformations.

3.2 Time standardisation

A Time Rounder node is used to round the raw event timestamp to a 15-minute grain. This operation defines the future modeling grain as the combination of apartment, room, rounded timestamp.

3.3 Construction of the base grid

A GroupBy node is used to build a base grid containing all distinct combinations of apartment, room, 15-minute window. This base grid acts as the backbone of the final dataset. It ensures that all subsequent aggregated signals are aligned on the same temporal structure.

3.4 Construction of the occupancy label

A dedicated motion branch is created from the filtered source data: Row Filter on sensor_type = motion and field = motion → Rule Engine → GroupBy → Column Renamer. The logic is the following:

A motion event is converted into motion_flag = 1 when value = 1, otherwise 0.

occupied is obtained through MAX(motion_flag) over the 15-minute window.

motion_count is obtained through SUM(motion_flag).

This procedure yields a binary and traceable occupancy target derived from observable sensor behaviour.

3.5 Integration of complementary behavioural signals

Two additional branches are prepared to enrich the occupancy dataset.

Door branch. This branch filters the source data on sensor_type = door and field = open, creates a binary open flag, and aggregates it into door_open_count and door_open_any.

Window branch. This branch follows the same logic for windows and produces window_open_count and window_open_any.

Each aggregated branch is reintegrated into the base grid through a Left Join on apartment, room, timestamp (Rounded). Missing values are replaced with 0 usingMissing Value, so that the absence of an event is interpreted as absence of activity rather than missing data.

3.6 Time-derived explanatory variables

Time-derived features are created using Date&Time Part Extractorand Rule Engine. The resulting variables are: Hour, Day of Week (Number), Day of Week (Name), Month (Number), Month (Name), is_weekend. These variables allow the model to capture temporal regularities in room occupancy.

3.7 Room filtering for modeling

A Row Filter is used to retain only the rooms previously identified as usable for occupancy prediction: Kitchen, Living Room, Office.

3.8 Lag-based temporal memory

The following nodes are used to generate lag features: Sorter → Lag Column → Column Renamer → Missing Value. The following lag variables are created:occupied_lag_1, motion_count_lag_1,door_open_count_lag_1, window_open_count_lag_1. These features encode the immediate past state of the room and provide short-term temporal memory to the model.

3.9 Column selection for machine learning

A Column Filter is then used to retain only the useful modeling variables: occupied, apartment, room, Hour, Day of Week (Number), is_weekend, occupied_lag_1, motion_count_lag_1, door_open_count_lag_1, window_open_count_lag_1.

3.10 Encoding of categorical variables

A One to Many node is applied to apartment and room. This transformation converts categories into binary indicators and avoids introducing an artificial ordinal relationship between rooms or apartments.

3.11 Target formatting

Because the selected learners are nominal classification models, the targetoccupied is converted from numeric to string using aNumber to String node.

4. Workflow 1 — Motion_Benchmark

The purpose of the benchmark workflow is to compare several classification models and select the most suitable one for room occupancy prediction.

Train/test split. A Table Partitioner node is used to split the dataset into 80% training data and 20% test data. The selected strategy is First rows, which is consistent with the temporal nature of the use case because the dataset is chronologically ordered before partitioning, thereby preserving a sequential split between past observations and more recent observations.

Candidate models. Three classification branches are prepared: Logistic Regression, Decision Tree, Random Forest.

Evaluation procedure. Each branch follows the same evaluation logic: Learner → Predictor → Scorer → ROC Curve (JavaScript). The benchmark results are consolidated into a comparison table.

Model selection criterion. The final model is selected on the basis of AUC.

Selected model—logistic_regression — identified by the benchmark as the best-performing model for the occupancy use case.

Methodological role of the workflow. The benchmark workflow is used exclusively to prepare the training and test data, compare alternative learners, evaluate model performance, and justify the final model selection. It is not used for database writing.

5. Workflow 2 — Motion_Prediction

The purpose of the prediction workflow is to apply the selected occupancy model to prepared data, reconstruct a readable business output, and write the final prediction table into PostgreSQL.

Shared logic with the benchmark workflow. This workflow reuses the same preparation chain as the benchmark workflow (data access, time rounding, base grid construction, target and feature engineering, room filtering, categorical encoding). This ensures strict consistency between model training and operational scoring.

Model application. The selected learner and predictor are Logistic Regression Learner and Logistic Regression Predictor. The predictor generates the column Prediction (occupied).

Reconstruction of a business-readable output. AJoiner is used to reattach readable business columns to the prediction result, including apartment, room, timestamp (Rounded), observed occupied value. The join is performed on RowID, assuming row identity has been preserved throughout the preparation pipeline.

Final table prepared for PostgreSQL. After renaming and enrichment, the final table contains the following columns. The constantsmodel_name = logistic_regression and target = Presenceare added.

column	description
predicted_occupied	Model output for the 15-min window (0/1)
apartment	Readable apartment ID
room	Kitchen / Living Room / Office
timestamp_rounded	Start of the 15-minute window (UTC)
actual_occupied	Observed value from motion signal
model_name	Constant: logistic_regression
target	Constant: Presence

Database writing. The final prediction table is written into PostgreSQL using PostgreSQL Connector → DB Writer. A possible target table isfact_prediction.

Business meaning of the result. Each written row represents one room, in one apartment, at one specific 15-minute timestamp, with the predicted occupancy, the observed occupancy, and the model identifier. This makes it possible to answer the operational question: “Will this room be occupied at this precise timestamp?” The resulting table can subsequently be aggregated in Power BI to answer broader analytical questions such as room occupancy patterns by day, hour, or apartment.

Part II — Energy consumption prediction workflows

6. Business objective

The objective of the second use case is to predict the average room energy consumption over a 15-minute time window. The target is defined astarget_power_w = average power_w observed during the 15-minute window. Each observation therefore represents one room, in one apartment, at one precise 15-minute timestamp, with an observed mean power value. The final model aims to answer the following question: “What is the expected average power consumption of this room at this precise timestamp?”

7. Source data and selected variables

The main source table for the energy use case is fact_energy_minute. The workflow also uses the following supporting tables: dim_datetime,dim_room, dim_apartment, fact_weather_hour.

7.1 Selected variables from fact_energy_minute

The useful columns are datetime_key, room_key, apartment_key, power_w. A first filtering step keeps only valid measurements: is_valid = TRUE andpower_w IS NOT NULL.

7.2 Role of the dimensions

The dimension tables are used as follows: dim_datetime provides the actual timestamp via timestamp_utc; dim_room provides the readable room label room_name; dim_apartmentprovides the readable apartment label apartment_id.

7.3 Weather source

The weather-enrichment branch uses fact_weather_hour. Selected weather variables are prediction_date, temperature_c, humidity_pct, precipitation_mm, radiation_wm2. These variables are not treated as observed weather only, but as forecast-related explanatory variables used to improve future-oriented consumption prediction.

8. Shared preparation logic for Consumption workflows

The benchmark and prediction workflows for energy consumption share the same preparation logic.

8.1 Data access and first joins

The workflow begins with the following chain: PostgreSQL Connector → DB Table Selector → DB Row Filter → DB Column Filter → DB Joiner with dim_datetime → DB Joiner with dim_room → DB Joiner with dim_apartment → DB Reader. The objective is to extract a minimal but readable energy dataset before local time transformations.

8.2 Construction of the 15-minute energy target

The following nodes are used: Time Rounder → GroupBy → Column Renamer. The minute-level timestamp is first rounded to the 15-minute grain. AGroupBy then aggregates all measurements per room_key, apartment_key, timestamp_rounded. The target is created astarget_power_w = AVG(power_w). The workflow also keeps room_name, apartment_id, reading_count. The reading_count variable allows the workflow to monitor the number of minute-level values that contributed to each 15-minute aggregated observation.

8.3 Weather forecast preparation and anti-leakage logic

A dedicated weather branch is built in parallel.

8.3.1 Weather branch preparation. The branch begins with DB Table Selector on fact_weather_hour → DB Column Filter → DB Joiner with dim_datetime → DB Reader → Column Renamer. This createsweather_target_hour_ts, which represents the hour for which the weather forecast is valid.

8.3.2 Forecast-date control. Because several forecasts exist for the same target weather hour, a filtering rule is required to prevent information leakage. The following nodes are used: Date&Time to String → String to Date&Time → Rule Engine → Row Filter. The implemented rule keeps only weather rows such that prediction_date < target weather date. This implies that forecasts produced on the same day as the target hour are excluded, and only forecasts produced on previous days are accepted. This choice was made in order to reduce the risk of using information that would be unrealistically close to the predicted time.

8.3.3 Selection of the most recent valid forecast. Among the remaining admissible forecast rows, the workflow retains the most recent one for each target weather hour. This is achieved through Sorter → GroupBy → Column Renamer. The sorting order is (1) weather_target_hour_ts ascending, (2) prediction_date descending. The subsequent GroupBy keeps the first valid weather record for each target hour, thereby selecting the most recent admissible forecast.

8.4 Join between weather and energy data

On the energy branch, a second hourly rounding is created fromtimestamp_rounded using a Time Rounder. This producesweather_hour_ts. A Joiner is then used to attach the selected weather forecast to the energy dataset onweather_hour_ts = weather_target_hour_ts. Rows without weather information are removed after the join.

8.5 Time-derived explanatory variables

The enriched energy dataset is further processed usingDate&Time Part Extractor and Rule Engine. The resulting time-based variables are: hour, day_of_week_number, month_number, is_weekend.

8.6 Lag-based temporal memory

The temporal structure of the energy dataset is handled through Sorter → Group Loop Start → multiple Lag Column nodes → Loop End → Column Renamer → Row Filter. Lag computation is done inside grouped loops in order to prevent contamination across rooms and apartments.

The selected lag variables are power_lag_1,power_lag_4, power_lag_96, power_lag_672. These lags capture respectively the previous 15-minute observation, the previous hour, the same time one day earlier, and the same time one week earlier. Rows for which the required lag history is not available are removed before modeling.

8.7 Column selection for machine learning

A Column Filter is used to retain the explanatory variables required for the learners. The retained set includes: target_power_w, hour, day_of_week_number, month_number, is_weekend, power_lag_1, power_lag_4, power_lag_96, power_lag_672, temperature_c, humidity_pct, precipitation_mm, radiation_wm2, room_name, apartment_id. Technical columns used only for joining or control are removed before the learners.

8.8 Encoding of categorical variables

A One to Many node is applied to room_name and apartment_id. This transformation enables the regression models to exploit room-specific and apartment-specific effects.

9. Workflow 3 — Consumption_Benchmark

The benchmark workflow is used to compare several regression models and select the best one for room-level energy consumption prediction.

Train/test split. A Table Partitioner node is used with the following configuration: 80% training data, 20% test data, strategyFirst rows. This preserves a temporally coherent split between past observations used for training and more recent observations used for evaluation.

Candidate models. Three regression branches are prepared: Linear Regression, Simple Regression Tree, Tree Ensemble Regression.

Evaluation procedure. Each branch follows the same evaluation pattern: Learner → Predictor → Numeric Scorer. Two evaluation perspectives are considered.

9.4.1 Mean Signed Difference on the full test set. The full test set is used to compute the Mean Signed Difference (MSD). This metric indicates the direction of the average bias: a positive MSD means the model tends to underestimate actual consumption; a negative MSD means the model tends to overestimate actual consumption; a value close to zero indicates a limited global bias.

9.4.2 Mean Absolute Percentage Error on positive targets only.Because the Mean Absolute Percentage Error (MAPE) cannot be computed when the actual target is equal to zero, an additional filtered evaluation branch is built. A Row Filter first retains only rows for which the real target is strictly positive, after which a second Numeric Scorercomputes the MAPE on this subset. This approach does not modify the model itself. It only restricts the scope of the MAPE calculation to the rows where percentage error is mathematically meaningful.

Model selection criterion. The benchmark results are consolidated into a final comparison table. The selected model is chosen using the following two-step rule:

Retain only models for which MAPE < 10.

Among the remaining candidates, select the model with MSD closest to 0.

Selected model—linear_regression — best compromise between low percentage error and limited signed bias.

Methodological role of the workflow. This workflow is dedicated to model comparison, quality assessment, and final learner selection. It is not intended for database writing.

10. Workflow 4 — Consumption_Prediction

The objective of the prediction workflow is to apply the selected energy model, reconstruct a readable room-level prediction table, and write the results into PostgreSQL.

Shared logic with the benchmark workflow. This workflow reuses the same preparation logic as the benchmark workflow, including database extraction, 15-minute target construction, weather forecast enrichment, time-based features, lag engineering, and categorical encoding. This ensures that operational scoring is consistent with the benchmark preparation logic.

Model application. The selected modeling nodes are Linear Regression Learner and Regression Predictor. The predictor returns the predicted room-level power value.

Reconstruction of a business-readable output. AJoiner is used to reattach readable fields to the prediction result, such as apartment, room, timestamp_rounded, actual observed power value. This makes it possible to move from a purely technical scoring dataset to a table that can be directly interpreted by business or BI users.

Final table prepared for PostgreSQL. After renaming and enrichment, the final table contains the following columns. The constantsmodel_name = linear_regression and target = Consumptionare added.

column	description
predicted_power_w	Model output for the 15-min window (W)
apartment	Readable apartment ID
room	Readable room name
timestamp_rounded	Start of the 15-minute window (UTC)
actual_power_w	Observed average power
model_name	Constant: linear_regression
target	Constant: Consumption

Database writing. The final result is written into PostgreSQL using PostgreSQL Connector → DB Writer. A possible target table isfact_prediction.

Business meaning of the result. Each output row represents one room, in one apartment, at one specific 15-minute timestamp, with the predicted average power consumption, the observed average power consumption, and the model identifier. This allows the workflow to answer the operational question: “What is the predicted energy consumption of this room at this precise timestamp?” The resulting prediction table can later be aggregated in Power BI to analyse patterns such as room-level load, apartment-level usage, peak time windows, or the influence of weather on consumption.

11. Methodological synthesis

The four workflows follow a coherent methodological framework. First, each use case relies on a strict separation between benchmarking, which supports model comparison and justified selection, and prediction, which supports operational scoring and database integration. Second, the chosen grain of 15 minutes provides a balance between stability and temporal precision — it reduces minute-level noise while preserving a meaningful operational time resolution. Third, both use cases rely on explicit feature engineering: time-derived explanatory variables, lag features, categorical encoding, and, for energy consumption, weather forecast enrichment under an anti-leakage rule. Finally, the prediction workflows are designed not only to score data, but also to reconstruct business-readable outputs suitable for downstream reporting and dashboarding.

12. Executive summary

workflow	goal
Motion_Benchmark	Compare several classification models, evaluate their quality, select the best one using AUC.
Motion_Prediction	Apply the selected occupancy model, rebuild a readable prediction table, write the results into PostgreSQL.
Consumption_Benchmark	Compare several regression models, evaluate them using MAPE and MSD, select the best one according to the defined decision rule.
Consumption_Prediction	Apply the selected energy model, rebuild a readable room-level prediction table, write the results into PostgreSQL.

13. Final project result

At this stage, the project includes a complete benchmark workflow for room occupancy prediction, a complete prediction workflow for room occupancy scoring, a complete benchmark workflow for room energy consumption prediction, a complete prediction workflow for room energy scoring, PostgreSQL-ready prediction tables, and a solid foundation for downstream Power BI analysis.

Database Schemas

silver.sensor_events (main fact table)

column	type	description
id	BIGSERIAL	Primary key
apartment	VARCHAR(20)	'jimmy' or 'jeremie'
room	VARCHAR(50)	Normalized: 'Bathroom', 'Kitchen', 'Living Room', etc.
sensor_type	VARCHAR(20)	'plug', 'motion', 'meteo', 'door', 'window', 'humidity', 'consumption'
field	VARCHAR(50)	'power', 'temperature_c', 'co2_ppm', 'open', 'motion', etc.
value	FLOAT	The sensor reading
unit	VARCHAR(10)	'W', '°C', 'ppm', '%', 'bool', etc.
timestamp	TIMESTAMPTZ	When the reading was taken (UTC)
is_outlier	BOOLEAN	TRUE if value outside physical bounds

Constraints—UNIQUE (apartment, room, sensor_type, field, timestamp). Indexes on timestamp, apartment, sensor_type.

silver.weather_clean

column	type	description
id	BIGSERIAL	Primary key
timestamp	TIMESTAMPTZ	Forecast timestamp (UTC)
site	VARCHAR(50)	Weather station name
temperature_c	FLOAT	Predicted temperature
humidity_pct	FLOAT	Predicted humidity
precipitation_mm	FLOAT	Predicted precipitation
radiation_wm2	FLOAT	Predicted solar radiation
is_outlier	BOOLEAN	TRUE if any value outside bounds

Constraints—UNIQUE (timestamp, site). Index on timestamp.

silver — other tables

table	purpose	source
etl_watermark	Tracks processed sensor JSON filenames	flatten_sensors.py
weather_watermark	Tracks processed weather CSV filenames	clean_weather.py
dim_buildings	Building metadata	MySQL
dim_building_types	Maison / Appartement lookup	MySQL
dim_rooms	Room details, sensor counts, m²	MySQL
dim_sensors	Sensor IPs mapped to rooms	MySQL
dim_devices	Appliances per room	MySQL
ref_energy_profiles	Reference kWh/yr by type	MySQL
ref_power_snapshots	Power consumption snapshots	MySQL
ref_parameters	Threshold configs per building	MySQL
ref_parameters_type	Parameter type lookup	MySQL
log_sensor_errors	Sensor error logs	MySQL

gold schema — dimension tables

table	grain	PK	unique constraint	key columns
dim_datetime	1 minute	BIGINT (YYYYMMDDHHMM)	timestamp_utc	date_key, hour, minute, day_of_week, week, month, year, is_weekend
dim_date	1 day	INT (YYYYMMDD)	date	day_of_week, week, month, year, is_weekend, is_holiday
dim_apartment	apartment	SERIAL	apartment_id	name, building_id, building_name, floor
dim_room	room	SERIAL	(room_name, apartment_key)	room_type, apartment_key (FK)
dim_device	device	SERIAL	(device_id, sensor_type)	room_key (FK), device_type, is_active

gold schema — fact tables

table	grain	PK	FK references	measures
fact_energy_minute	1 min / device	(datetime_key, device_key)	dim_datetime, dim_date, dim_device, dim_room, dim_apartment	power_w, energy_wh, energy_kwh, cost_chf, counter_total, is_valid
fact_environment_minute	1 min / room	(datetime_key, room_key)	dim_datetime, dim_date, dim_room, dim_apartment	temperature_c, humidity_pct, co2_ppm, noise_db, pressure_hpa, window_open_flag, door_open_flag, is_anomaly
fact_presence_minute	1 min / room	(datetime_key, room_key)	dim_datetime, dim_date, dim_room, dim_apartment	motion_count, door_open_flag, presence_flag, presence_prob (NULL until ML)
fact_device_health_day	1 day / device	(date_key, device_key)	dim_date, dim_device, dim_room, dim_apartment	error_count, missing_readings, uptime_pct, battery_min_pct, battery_avg_pct

Indexes—All fact tables have indexes on date_key and apartment_key for fast BI filtering. Energy also indexed on room_key.

Scripts Reference

script	path	description	flags
watcher.py	ingestion/fast_flow/	Main loop: sensor ingestion (60s) + weather (daily)	--scan \| --weather
bulk_to_bronze.py	ingestion/fast_flow/	Copy sensor JSON from SMB to Bronze	--full
flatten_sensors.py	etl/bronze_to_silver/	Bronze JSON to silver.sensor_events
create_silver.py	etl/bronze_to_silver/	Create Silver schema + tables
import_mysql_to_silver.py	etl/bronze_to_silver/	MySQL dimensions to Silver
weather_download.py	ingestion/slow_flow/	Download weather CSV from sFTP
clean_weather.py	etl/bronze_to_silver/	Clean weather CSV to silver.weather_clean
create_gold.py	etl/silver_to_gold/	Create Gold star schema
populate_gold.py	etl/silver_to_gold/	Populate Gold dimensions + facts

Configuration

Security—The .env file contains credentials and must never be committed to git. It is listed in .gitignore. Share securely (e.g. via Teams, encrypted channel).

variable	required	description
DB_URL	yes	PostgreSQL app user connection string
DB_ADMIN_URL	optional	Admin connection for schema/DB creation (first run only)
MYSQL_URL	yes	MySQL source for dimension tables (school network)
SMB_PATH	yes	Mounted SMB share path (sensor JSON files)
BRONZE_ROOT	yes	Local Bronze storage folder
SFTP_HOST	optional	sFTP server for weather data
SFTP_PORT	optional	sFTP port (default: 22)
SFTP_USER	optional	sFTP username
SFTP_PASSWORD	optional	sFTP password
SFTP_PATH	optional	Remote sFTP directory (e.g. /Meteo2)
WEATHER_MIN_YEAR	optional	Ignore weather data before this year (default: 2023)
WEATHER_HOUR	optional	Hour to trigger daily weather pipeline (default: 7)
WEATHER_MIN	optional	Minute to trigger daily weather pipeline (default: 30)
LOG_DIR	optional	Directory for ETL log files (default: logs/)
SFTP_MAX_RETRIES	optional	Max sFTP connection retries (default: 3)
SFTP_RETRY_DELAY	optional	Seconds between retries (default: 600)

Deployment

Development

Database: domotic_dev

Config: .env

Purpose: testing changes, schema experiments

Production

Database: domotic_prod

Config: .env.prod

Purpose: BI dashboards connect here, stable data

deployment steps (dev to prod)

Pull latest code on VM: git pull origin main

Run schema scripts against prod: python create_silver.py / create_gold.py

Run import_mysql_to_silver.py against prod (refreshes dimension tables)

Run populate_gold.py against prod (rebuilds fact tables from Silver)

Verify row counts: SELECT COUNT(*) FROM gold.fact_energy_minute;

Restart watcher if running: stop old process, start new one with .env.prod

Services & Processes

service	description	start	stop
watcher.py	Main pipeline orchestrator	python ingestion/fast_flow/watcher.py	Ctrl+C
PostgreSQL	Silver + Gold storage	Windows service (auto)	services.msc
Next.js site	Documentation website	Auto-deploys on git push (Vercel)	n/a

Normal operation—Only the watcher needs to be running. PostgreSQL starts automatically as a Windows service. The website is hosted on Vercel and deploys automatically on git push to main.

Monitoring & Logs

log file	source	what to check
logs/weather_download.log	weather_download.py	sFTP connection errors, download failures, file counts
logs/clean_weather.log	clean_weather.py	Row drop counts per step, outlier counts, processing errors
stdout (terminal)	watcher.py	Pipeline runs, skip counts, timing, weather triggers
stdout (terminal)	flatten_sensors.py	Batch progress, row counts, errors per file

health checks

Is the watcher running? Check the terminal or process list for python watcher.py

Is data flowing? SELECT MAX(timestamp) FROM silver.sensor_events -- should be recent

Is Gold up to date? SELECT COUNT(*) FROM gold.fact_energy_minute -- should match expectations

Any weather errors? Check logs/weather_download.log for recent entries

Disk space? Check storage\bronze size -- grows ~2GB/month

Troubleshooting

problem	cause	fix
Watcher says 'SMB path not found'	Z: drive not mounted	Map network drive in File Explorer or check VPN
flatten_sensors.py is slow	First run processes ~245k files	Normal -- takes ~3.5h. Subsequent runs are seconds.
Schema creation fails with permission error	App user can't create schema	Set DB_ADMIN_URL in .env with superuser credentials
weather_download.py connection refused	sFTP unreachable or wrong credentials	Check SFTP_HOST, SFTP_USER, SFTP_PASSWORD. Check VPN.
populate_gold.py returns 0 rows	Silver is empty	Run flatten_sensors.py first
Duplicate key errors in upserts	Normal behavior	ON CONFLICT handles this. Not an error.
Log files not appearing	logs/ directory issue	Scripts auto-create it. Check LOG_DIR and write permissions.

Maintenance

task	frequency	steps
Rotate database password	Quarterly	Update PostgreSQL password, update DB_URL in .env, restart watcher
Rotate sFTP credentials	On change	Update SFTP_USER / SFTP_PASSWORD in .env, restart watcher
Check disk space (Bronze)	Monthly	Bronze grows ~2GB/month. Archive old years if needed.
PostgreSQL VACUUM	Weekly (auto)	autovacuum enabled by default. Manual: VACUUM ANALYZE silver.sensor_events;
Review log files	Weekly	Check logs/weather_download.log and logs/clean_weather.log
Backup database	Daily	pg_dump -Fc domotic_prod > backup_YYYYMMDD.dump

Security

Credentials

All credentials stored in .env (gitignored). No hardcoded passwords in source code. Shared via secure channels only (Teams, not email).

SQL injection

All database queries use SQLAlchemy parameterized statements. No string concatenation or f-strings in SQL.

Data privacy (GDPR)

MySQL tables containing personal data (users, relationships, emails, passwords) are explicitly skipped during import. Apartment names used as pseudonymized identifiers.

Network

VM runs on school internal network. SMB and MySQL are internal-only. sFTP weather uses encrypted SSH transport (paramiko). No public endpoints.

GDPR & Ethics

The pipeline collects sensor data from real apartments, which means it processes personal information about identifiable households. This section documents how the project handles GDPR obligations (Regulation EU 2016/679) and the wider ethical considerations of continuous in-home sensing.

19.1 Data inventory

Data category	Where stored	Why it's collected	Identifiability
Apartment address (street, postal)	silver.apartment_metadata, dim_buildings (MySQL import)	Tie sensor data to physical location for analytics	Direct identifier
Apartment name + owner_user_id	dim_apartment	Internal join key + display label	Direct identifier
Sensor events (motion, door, CO2, power, temp, humidity)	silver.sensor_events, gold.fact_*_minute	Core analytics + ML predictions	Indirect — household behaviour patterns
Weather forecasts	silver.weather_forecasts	Energy consumption modelling	Not personal
MySQL users / passwords / emails	NEVER imported (skipped at silver layer)	—	Excluded

19.2 Legal basis (Art. 6)

As an academic research project, the most appropriate legal basis is consent (Art. 6(1)(a)) — residents are informed of the sensors and agree to participate in the study. This consent is obtained at apartment onboarding and may be withdrawn at any time, which triggers data erasure (see 19.3).

19.3 Data subjects' rights

Right of access (Art. 15) — residents may request a full copy of their apartment's data via the project lead. Output: CSV export of all rows where apartment_key matches.

Right to rectification (Art. 16) — corrections to room labels or apartment metadata go through the data controller; no resident-facing edit surface.

Right to erasure (Art. 17) — deletion request triggers DELETE CASCADE from gold facts, silver events, and bronze archival folders for that apartment.

Right to data portability (Art. 20) — same export as access right, in CSV (machine-readable, structured).

Right to object (Art. 21) — residents may opt out of specific analyses (e.g. presence prediction) without leaving the study.

19.4 Data minimisation & purpose limitation (Art. 5)

The project explicitly excludes the MySQL users, relationships, actions tables (containing names, emails, phone numbers, gamification data) from the silver layer (see etl/bronze_to_silver/import_mysql_to_silver.pycomment block). Stated purpose is energy efficiency research and occupancy analytics — secondary uses (e.g. commercial profiling) are explicitly out of scope.

19.5 Storage limitation (Art. 5(1)(e))

Current state: bronze JSON archives are kept indefinitely; silver and gold tables grow without retention. Recommended retention policy (to be implemented before any production deployment):

Layer	Retention	Rationale
Bronze	1 year then archived to cold storage	Audit trail / reprocessing without keeping live
Silver	2 years rolling	Sufficient for trend analysis
Gold	5 years rolling, aggregated	Long-term reporting
Predictions	1 year rolling	Model outputs lose relevance fast

19.6 Integrity & confidentiality (Art. 32)

Database access control: split admin / app user. Only the app user (limited privileges) is in .env. Admin credentials never persisted.

Power BI row-level security: dim_apartment.owner_user_id used to scope what each user can see. Three roles: role_jimmy, role_jeremie, role_admin.

Network segregation: VM and source DB are on the school internal network. sFTP weather uses encrypted SSH transport (paramiko). No public endpoints.

Credentials in .env, gitignored. Never committed to source.

Future hardening (not yet implemented): TLS for the PostgreSQL listener, encryption-at-rest for the data volume.

19.7 Risk assessment (DPIA-lite)

Risk	Likelihood	Impact	Mitigation
Bathroom motion sensor reveals intimate routines	Medium	High	Aggregate to coarser time grain in published dashboards
Door + window pattern reveals sleep/leave-home schedule	High	Medium	Restrict raw access to data controller; only patterns visible in BI
Energy consumption pattern reveals occupancy	High	Low	Aggregate at hourly minimum in any external sharing
Combination of apartment data + outdoor weather is reidentifiable	Low	High	Never publish exact apartment location in conjunction with sensor data
Bronze JSON kept forever leaks raw events on breach	Low	Medium	Implement retention policy (see 19.5)
Reidentification through cross-referencing with public datasets	Low	High	Anonymisation step (see 19.8) + restrict apartment metadata exports

19.8 Anonymisation strategy + threat model

The pipeline is a toolmaker, not the data controller. Each deployment of the project (via the install wizard) creates a new instance under the deployer's control: their VM, their DB credentials, their decisions about who gets dashboard access. So the realistic threat surface for personal data is the Power BI dashboard accessed by non-technical users(e.g. residents, evaluators, teachers) — not the underlying tables, which are only reachable by the deployer's technical staff.

Actor	Access path	Sees
Source data owner (school IT)	MySQL source DB	Plaintext PII (out of our control)
Deployer / tech staff	VM filesystem + Postgres locally	Plaintext bronze + silver + gold
Non-tech end users	Power BI dashboard only	Whatever the dashboard exposes

Anonymisation effort is therefore concentrated where it actually matters: at the gold dimension layer, since Power BI consumes from gold. Bronze and silver retain identifiable values for legitimate research use — protected by VM access control, RLS credentials, and (recommended) disk-level encryption configured by the deployer.

19.9 What gets masked in gold (implemented)

On every populate_dimensions run (idempotent), the following columns of gold.dim_apartment are sanitised:

Column	Original	After masking	Why
apartment_id (key)	jimmy / jeremie	unchanged	Internal join key, used by RLS
name	jimmy / jeremie	unchanged	Already a pseudonym (first name only, with consent); used by RLS
building_name	real house name from MySQL	Building <id> or NULL	Would leak physical location
owner_user_id	real user account identifier	NULL	Direct identifier — never needed for analytics

We deliberately keep the name column (a common first name like “jimmy”). The legal and practical reasoning:

Under GDPR Art. 4(1), a common first name in isolation is not personal data — it does not identify a specific natural person without additional context (no surname, no email, no address, no photo).

It becomes personal data only when combined with other identifiers, which we explicitly remove (owner_user_id, building_name, address columns are NULL'd or masked).

Power BI row-level security uses apartment_id (and indirectly name) as a stable filter key — masking it would force a per-deployment .pbix update, which doesn't scale.

Residents have explicitly consented to the use of their first name as a label in the study.

Backlog #18 closed—Implemented at the gold layer in etl/silver_to_gold/populate_dimensions.py. Always runs (no opt-in / opt-out) since it only removes identifiers that no analytic query needs. Power BI and KNIME consume the masked values transparently. Real source values remain in silver / bronze for authorised technical use only.

19.10 Deployer obligations

What we provide vs what each deployer must configure on their installed instance:

Aspect	Provided by the project	Deployer's responsibility
BI display anonymisation	Available (gold layer, opt-in)	Set BI_PSEUDONYMISE=true if needed
Row-level security (Power BI)	3 roles defined (jimmy / jeremie / admin)	Assign users to roles
DB admin / app user split	Installer creates both	Don't share admin creds
VM access control	—	OS-level user accounts
Disk encryption	—	BitLocker / LUKS on the VM volume
Bronze retention	—	Periodic cleanup or cold-storage move
DSAR (data subject requests)	Documented procedure	Execute deletes/exports for their residents
Records of processing (Art. 30)	Template in this doc	Maintain their own register

19.11 Ethics beyond GDPR

Informed consent: residents must understand both what data is collected and what conclusions can be drawn from it (presence patterns, lifestyle inference).

Beneficence: the project should produce tangible value to the residents (e.g. energy bill reduction insights), not only academic output.

Privacy by design: pseudonymisation, RLS, admin/app split are all in place from day one rather than retrofitted.

Transparency: residents should be able to view their own data through a personal dashboard (planned, not yet implemented).

Proportionality: the granularity of motion / door / window sensing is high. Any extension (e.g. cameras) would need a separate ethical review.

19.12 Compliance status

Requirement	Status
Data inventory documented	Done (this section)
Lawful basis identified (consent)	Documented; consent process held by project lead
Skip personal MySQL tables (users etc.)	Done (import_mysql_to_silver.py)
Database admin / app user separation	Done (installer creates both)
Power BI row-level security	Done (3 roles)
BI-layer pseudonymisation	Done (#18 — gold dim_apartment, opt-in via BI_PSEUDONYMISE=true in .env)
Encrypted credentials in transit	Partial — sFTP yes, PG TLS to-do
Encryption at rest	Deployer responsibility (BitLocker / LUKS)
Bronze retention policy	Deployer responsibility — recommended in 19.5
Records of processing activities (Art. 30)	Template in this doc; deployer maintains
Data Protection Impact Assessment (full)	Lite version done (19.7); full DPIA pending per deployment
Data Processing Agreement with school	Per deployment, not centralised
Resident-facing personal dashboard	Planned (future work)

Status—The project is on a strong privacy-by-design track but is not production-ready from a GDPR standpoint. Items in the “to-do” rows above must be addressed before any data leaves the academic context (e.g. publishing results, sharing the dataset, scaling to more apartments).

Scalability

The current pipeline is sized for two apartments and was developed on a single Windows VM with one PostgreSQL instance. This section projects what happens as the number of apartments, the time horizon, and the number of concurrent dashboard users grow.

20.1 Current data rates (per apartment)

Source	Rate	Daily volume
Sensor JSON (SMB)	1 file/min × 24h	~1,440 files/day, ~10–20 MB/day raw
MySQL dim sync	Manual or daily	~50 KB total (static dimensions)
Weather forecasts	1 CSV/day, 24 hourly forecasts	~50 KB/day (compressed in PG)
silver.sensor_events	~30 events/min × 1440 min × ~10 sensors	~430,000 rows/day
gold.fact_*_minute	1 row/min/room × ~6 rooms	~8,640 rows/day

20.2 1-year projections

Layer	1 apt × 1 year	10 apt × 1 year	100 apt × 1 year
Bronze (raw JSON)	~5 GB	~50 GB	~500 GB
silver.sensor_events rows	~157 M	~1.57 B	~15.7 B
silver size (compressed)	~10 GB	~100 GB	~1 TB
gold.fact_*_minute rows	~3.2 M	~32 M	~320 M
gold size (with indexes)	~500 MB	~5 GB	~50 GB

20.3 Bottlenecks at scale

Component	Threshold	Symptom	Mitigation
Single PostgreSQL node	~50–100 apartments	INSERT throughput drops, query plans degrade	Native partitioning by date_key + tune work_mem / shared_buffers / maintenance_work_mem
Single bronze disk	Months of accumulation on a single VM disk	Disk fills, slow scans	Cheap option: scripts/cleanup_bronze.py (delete already-ingested files older than N days). Long-term: move bronze to S3-compatible blob storage.
Watcher (single Python process)	~500 files/min	Falls behind on the 1-minute fast loop	Run separate watcher per apartment cluster, or move to a real queue (Kafka/Redis Streams)
Power BI direct query	Many concurrent users	Slow refreshes, dashboard timeouts	Switch from Direct Query to Import + scheduled refresh; consider Power BI Premium / read replica
MySQL source DB load	Polling many tables	Read load on the school's source DB	CDC (Debezium) or batch sync at off-peak

20.3.1 Bronze lifecycle (single-VM deployments)

On a single-VM deployment without object storage, bronze accumulates roughly 5 GB per apartment per year. Once it crosses ~50 GB, scans (filesystem walks, full-mode bulk_to_bronze) get noticeably slower.

The project ships with scripts/cleanup_bronze.py as a pragmatic mitigation:

Reads silver watermarks (silver.etl_watermark, silver.weather_watermark) — the same tables the ETL uses to skip already-processed files.

For each filename in the watermarks where processed_at is older than BRONZE_RETENTION_DAYS (default 30, configurable in .env), deletes the bronze file.

Cleans up empty folders left behind.

Set BRONZE_RETENTION_DAYS=-1 to disable cleanup entirely (keep bronze forever).

Run manually (cron) or hook into the watcher's nightly tick. Example: python scripts/cleanup_bronze.py --dry-run to preview, then without --dry-run to delete.

Trade-off—Bronze becomes a bounded buffer (last N days) instead of an immutable archive. You lose the ability to re-derive silver from bronze for files older than the retention window. The original source data (SMB share, sFTP server, MySQL) is still available for full reprocessing if absolutely needed. For most academic deployments this is the right trade-off; production systems should also configure cold storage.

20.4 Recommended evolution path

Up to ~10 apartments: current architecture works without changes.

10–50 apartments: add native PostgreSQL partitioning by month on fact_*_minute tables; add btree indexes on (apartment_key, date_key) for analytical queries; tune work_mem to ~64MB.

50–100 apartments: move bronze to S3/MinIO; consider TimescaleDB extension for time-series compression on fact_*_minute (typically 10× compression on these workloads).

100+ apartments: shard PostgreSQL by apartment_key; introduce Kafka or Redis Streams for ingestion; deploy watcher per shard.

1000+ apartments: split read and write workloads, use read replicas for Power BI; consider migrating to a managed time-series DB (InfluxDB Cloud, Timescale Cloud).

20.5 Power BI RLS at scale

The current dashboard defines one Power BI role per apartment (role_jimmy, role_jeremie, role_admin). This works for 2–5 apartments but becomes unmanageable at scale: each new apartment requires a .pbix edit + republish.

Recommended evolution for > 5 apartments: replace per-apartment roles with a single dynamic role driven by a user-to-apartment mapping table.

Add to gold (one-time setup):

Step	Action
1.	Create gold.dim_user_apartment_map (user_email TEXT, apartment_key INTEGER)
2.	Seed it: 1 row per (user, apartment they can see) pair. Admin gets all apartments
3.	In Power BI: replace all role_* with a single role_user using DAX with USERPRINCIPALNAME()
4.	Filter expression: 'gold dim_apartment'[apartment_key] IN CALCULATETABLE(VALUES('gold dim_user_apartment_map'[apartment_key]), 'gold dim_user_apartment_map'[user_email] = USERPRINCIPALNAME())
5.	Adding a new apartment = INSERT 1 row. No .pbix change needed
6.	Adding a new user = INSERT 1 row per apartment they can see

This pattern also strengthens the GDPR posture: each user's access is explicitly granted via the mapping table (auditable via SQL), rather than being implicit in a role assignment.

20.6 Compute & cost projections (rough)

For 50 apartments, 5 years of retention, on Swiss-hosted cloud infrastructure (rough order-of-magnitude figures, not vendor quotes):

Resource	Spec	Approx. monthly cost (CHF)
Postgres (managed, 8 vCPU / 32 GB / 500 GB SSD)	Single node	~250
Object storage (bronze, ~3 TB)	S3-compatible	~70
VM (watcher + ETL workers, 4 vCPU / 16 GB)	Always-on	~80
Power BI Pro (per user × 10)	License	~110
Total (50 apartments, 5-year horizon)	—	~510 CHF/month

Bottom line—The architecture scales linearly to ~50 apartments without structural changes. Beyond that, partitioning + object storage are straightforward extensions. Beyond ~500, the design needs a rethink (sharding, real ingestion queue). The biggest hidden cost at scale is operational, not infrastructure: monitoring, alerting, and on-call coverage. See Section 14 (Services & Processes) for the current monitoring story.

Architecture Decisions

ADR-001Python 3.11 + asyncio for ingestion

Context: Need to process 245k+ files from SMB with I/O-bound operations (file copy, DB writes).

Decision: Use Python with ThreadPoolExecutor for parallel file copy and ProcessPoolExecutor for CPU-bound JSON parsing. asyncio reserved for future real-time ingestion.

Consequence: Team familiarity, rich ecosystem (pandas, SQLAlchemy, paramiko). Acceptable performance for batch processing.

ADR-002PostgreSQL for Silver and Gold

Context: Need a relational DB that supports schema separation, OLAP queries, and Power BI connector.

Decision: PostgreSQL 15+ with separate silver and gold schemas in the same database. Two databases for dev/prod isolation.

Consequence: Free, multi-user, Power BI native connector, excellent OLAP performance. Hosted on the VM.

ADR-003Custom watcher over Airflow

Context: Evaluated Apache Airflow for orchestration, but single-VM deployment made it overkill.

Decision: Lightweight Python watcher script with 60s loop, prediction-based file discovery, and daily weather trigger.

Consequence: Zero infrastructure overhead, easy to understand and maintain. Trade-off: no built-in alerting or DAG visualization.

ADR-004File system for Bronze

Context: Need immutable, auditable raw data storage that is easy to inspect and replay.

Decision: Local NTFS file system with timestamped directory structure (YYYY/MM/DD/HH/).

Consequence: Zero overhead, easy to browse in File Explorer, cloud-agnostic. Trade-off: no built-in versioning.

ADR-005Self-contained installer over manual setup

Context: Deploying the project required ~10 manual steps (clone, .env, venv, deps, DB create, schemas, ETL, Power BI). Each step had failure modes a non-technical user couldn't recover from.

Decision: Build a client-side install wizard at /install. The wizard collects connection details, generates a single self-contained Python installer with the .env values baked in, and the user runs one command. The installer auto-mounts SMB on Windows, creates the DB + app user as admin, runs the full ETL, verifies row counts, and offers to open Power BI.

Consequence: Deploy-for-dummies story works in ~30-60 min start to finish. No credentials uploaded to any server (form is fully client-side). Installer is regenerated per-deployment so connection details aren't shared. Trade-off: more code to maintain (installer template + wizard), and the .pbix still needs manual data-source repointing on first open (documented in next steps).

ADR-006Two PostgreSQL roles (admin + app) created at install

Context: First install attempts failed with 'permission denied' because the app user couldn't CREATE SCHEMA, and silver tables ended up owned by postgres so clean_weather couldn't ALTER them.

Decision: The installer asks for both admin and app credentials. Admin is used only at install time (creates the DB, creates the app user, grants all privileges, transfers ownership of silver tables to the app user). Only the app user is written to .env. Admin credentials live exclusively in the downloaded installer script and are discarded after the run.

Consequence: Pipeline runs as a least-privilege user during normal operation. Admin secret never persists in .env. Operations that legitimately need admin (DB / user creation, schema ownership) work cleanly. Trade-off: deployer needs both credentials, and both must be entered in the wizard.

ADR-007Mask PII in gold dim_apartment, keep first-name pseudonym

Context: Power BI dashboards are read by non-technical users (residents, evaluators). The dim_apartment table imported from MySQL contains owner identifiers and building names that would leak personal data and physical location.

Decision: After populate_dimensions, sanitise dim_apartment.owner_user_id (NULL) and dim_apartment.building_name ('Building <id>' or NULL). Keep dim_apartment.name as-is (a common first name like 'jimmy'), because (a) under GDPR Art. 4(1) a common first name in isolation is not personal data, (b) Power BI RLS uses it as a stable filter key, (c) residents have given informed consent. The sanitisation always runs (no opt-in flag) since it only removes identifiers no analytic query needs.

Consequence: Default deployment is GDPR-friendly. Existing per-apartment Power BI roles keep working without .pbix edits. Trade-off: scaling to many apartments still requires per-apartment role definitions; section 20.5 documents the USERPRINCIPALNAME-based mapping pattern for > 5 apartments.

ADR-008Watcher script over Airflow (revisited)

Context: Reaffirming the orchestration choice now that the deploy story is in place. Adding Airflow at this stage would have meant: another DB (Airflow metastore), web UI, executor configuration, deployment complexity, dependency conflicts (Airflow pins old SQLAlchemy versions).

Decision: Keep the Python watcher as the single orchestrator. It is started once (manually or via the installer prompt) and runs forever, polling SMB every 60s and triggering the daily weather subprocess. Status is observable via install.log, scripts/status.py, and direct DB row counts.

Consequence: Zero additional infrastructure. Single failure point that is also trivially restartable. Trade-off: no built-in DAG visualization, retries, or scheduling beyond the daily weather hook. For this scale (1 VM, < 10 apartments), this is the right size.