docs / workflows

Silver → Gold ETL

Star schema population — dimensions from Silver metadata, fact tables from minute-grain aggregation of sensor_events. Idempotent upserts, safe to rerun.

01Silver inputsource

silver.sensor_events

15M+ rows — all sensor readings

apartment, room, sensor_type, field, value, timestamp

silver.dim_buildings

Building metadata (from MySQL)

houseName, id — enriches dim_apartment

silver.etl_watermark

Tracks processed files

Used by Bronze→Silver, not Gold (yet)

schema
02Schema creation — create_gold.pydone
create_gold.py

etl/silver_to_gold/create_gold.py — run once, idempotent

1.

Creates gold schema (requires DB_ADMIN_URL for privileges)

2.

Grants schema privileges to app user (DB_URL)

3.

Creates all dimension and fact tables with CREATE TABLE IF NOT EXISTS

4.

Creates indexes on date_key, apartment_key, room_key for fast BI queries

5.

Verifies all tables exist after creation

dimension tables (5)

table

grain

key format

source

dim_date

1 day

YYYYMMDD

sensor_events timestamps

dim_datetime

1 minute

YYYYMMDDHHMM

sensor_events timestamps

dim_apartment

apartment

serial

sensor_events + dim_buildings enrichment

dim_room

room

serial

sensor_events rooms per apartment

dim_device

device

serial

synthetic: apartment_room_sensortype

fact tables (4)

table

grain

measures

sensors

fact_energy_minute

1 min / device

power_w, energy_wh, energy_kwh, counter_total, cost_chf, is_valid

plug, consumption

fact_environment_minute

1 min / room

temperature_c, humidity_pct, co2_ppm, noise_db, pressure_hpa, window/door flags, is_anomaly

meteo, humidity, door, window

fact_presence_minute

1 min / room

motion_count, door_open_flag, presence_flag, presence_prob (ML — future)

motion, door

fact_device_health_day

1 day / device

battery_min_pct, battery_avg_pct, error_count, missing_readings, uptime_pct

all (battery field)

Usagepython etl/silver_to_gold/create_gold.py

populate
03Populate — populate_gold.pydone
populate_gold.py

etl/silver_to_gold/populate_gold.py — 9-step process, idempotent upserts

1.

dim_dateExtract unique dates from sensor_events, compute day_of_week, week, month, year, is_weekend

2.

dim_datetimeExtract minute-level timestamps, truncate to minute grain, compute time attributes

3.

dim_apartmentDistinct apartments from sensor_events, then enrich from silver.dim_buildings (building_name, building_id)

4.

dim_roomDistinct rooms per apartment, linked via apartment_key foreign key

5.

dim_deviceSynthetic device_id from apartment + room + sensor_type, linked to room_key

6.

fact_energy_minuteGROUP BY minute/device — MAX(power), MAX(total energy), convert Wh to kWh, BOOL_AND(not outlier) for is_valid

7.

fact_environment_minuteGROUP BY minute/room — MAX per metric, BOOL_OR for window/door open flags, BOOL_OR(outlier) for is_anomaly

8.

fact_presence_minuteGROUP BY minute/room — SUM(motion events), BOOL_OR(door open), presence_flag = motion OR door activity

9.

fact_device_health_dayGROUP BY day/device — MIN/AVG battery percentage from battery field readings

Pattern — All fact tables use GROUP BY date_trunc('minute', timestamp) with CASE WHEN pivots to extract specific fields. Upserts via ON CONFLICT DO UPDATE — safe to rerun at any time.

Usagepython etl/silver_to_gold/populate_gold.py

design
04Design decisions

Minute grain

Preserves temporal resolution for BI dashboards. Aggregation to hourly/daily done at query time.

Synthetic device_id

No real device IDs in sensor data — constructed as apartment_room_sensortype (e.g. jimmy_kitchen_plug).

Presence logic

Simple boolean: motion detected OR door opened = present. ML model (presence_prob) will replace this later.

Outlier passthrough

Silver flags outliers but keeps them. Gold uses is_valid / is_anomaly to let BI dashboards filter.

Full reload

Every run re-processes all Silver data (ON CONFLICT DO UPDATE). Future: watermark-based incremental.

output
05Gold layer — outputdone

5 dimension tables

datetime (minute), date (day), apartment, room, device

4 fact tables

energy, environment, presence (minute grain) + device health (day grain)