This repository contains my personal contributions and experimental work developed during my collaboration with the Artificial Intelligence Department (E3) at the Jožef Stefan Institute, as part of the HumAIne-JSI project.
Official repository: energy-ea
The project focuses on exploratory data analysis and classification for smart energy systems. It is part of the EU-funded HumAIne project, which aims to develop transparent, human-centered AI tools in the energy domain.
To be completed.
smart-energy-ml-analysis-jsi/
├── data/
├── docs/
├── figures/
├── models/
├── notebooks/
├── reports/
├── tables/
└── README.md
This guide helps you run everything today without MinIO.
Runs AL with entropy strategy, measures KPIs (sim calls, sim time), and saves CSVs.
```bash
python /mnt/data/run_simulated_active_learning.py
```
Outputs:
- `tables/metrics_simulated_entropy.csv`
- `tables/kpis_simulated_entropy.csv`
Interactive app showing learning curve vs random baseline and KPI cards.
```bash
streamlit run /mnt/data/streamlit_al_dashboard.py --server.headless true
```
Notes:
- The app automatically loads the dataset from either `data/simulation_security_labels_n-1.csv` or `/mnt/data/simulation_security_labels_n-1.csv`.
- Single run uses the simulator on demand; the baseline random run uses offline labels (no simulator calls).
Run an experiment grid over strategies, initial sizes, batch sizes, and iteration counts. Saves figures and CSV tables.
```bash
python -c "from al_experiment_code import run_experiment_grid; df=run_experiment_grid(csv_path='simulation_security_labels_n-1.csv', strategies=['entropy','uncertainty','margin','random'], initial_sizes=[10,20], batch_sizes=[5,10], iteration_counts=[10,20], test_size=0.1, random_state=42, figures_dir='figures', tables_dir='tables'); print(df.head())"
```
Outputs:
- `tables/experiment_results_summary.csv`
- `tables/experiment_iteration_metrics.csv`
- figures under `figures/` (if plotting is enabled in your code)
Reduce simulator usage/time via Active Learning (AL) while maintaining classification performance for N-1 security assessment.
- Dataset: `simulation_security_labels_n-1.csv` (secure/insecure)
- Digital twin: `digital_twin_ext_grid.json`
- Classifier: Random Forest (100 trees)
- Query strategies: entropy, uncertainty, margin, random
- AL loop with on-demand simulator labels (caching enabled)
- Total labeled samples
- Sample saving (%) = (1 - labeled / pool_size) × 100
- Simulator calls (cumulative)
- Simulator time (cumulative seconds)
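For reference, a minimal sketch of how these KPIs relate to each other (illustrative helper functions, not project code):

```python
# Illustrative helpers for the KPIs listed above; not the project's code.
def sample_saving_pct(labeled_count: int, pool_size: int) -> float:
    """Sample saving (%) = (1 - labeled / pool_size) * 100."""
    return (1.0 - labeled_count / pool_size) * 100.0

def avg_sim_sec_per_call(sim_time_sec_cum: float, sim_calls_cum: int) -> float:
    """Average simulator time per call from the cumulative counters."""
    return sim_time_sec_cum / max(sim_calls_cum, 1)

# e.g. labeling 450 of 9000 pool samples -> 95% sample saving
print(sample_saving_pct(450, 9000))
```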
- Single run: entropy vs baseline random (same init, batch, iterations)
- Grid: small grid (entropy, uncertainty, margin, random) with two initial sizes and two batch sizes
- Final accuracy (AL): ___%
- Final accuracy (baseline): ___%
- Labeled samples (AL): ___ / Pool: ___ → saving: ___%
- Simulator calls/time (AL): ___ / ___ s
- Observation about random vs sequential splits: ___
- Learning curve (accuracy vs iterations): AL vs baseline
- KPI table (per iteration): labeled_count, sim_calls_cum, sim_time_sec_cum
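If you want to produce the learning-curve figure yourself, a minimal plotting sketch follows; it assumes per-iteration CSVs with `iteration` and `accuracy` columns, and the baseline file name is hypothetical:

```python
# Sketch of how the learning-curve figure could be produced from the saved
# per-iteration CSVs. The column names ("iteration", "accuracy") and the
# baseline file name are assumptions; adjust them to the actual outputs.
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

al = pd.read_csv("tables/metrics_simulated_entropy.csv")
baseline = pd.read_csv("tables/metrics_simulated_random.csv")  # hypothetical file name

Path("figures").mkdir(exist_ok=True)
plt.plot(al["iteration"], al["accuracy"], marker="o", label="AL (entropy)")
plt.plot(baseline["iteration"], baseline["accuracy"], marker="s", label="Baseline (random)")
plt.xlabel("Iteration")
plt.ylabel("Validation accuracy")
plt.title("Learning curve: AL vs random baseline")
plt.legend()
plt.tight_layout()
plt.savefig("figures/learning_curve_al_vs_baseline.png", dpi=150)
```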
- Integrate MinIO writes (results + models)
- Prepare HumAIne dashboard binding
- Expand simulator parameterization (seasonality scenarios)
- Python
- Jupyter Notebooks
- Streamlit
- Git & GitHub
- VS Code
Gašper Leskovec
MSc student in Electrical Engineering (ICT) – University of Ljubljana
Contributor at E3, Jožef Stefan Institute
GitHub: @leskovecg
This guide explains what each script does, how they fit together, which functions matter, what they take as input and return as output, and how to run everything end‑to‑end. It’s written for first‑time readers and for you when you return to the project later.
You have two complementary ways to run experiments:
- Online (with simulator calls) — labels are obtained on demand by calling the digital-twin simulator.
  Entry point: `run_simulated_active_learning.py` → uses `active_learning_with_simulator.py` → calls `simulator_interface.py`.
- Offline (no simulator calls) — labels are taken from the CSV; used to benchmark Active Learning (AL) strategies quickly.
  Entry point: `al_experiment_code.py`.
For a UI, use `streamlit_al_dashboard.py` to run both modes from a dashboard and download results.
End‑to‑end online run that splits data, runs AL with simulator labels on demand, and saves results (CSV + XLSX).
Key responsibilities:
- Time-based split when a `timestamp` column exists: pool = past, validation = future (no overlap); falls back to a stratified split otherwise (see the sketch after this list).
- Feature whitelist to avoid leakage (e.g., keep only `load_*`, `gen_*`, `sgen_*` columns).
- Calls `active_learning_with_simulator.run_active_learning(simulate_on_demand=True)`.
- Writes per-iteration metrics and a KPI summary to disk.
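A minimal sketch of the time-based split described above, assuming the CSV has a `timestamp` column (variable and function names are illustrative, not the script's internals):

```python
# Illustrative sketch of the time-based split: pool = past, validation = future,
# with no overlap. Not the actual code of run_simulated_active_learning.py.
import pandas as pd

def time_based_split(df: pd.DataFrame, test_size: float = 0.1):
    """Oldest rows become the pool, newest rows become the validation set."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    cut = int(len(df) * (1 - test_size))
    return df.iloc[:cut], df.iloc[cut:]   # (pool, validation)

df = pd.read_csv("data/simulation_security_labels_n-1.csv", parse_dates=["timestamp"])
pool_df, val_df = time_based_split(df, test_size=0.1)
```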
Main CLI arguments
```text
--data <path>         Path to CSV (must contain 'status' = secure/insecure)
--strategy <str>      entropy | uncertainty | margin | random
--init <int>          initial labeled size
--batch <int>         queries per iteration
--iters <int>         number of AL iterations
--test-size <float>   validation fraction (0–1)
--seed <int>          random seed
--avg-sim-sec <float> optional, to compute estimated simulator time
--tables-dir <path>   output folder for CSV/XLSX
```
Outputs created
- `tables/metrics_simulated_<strategy>_init<...>_b<...>_it<...>_<timestamp>.csv` (per-iteration)
- a corresponding `.xlsx` with sheets `per_iteration` and `kpi_summary`
Implements the Active Learning loop that can query the simulator only for the samples you choose.
Core ideas:
- Train `RandomForestClassifier(class_weight="balanced")` on the currently labeled pool.
- Score unlabeled points with a strategy (`uncertainty`, `entropy`, `margin`, `random`).
- Pick the top-K, query their labels via the simulator if `simulate_on_demand=True`, and add them to the labeled set.
- Track metrics over iterations (Accuracy, Macro-Precision/Recall/F1, safe ROC-AUC) and KPI counters (sim calls/time, wall time, etc.).
Key functions
`compute_query_scores(proba, strategy) -> np.ndarray`
- Input: `proba` (N×2 class probabilities), `strategy` ∈ {uncertainty, entropy, margin, random}
- Output: informativeness score for each unlabeled sample (higher = more informative)
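For intuition, a sketch of how these four strategies can be scored from `predict_proba` output; it follows the description above but is not necessarily the exact implementation:

```python
# Sketch of uncertainty-based query scoring consistent with the description
# above (higher score = more informative). Not necessarily the exact code in
# active_learning_with_simulator.py.
import numpy as np

def compute_query_scores(proba: np.ndarray, strategy: str) -> np.ndarray:
    p = np.clip(proba, 1e-12, 1.0)                  # avoid log(0)
    if strategy == "uncertainty":                   # 1 - max class probability
        return 1.0 - p.max(axis=1)
    if strategy == "entropy":                       # predictive entropy
        return -(p * np.log(p)).sum(axis=1)
    if strategy == "margin":                        # small margin = informative
        part = np.sort(p, axis=1)
        return -(part[:, -1] - part[:, -2])
    if strategy == "random":                        # uniform random scores
        return np.random.default_rng().random(len(p))
    raise ValueError(f"Unknown strategy: {strategy}")
```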
```python
run_active_learning(X_pool, y_pool, X_val, y_val, strategy,
                    initial_size, batch_size, iterations,
                    random_state=42,
                    simulate_on_demand=False,
                    avg_sim_time_sec=None)
    -> (metrics_per_iteration, duration_wall_sec, kpi_summary)
```
- If `simulate_on_demand=True`, labels for selected samples are fetched via the simulator (cached).
- Returns:
  - `metrics_per_iteration`: list of dicts with metrics + KPI counters per iteration
  - `duration_wall_sec`: total wall-clock time
  - `kpi_summary`: final snapshot (accuracy/AUC, how many labels were used, #sim calls, measured/estimated sim time, etc.)
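An example call based on the signature above; the data loading and column selection here are illustrative, since the entry-point script normally handles them, and the label mapping is an assumption:

```python
# Example call based on the signature above. The feature/label preparation is
# illustrative; run_simulated_active_learning.py normally handles the split
# and column selection (the binary label mapping below is an assumption).
import pandas as pd
from sklearn.model_selection import train_test_split
from active_learning_with_simulator import run_active_learning

df = pd.read_csv("data/simulation_security_labels_n-1.csv")
y = (df["status"] == "insecure").astype(int)       # assumed binary mapping
X = df.filter(regex=r"^(load_|gen_|sgen_)")         # exogenous features only

X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

metrics, wall_sec, kpis = run_active_learning(
    X_pool, y_pool, X_val, y_val, strategy="entropy",
    initial_size=100, batch_size=50, iterations=40,
    random_state=42, simulate_on_demand=True)
print(kpis)
```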
Thin wrapper around the pandapower model of your grid (digital twin), with robust path resolution and LRU‑cached queries.
Key pieces
`query_simulator(sample: dict) -> "secure" | "insecure"`
Runs the base case + N-1 contingencies; returns "secure" only if all checks pass (line loading within 100%, bus voltages within [0.9, 1.1] pu).

`query_simulator_cached(sample: dict) -> "secure" | "insecure"`
Adds a stable cache key, which massively reduces repeated simulator work.
Inputs expected in sample
- Feature names like `load_<i>_p_mw`, `gen_<i>_p_mw`, `sgen_<i>_p_mw` mapped to floats.
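A simplified sketch of the base-case + N-1 check and the stable cache key, using the limits listed above (100% line loading, [0.9, 1.1] pu); this is not the actual `simulator_interface.py` code, and the step that applies the sample's setpoints to the net is omitted:

```python
# Simplified sketch (not the actual simulator_interface.py): base case + N-1
# line-outage check with the limits described above, plus a stable cache key
# for query_simulator_cached. Applying the sample's setpoints is omitted.
from functools import lru_cache
import pandapower as pp

net = pp.from_json("data/digital_twin_ext_grid.json")  # path assumed

def _state_ok(net) -> bool:
    # line loading within 100%, bus voltages within [0.9, 1.1] pu
    return (net.res_line.loading_percent.max() <= 100.0
            and net.res_bus.vm_pu.between(0.9, 1.1).all())

def _run_security_check(sample: dict) -> str:
    # 1) apply the sample's load_/gen_/sgen_ setpoints to `net` here (omitted)
    # 2) base case
    pp.runpp(net)
    if not _state_ok(net):
        return "insecure"
    # 3) N-1: take each line out of service in turn and rerun the power flow
    for line_idx in net.line.index:
        net.line.at[line_idx, "in_service"] = False
        try:
            pp.runpp(net)
            ok = _state_ok(net)
        except Exception:  # e.g. load flow did not converge
            ok = False
        finally:
            net.line.at[line_idx, "in_service"] = True
        if not ok:
            return "insecure"
    return "secure"

@lru_cache(maxsize=None)
def _cached(key: tuple) -> str:
    return _run_security_check(dict(key))

def query_simulator_cached(sample: dict) -> str:
    # dicts are unhashable, so build a stable, hashable cache key
    return _cached(tuple(sorted(sample.items())))
```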
Implements offline AL sweeps (fast baselines). Labels are read from CSV.
Highlights:
- `load_dataset()` parses & sorts by `timestamp` (if present), maps `status` → binary, and drops the target/timestamp from the features.
- Three split modes: random (stratified), sequential, and time-based (cut at a quantile).
- `check_split_diagnostics()` prints class balance and time-range info (helps debug AUC issues).
- `run_active_learning()` (offline variant) returns per-iteration metrics and duration.
- `run_experiment_grid()` runs a parameter grid (strategies × init × batch × iters × split), then saves:
  - `tables/active_learning_results_<timestamp>.csv` (summary)
  - `tables/active_learning_results_<timestamp>.xlsx` (summary + `per_iteration` sheet)
  - `tables/al_metrics_per_iteration_<timestamp>.csv` (full curves)
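A sketch of the kind of output `check_split_diagnostics()` is described as printing (class balance and time range per split); the helper below is illustrative, not the project function:

```python
# Illustrative split diagnostics (class balance and time range per split);
# not the actual check_split_diagnostics() implementation.
import pandas as pd

def print_split_diagnostics(name: str, df: pd.DataFrame) -> None:
    balance = df["status"].value_counts(normalize=True).round(3).to_dict()
    print(f"{name}: n={len(df)}, class balance={balance}")
    if "timestamp" in df.columns:
        print(f"  time range: {df['timestamp'].min()} .. {df['timestamp'].max()}")

# usage (pool_df / val_df come from whichever split mode you chose):
# print_split_diagnostics("pool", pool_df)
# print_split_diagnostics("validation", val_df)
```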
A simple Streamlit app to run either mode interactively and download results.
- Mode 1: Single Run (Simulator) — performs a stratified split by the true labels, then calls the online AL loop.
- Mode 2: Offline Grid — lets you choose strategies and grid parameters, runs `run_experiment_grid()`, previews a summary, and provides quick comparison charts.
Run it
```bash
streamlit run streamlit_al_dashboard.py
```
Your CSV is expected to include:
- a `status` column with values `"secure"` or `"insecure"` (mandatory)
- an optional `timestamp` (recommended for strict time-based evaluation)
- exogenous features such as `load_*`, `gen_*`, `sgen_*`, … (and other domain inputs like `pv_*`, `wind_*`, `weather_*` if you add them)
Important: We explicitly drop `status` and `timestamp` from the model's input features to avoid leakage.
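A minimal pandas sketch of this leakage guard combined with the prefix whitelist; the helper name and fallback message are illustrative:

```python
# Illustrative whitelist-based feature selection that drops the label and
# timestamp to avoid leakage. The prefixes match the columns listed above;
# the helper name is not the script's actual function.
import pandas as pd

WHITELIST_PREFIXES = ("load_", "gen_", "sgen_")   # extend with "pv_", "wind_", ...

def select_feature_columns(df: pd.DataFrame) -> list[str]:
    features = [c for c in df.columns
                if c.startswith(WHITELIST_PREFIXES) and c not in ("status", "timestamp")]
    if not features:  # fall back to "all except labels/timestamp"
        print("Warning: whitelist matched no columns, falling back to all features")
        features = [c for c in df.columns if c not in ("status", "timestamp")]
    return features

df = pd.read_csv("data/simulation_security_labels_n-1.csv")
X = df[select_feature_columns(df)]
y = (df["status"] == "insecure").astype(int)
```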
```bash
python run_simulated_active_learning.py \
  --data "C:\path\to\simulation_security_labels_n-1.csv" \
  --strategy entropy \
  --init 100 \
  --batch 50 \
  --iters 40 \
  --test-size 0.1 \
  --seed 42 \
  --avg-sim-sec 2.3 \
  --tables-dir "tables"
```

```bash
python al_experiment_code.py
```
(Edit the `__main__` constants or call `run_experiment_grid()` from another script/notebook.)
```bash
streamlit run streamlit_al_dashboard.py
```
Pick a mode in the sidebar, set parameters, click Run, and download CSV/XLSX.
Per iteration you get:
- Accuracy, Macro‑Precision, Macro‑Recall, Macro‑F1
- ROC-AUC (safe) — returns `NaN` when only one class is present in the validation set, to avoid misleading warnings (see the sketch below)
- KPI counters (online mode) — cumulative simulator calls, simulator time (measured), estimated simulator time (optional), training time, wall time, and total labeled count
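A sketch of what such a "safe" ROC-AUC helper can look like (assumed behaviour: return `NaN` instead of raising or warning when only one class is present):

```python
# Sketch of a "safe" ROC-AUC helper consistent with the behaviour described
# above; not necessarily the project's exact implementation.
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_score) -> float:
    if len(np.unique(y_true)) < 2:   # only one class present in validation
        return float("nan")
    return roc_auc_score(y_true, y_score)
```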
Goal of AL: achieve similar accuracy with far fewer labeled samples, translating to lower simulator time.
- Add new AL strategies: implement a scorer in `compute_query_scores()` and add it to the accepted choice list.
- Add more exogenous features: extend the whitelist in `run_simulated_active_learning._select_feature_columns()` (e.g., `"pv_"`, `"wind_"`, `"weather_"`).
- Swap models: replace `RandomForestClassifier` with your model (keep `class_weight="balanced"` if classes are skewed); see the sketch below.
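For example, a sketch of swapping in a different scikit-learn classifier; `LogisticRegression` is just an illustration, and any estimator exposing `predict_proba()` fits the query strategies:

```python
# Illustrative model swap: keep class_weight="balanced" for skewed classes,
# and make sure the estimator exposes predict_proba() for the AL scoring.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
# Plug this in wherever RandomForestClassifier is constructed in the AL loop.
```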
- ROC‑AUC is NaN — validation contains only one class; use a time split with enough positives/negatives or expand the validation window.
- `digital_twin_ext_grid.json` not found — check the `data/` path; the simulator loader tries multiple locations, but you may need to drop the JSON into `data/`.
- No features after whitelist — you'll see a warning and the code will fall back to "all except labels/timestamp". Prefer to fix the whitelist so only true exogenous inputs remain.
- AL (Active Learning) — iteratively selects the most informative samples to label next.
- Uncertainty/Entropy/Margin — three standard uncertainty‑based selection heuristics.
- On‑demand labels — ground truth obtained by calling the simulator as needed, not pre‑labeling everything.
- N‑1 — grid security check under single‑element outages (lines/generators).
Author notes
- Model defaults: `RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)`
- All outputs are timestamped to keep experiment logs clean and comparable.
- Caching in the simulator layer dramatically speeds up repeated queries with identical features.
Happy experimenting! 🚀