
Commit 2415bc3

Merge pull request #20 from regulate-tech/add-demo
Add demo - new feature store code
2 parents c59ba5a + 11fef8c commit 2415bc3

11 files changed (+925, -0 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -26,3 +26,5 @@ __pycache__/

 # Visual Studio working files
 .vs
+/demo-feature-store/diabetes_model.pkl
+/demo-feature-store/feature_list.pkl

demo-feature-store/README.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@

-----

# NHS Diabetes Risk Calculator (Feature Store Demo)

This project is a functional, interactive web application that demonstrates the **Feature Store** approach in a healthcare context. It uses a synthetic dataset of patient records to train a diabetes risk model and provides a simple web interface for real-time predictions.

The core idea is to show how a feature store separates the **data engineering** (creating features like "BMI") from the **data science** (training models) and the **application** (getting a real-time risk score). This creates a "single source of truth" for features, so the feature definitions used to train a model are the same ones used for a real-time prediction.

This demo uses:

* **Feast:** An open-source feature store.
* **Flask:** A lightweight Python web server.
* **Scikit-learn:** For training the risk model.
* **Parquet / SQLite:** The offline (training) and online (real-time) stores, respectively.
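
To make the "single source of truth" idea concrete, here is a minimal sketch in plain Python. It is not how Feast works internally. The function `build_features` and the record fields `weight_kg`, `height_m`, and `age` are hypothetical names chosen for illustration. The point is only that training and serving both go through one shared feature definition, so they cannot drift apart.

```python
def build_features(record: dict) -> dict:
    """One shared definition of the model inputs (field names are hypothetical)."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    return {"bmi": round(bmi, 1), "age": record["age"]}


# Historical records feed training...
history = [
    {"weight_kg": 82.0, "height_m": 1.75, "age": 54},
    {"weight_kg": 61.0, "height_m": 1.68, "age": 37},
]
training_rows = [build_features(record) for record in history]

# ...and a live record at prediction time goes through the same function.
live_row = build_features({"weight_kg": 90.0, "height_m": 1.80, "age": 61})

print(sorted(training_rows[0]) == sorted(live_row))  # same feature names in both paths
```

In the demo itself this role is played by the Feast feature definitions rather than by a Python function.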

## System Architecture

The application demonstrates the full MLOps loop, from data generation to real-time inference.

```mermaid
flowchart TD
    subgraph DataEngineering["1. Data Engineering (lib.py / generate_data.py)"]
        A["Add New Patients"] -- "Appends to" --> B(Offline Store patient_gp_data.parquet)
        B -- "feast materialize" --> C(Online Store online_store.db)
    end

    subgraph DataScience["2. Data Science (lib.py / train_model.py)"]
        B -- "get_historical_features" --> D["Train New Model (train_and_save_model)"]
        D -- "Saves" --> E("Risk Model diabetes_model.pkl")
    end

    subgraph Application["3. Real-time Application (app.py)"]
        F(User Web Browser) <--> G["Flask Web App app.py"]
        G -- "Loads" --> E
        G -- "get_online_features" --> C
        G -- "Displays Prediction" --> F
    end

    F -- "Clicks Button" --> A
    F -- "Clicks Button" --> D
```

-----

## Project Structure

Your repository should contain the following key files:

* **`app.py`**: The main Flask web server.
* **`lib.py`**: A library containing all core logic for data generation and model training.
* **`check_population.py`**: A utility script to analyze the risk distribution of your patient database.
* **`generate_data.py`**: A helper script to manually run data generation (optional, can be done from the app).
* **`train_model.py`**: A helper script to manually run model training (optional, can be done from the app).
* **`templates/index.html`**: The web page template.
* **`nhs_risk_calculator/`**: The Feast feature store repository.
    * **`definitions.py`**: The formal definitions of our patient features.
    * **`feature_store.yaml`**: The Feast configuration file.
* **`.gitignore`**: (Recommended) Excludes generated files (`*.db`, `*.pkl`, `*.parquet`, `.venv/`) from version control.

-----

## Local PC Setup

Follow these steps to get the application running locally.

1. **Clone the Repository:**

    ```bash
    git clone <your-repo-url>
    cd <your-repo-name>
    ```

2. **Create a Virtual Environment:**

    ```bash
    python -m venv .venv
    ```

3. **Activate the Environment:**

    * On Windows:
      ```bash
      .venv\Scripts\activate
      ```
    * On macOS/Linux:
      ```bash
      source .venv/bin/activate
      ```

4. **Install Required Packages:**

    ```bash
    pip install feast pandas faker scikit-learn flask joblib
    ```

5. **Generate Initial Data:**
    You must create an initial dataset before the app can run. This script creates the first batch of patients, registers them with Feast, and populates the databases.

    ```bash
    python generate_data.py
    ```

6. **Train the First Model:**
    Now that you have data, you must train the first version of the risk model.

    ```bash
    python train_model.py
    ```

7. **Run the Web App:**
    You're all set. Start the Flask server:

    ```bash
    python app.py
    ```

    Now, open `http://127.0.0.1:5000` in your web browser.
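
If the app starts but reports that the model is not loaded, the most likely cause is that step 6 was skipped. The short script below is one way to check for the two model artifacts before launching. The file names come from this commit's `.gitignore`. The helper `missing_artifacts` and the assumption that the files sit in the current directory are illustrative only; the real locations are set by `MODEL_FILE` and `FEATURES_FILE` in `lib.py`.

```python
from pathlib import Path

# File names taken from this commit's .gitignore; the directory is an assumption.
EXPECTED_ARTIFACTS = ("diabetes_model.pkl", "feature_list.pkl")


def missing_artifacts(base_dir, names=EXPECTED_ARTIFACTS):
    """Return the expected files that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts(".")
    if missing:
        print("Missing:", ", ".join(missing), "- run train_model.py first.")
    else:
        print("Model artifacts found; the app should be able to load them.")
```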

-----

## Using the Application

The web app provides an interactive way to simulate the MLOps lifecycle.

### 1. Calculate Patient Risk (The "GP View")

This is the main function of the app.

* **Action:** Enter a Patient ID (e.g., any number from 1 to 500) and click "Calculate Risk".
* **What it does:** The app queries the **online store (SQLite)** for the *latest* features for that patient, feeds them into the loaded **model (.pkl file)**, and displays the predicted probability together with its risk band (LOW/MEDIUM/HIGH).
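
The banding step can be sketched on its own. The two threshold values below are the ones defined in `predict()` in `app.py`. The standalone helper `risk_band` is a hypothetical extraction for illustration, since the app performs this comparison inline.

```python
HIGH_RISK_THRESHOLD = 0.35    # same value as in app.py
MEDIUM_RISK_THRESHOLD = 0.15  # same value as in app.py


def risk_band(probability: float) -> str:
    """Map a predicted probability of diabetes to a risk band."""
    if probability >= HIGH_RISK_THRESHOLD:
        return "HIGH RISK"
    if probability >= MEDIUM_RISK_THRESHOLD:
        return "MEDIUM RISK"
    return "LOW RISK"


for probability in (0.08, 0.285, 0.35):
    print(f"{round(probability * 100, 1)}% - {risk_band(probability)}")
```

Both comparisons use `>=`, so a probability of exactly 15% is MEDIUM and exactly 35% is HIGH.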

### 2. Add New Patients (The "Data Engineering" View)

This simulates new patient data arriving in the health system.

* **Action:** Click the "Add 500 New Patients" button.
* **What it does:**
    1. Generates 500 new synthetic patients and appends them to the **offline store (Parquet file)**.
    2. Runs `feast materialize` to scan the offline store and update the **online store (SQLite)** with the latest features for all patients (including the new ones).
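
Conceptually, materialization reduces an append-only history to "the newest row per patient". The sketch below shows that reduction in plain Python. It is a simplification, not Feast's implementation, which also handles time windows and TTLs. The column names `event_timestamp` and `bmi` and the helper `materialize` are assumptions made for the example.

```python
from datetime import datetime

# Append-only "offline" history; column names are hypothetical.
offline_rows = [
    {"patient_id": 10, "event_timestamp": datetime(2024, 1, 1), "bmi": 27.1},
    {"patient_id": 10, "event_timestamp": datetime(2024, 6, 1), "bmi": 28.4},
    {"patient_id": 11, "event_timestamp": datetime(2024, 3, 1), "bmi": 22.0},
]


def materialize(rows: list) -> dict:
    """Keep only the most recent row per patient, keyed by patient_id."""
    online = {}
    for row in rows:
        current = online.get(row["patient_id"])
        if current is None or row["event_timestamp"] > current["event_timestamp"]:
            online[row["patient_id"]] = row
    return online


online_store = materialize(offline_rows)
print(online_store[10]["bmi"])  # the June reading, not the January one
```

The offline history keeps every row for training, while the online view holds one row per patient for fast lookups.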

### 3. Retrain Risk Model (The "Data Science" View)

This simulates a data scientist updating the risk model with new data.

* **Action:** Click the "Retrain Risk Model" button.
* **What it does:**
    1. Fetches the *entire patient history* from the **offline store (Parquet file)**.
    2. Generates `True`/`False` diabetes labels for each patient from their risk factors.
    3. Trains a *new* `LogisticRegression` model on this fresh, complete dataset.
    4. Saves the new model over the old `diabetes_model.pkl` and reloads it into the app's memory.
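
The last step, overwriting a model file that a running server also reads, is worth a short illustration. The sketch below writes to a temporary file and then swaps it into place, so a reader never sees a half-written file. It uses the standard-library `pickle` as a stand-in for `joblib`, and a plain dictionary as a stand-in for the model. Whether `lib.py` writes atomically is not shown in this commit, so treat `save_atomically` as a hypothetical helper rather than the demo's actual behaviour.

```python
import os
import pickle
import tempfile


def save_atomically(obj, path: str) -> None:
    """Write to a temporary file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            pickle.dump(obj, tmp_file)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


def load(path: str):
    with open(path, "rb") as model_file:
        return pickle.load(model_file)


with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "diabetes_model.pkl")
    save_atomically({"version": 1}, model_path)  # stand-in for the old model
    save_atomically({"version": 2}, model_path)  # the "retrained" model replaces it
    reloaded = load(model_path)

print(reloaded)
```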

-----

## Test the Full Loop (A "What If" Scenario)

This is the best way to see the whole system in action. The predictions you see depend on two things: the **Patient's Data** and the **Model's "Brain"** (its fitted coefficients). You can change both.

### The Experiment

Follow these steps to see how a prediction can change for the *same patient*.

1. **Get a Baseline:**

    * Run the app (`python app.py`).
    * Enter Patient ID **10**.
    * Note their features (e.g., BMI, Age) and their risk (e.g., **28.5% - MEDIUM RISK**).

2. **Check the Population:**

    * In your terminal (while the app is still running), run the `check_population.py` script:
      ```bash
      python check_population.py
      ```
    * Note the total number of HIGH risk patients (e.g., `HIGH RISK: 212`).

3. **Add More Data:**

    * Go back to the browser and click the **"Add 500 New Patients"** button.
    * After it reloads, click it **again**. You have now added 1,000 new patients to the database.
    * **Test Patient 10 again:** Enter Patient ID **10**. Their risk score will be **identical (28.5% - MEDIUM RISK)**.
    * **Why?** Because their personal data hasn't changed, and the *model is still the same old one*.

4. **Retrain the Model:**

    * Now, click the **"Retrain Risk Model"** button.
    * The app will now "learn" from the *entire* database, including the 1,000 new patients you added. This new data will slightly change the model's estimate of how features (like BMI) correlate with risk.

5. **See the Change:**

    * **Test Patient 10 one last time:** Enter Patient ID **10**.
    * Their risk score will usually have changed (e.g., it might now be **30.1% - MEDIUM RISK**).
    * **Why?** Their personal data is the same, but the **model's "brain" has been updated**. It was fitted on more data, so its coefficients differ and its prediction for the *same patient* is now different.

6. **Check the Population Again:**

    * Run `python check_population.py` one more time.
    * You will see that the number of LOW/MEDIUM/HIGH risk patients has changed, reflecting both the larger population and the new model's predictions across it.
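
The experiment above can be reproduced in miniature without the app. The sketch below fits a one-feature logistic regression by gradient descent, first on a small dataset and then on a larger one, and scores the same input with both models. The data, the scaled feature `patient_x`, and the helpers `fit` and `predict` are all invented for illustration; the demo itself uses scikit-learn's `LogisticRegression`.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, epochs=2000, learning_rate=0.1):
    """Fit a one-feature logistic regression with batch gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, label in rows:
            error = sigmoid(weight * x + bias) - label
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(rows)
        bias -= learning_rate * grad_b / len(rows)
    return weight, bias


def predict(params, x: float) -> float:
    weight, bias = params
    return sigmoid(weight * x + bias)


# (scaled feature, has_diabetes) pairs; all values are made up.
initial_rows = [(-1.0, 0), (-0.5, 0), (0.0, 0), (0.5, 1), (1.0, 0), (1.5, 1)]
new_rows = [(0.2, 1), (0.4, 1), (-0.8, 0), (1.2, 1)]

patient_x = 0.6  # the patient's own data never changes

old_model = fit(initial_rows)
new_model = fit(initial_rows + new_rows)

print(f"before retraining: {predict(old_model, patient_x):.3f}")
print(f"after retraining:  {predict(new_model, patient_x):.3f}")
```

The input is identical in both calls; only the fitted parameters differ, which is the same effect the experiment shows for Patient 10.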

demo-feature-store/app.py

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@

import joblib
import pandas as pd
from flask import Flask, request, render_template, flash, redirect, url_for
from feast import FeatureStore
import os

from lib import (
    generate_and_save_data,
    run_feast_commands,
    train_and_save_model,
    MODEL_FILE,
    FEATURES_FILE,
    FEAST_REPO_PATH
)

app = Flask(__name__)
app.secret_key = os.urandom(24)

model = None
feature_list = None
feast_features = []
store = None

def load_resources():
    global model, feature_list, feast_features, store

    try:
        model = joblib.load(MODEL_FILE)
        feature_list = joblib.load(FEATURES_FILE)
        feast_features = [f"gp_records:{name}" for name in feature_list]
        print(f"Model and feature list loaded. Features: {feature_list}")
    except FileNotFoundError:
        print("WARNING: Model or feature list not found.")
        print("Please run train_model.py to generate them.")
        model = None
        feature_list = None

    try:
        store = FeatureStore(repo_path=FEAST_REPO_PATH)
        print("Connected to Feast feature store.")
    except Exception as e:
        print(f"FATAL: Could not connect to Feast feature store: {e}")
        store = None


@app.route('/')
def home():
    return render_template('index.html')


@app.route('/predict', methods=['POST'])
def predict():
    """Handles the form submission and returns a prediction."""

    # Two thresholds split the predicted probability into three risk bands.
    HIGH_RISK_THRESHOLD = 0.35    # (35%)
    MEDIUM_RISK_THRESHOLD = 0.15  # (15%)

    if not model or not store or not feature_list:
        return render_template('index.html',
                               error="Server error: Model or feature store not loaded.")

    patient_id_str = request.form.get('patient_id', '').strip()
    # isdecimal() accepts only characters that int() can parse; isdigit() also
    # accepts characters such as superscript digits, which make int() raise.
    if not patient_id_str.isdecimal():
        return render_template('index.html', error="Invalid Patient ID. Must be a number.")

    patient_id = int(patient_id_str)

    try:
        entity_rows = [{"patient_id": patient_id}]
        online_features_dict = store.get_online_features(
            features=feast_features,
            entity_rows=entity_rows
        ).to_dict()

        features_df = pd.DataFrame(online_features_dict)

        # The online store is expected to echo the requested entity key back even
        # for unknown patients, so a row whose features are all null is also
        # treated as "not found".
        if (features_df.empty
                or features_df['patient_id'][0] is None
                or features_df[feature_list].isnull().all(axis=1).iloc[0]):
            return render_template('index.html',
                                   error=f"Patient ID {patient_id} not found.")

        X_predict = features_df[feature_list]
        prediction_error = None

        if X_predict.isnull().values.any():
            prediction_error = "Patient data is incomplete. Prediction may be inaccurate."
            X_predict = X_predict.fillna(0)  # Fill with 0 for demo

        # 1. Get the probability of "True" (diabetes)
        probability_true = model.predict_proba(X_predict)[0][1]

        # 2. Compare against the thresholds
        if probability_true >= HIGH_RISK_THRESHOLD:
            prediction_text = "HIGH RISK"
        elif probability_true >= MEDIUM_RISK_THRESHOLD:
            prediction_text = "MEDIUM RISK"
        else:
            prediction_text = "LOW RISK"

        # 3. Format the results
        risk_percent = round(probability_true * 100, 1)

        return render_template(
            'index.html',
            patient_id=patient_id,
            patient_data=X_predict.to_dict('records')[0],
            prediction=prediction_text,
            probability=risk_percent,
            error=prediction_error
        )

    except Exception as e:
        return render_template('index.html', error=f"An error occurred: {e}")


@app.route('/add-data', methods=['POST'])
def add_data():
    """Generates new data, materializes it, and reloads the store."""
    global store
    try:
        generate_and_save_data()
        run_feast_commands()

        print("Reloading feature store...")
        store = FeatureStore(repo_path=FEAST_REPO_PATH)

        flash("Successfully added 500 new patients and updated feature store.", "success")
    except Exception as e:
        print(f"Error adding data: {e}")
        flash(f"Error adding data: {e}", "error")

    return redirect(url_for('home'))


@app.route('/retrain-model', methods=['POST'])
def retrain_model():
    """Retrains the model, saves it, and reloads it into the app."""
    global model, feature_list, feast_features, store

    try:
        train_and_save_model()

        model = joblib.load(MODEL_FILE)
        feature_list = joblib.load(FEATURES_FILE)
        feast_features = [f"gp_records:{name}" for name in feature_list]

        print("Reloading feature store...")
        store = FeatureStore(repo_path=FEAST_REPO_PATH)

        flash("Successfully retrained and reloaded the risk model.", "success")
    except Exception as e:
        print(f"Error retraining model: {e}")
        flash(f"Error retraining model: {e}", "error")

    return redirect(url_for('home'))


if __name__ == '__main__':
    load_resources()
    app.run(debug=True)
