|
| 1 | + |
| 2 | +----- |
| 3 | + |
| 4 | +# NHS Diabetes Risk Calculator (Feature Store Demo) |
| 5 | + |
| 6 | +This project is a functional, interactive web application that demonstrates the **Feature Store** approach in a healthcare context. It uses a synthetic dataset of patient records to train a diabetes risk model and provides a simple web interface to get real-time predictions. |
| 7 | + |
| 8 | +The core idea is to show how a feature store separates the **data engineering** (creating features like "BMI") from the **data science** (training models) and the **application** (getting a real-time risk score). This creates a "single source of truth" for features, ensuring that the data used to train a model is the same as the data used for a real-time prediction. |
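A minimal, library-free sketch of that "single source of truth" idea follows. It does not use Feast, and every name in it (`OFFLINE_ROWS`, `bmi`, `training_features`, `online_features`, the weight and height fields) is a hypothetical stand-in rather than something taken from this repository. The point it illustrates is that the feature is defined once and both the training path and the serving path call that same definition.

```python
from datetime import datetime

# Hypothetical raw records; the real project keeps these in a Parquet file.
OFFLINE_ROWS = [
    {"patient_id": 10, "weight_kg": 82.0, "height_m": 1.75, "ts": datetime(2024, 1, 1)},
    {"patient_id": 10, "weight_kg": 85.0, "height_m": 1.75, "ts": datetime(2024, 6, 1)},
    {"patient_id": 11, "weight_kg": 60.0, "height_m": 1.60, "ts": datetime(2024, 3, 1)},
]


def bmi(row):
    """Single feature definition shared by training and serving."""
    return round(row["weight_kg"] / row["height_m"] ** 2, 1)


def training_features(rows):
    """Offline path: compute the feature for every historical row."""
    return [{"patient_id": r["patient_id"], "ts": r["ts"], "bmi": bmi(r)} for r in rows]


def online_features(rows, patient_id):
    """Online path: compute the feature for the patient's latest row only."""
    patient_rows = (r for r in rows if r["patient_id"] == patient_id)
    latest = max(patient_rows, key=lambda r: r["ts"])
    return {"patient_id": patient_id, "bmi": bmi(latest)}


history = training_features(OFFLINE_ROWS)
serving = online_features(OFFLINE_ROWS, 10)
```

Because both paths go through `bmi()`, the value served for patient 10 matches the value the model would have seen for that same row during training.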
| 9 | + |
| 10 | +This demo uses: |
| 11 | + |
| 12 | + * **Feast:** An open-source feature store. |
| 13 | + * **Flask:** A lightweight Python web server. |
| 14 | + * **Scikit-learn:** For training the risk model. |
| 15 | + * **Parquet / SQLite:** As the offline (training) and online (real-time) databases. |
| 16 | + |
| 17 | +## System Architecture |
| 18 | + |
| 19 | +The application demonstrates the full MLOps loop, from data generation to real-time inference. |
| 20 | + |
| 21 | +```mermaid |
| 22 | +flowchart TD |
| 23 | + subgraph DataEngineering["1. Data Engineering (lib.py / generate_data.py)"] |
| 24 | + A["Add New Patients"] -- "Appends to" --> B(Offline Store patient_gp_data.parquet) |
| 25 | + B -- "feast materialize" --> C(Online Store online_store.db) |
| 26 | + end |
| 27 | + |
| 28 | + subgraph DataScience["2. Data Science (lib.py / train_model.py)"] |
| 29 | + B -- "get_historical_features" --> D["Train New Model (train_and_save_model)"] |
| 30 | + D -- "Saves" --> E("Risk Model diabetes_model.pkl") |
| 31 | + end |
| 32 | + |
| 33 | + subgraph Application["3. Real-time Application (app.py)"] |
| 34 | + F(User Web Browser) <--> G["Flask Web App app.py"] |
| 35 | + G -- "Loads" --> E |
| 36 | + G -- "get_online_features" --> C |
| 37 | + G -- "Displays Prediction" --> F |
| 38 | + end |
| 39 | + |
| 40 | + F -- "Clicks Button" --> A |
| 41 | + F -- "Clicks Button" --> D |
| 42 | +``` |
| 43 | + |
| 44 | +----- |
| 45 | + |
| 46 | +## Project Structure |
| 47 | + |
| 48 | +Your repository should contain the following key files: |
| 49 | + |
| 50 | + * **`app.py`**: The main Flask web server. |
| 51 | + * **`lib.py`**: A library containing all core logic for data generation and model training. |
| 52 | + * **`check_population.py`**: A utility script to analyze the risk distribution of your patient database. |
| 53 | + * **`generate_data.py`**: A helper script to manually run data generation (optional, can be done from the app). |
| 54 | + * **`train_model.py`**: A helper script to manually run model training (optional, can be done from the app). |
| 55 | + * **`templates/index.html`**: The web page template. |
| 56 | + * **`nhs_risk_calculator/`**: The Feast feature store repository. |
| 57 | + * **`definitions.py`**: The formal definitions of our patient features. |
| 58 | + * **`feature_store.yaml`**: The Feast configuration file. |
| 59 | + * **`.gitignore`**: (Recommended) To exclude generated files (`*.db`, `*.pkl`, `*.parquet`, `.venv/`) from GitHub. |
| 60 | + |
| 61 | +----- |
| 62 | + |
| 63 | +## Local PC Setup |
| 64 | + |
| 65 | +Follow these steps to get the application running locally. |
| 66 | + |
| 67 | +1. **Clone the Repository:** |
| 68 | + |
| 69 | + ```bash |
| 70 | + git clone <your-repo-url> |
| 71 | + cd <your-repo-name> |
| 72 | + ``` |
| 73 | + |
| 74 | +2. **Create a Virtual Environment:** |
| 75 | + |
| 76 | + ```bash |
| 77 | + python -m venv .venv |
| 78 | + ``` |
| 79 | + |
| 80 | +3. **Activate the Environment:** |
| 81 | + |
| 82 | + * On Windows: |
| 83 | + ```bash |
| 84 | + .venv\Scripts\activate |
| 85 | + ``` |
| 86 | + * On macOS/Linux: |
| 87 | + ```bash |
| 88 | + source .venv/bin/activate |
| 89 | + ``` |
| 90 | + |
| 91 | +4. **Install Required Packages:** |
| 92 | + |
| 93 | + ```bash |
| 94 | + pip install feast pandas faker scikit-learn flask joblib |
| 95 | + ``` |
| 96 | + |
| 97 | +5. **Generate Initial Data:** |
| 98 | + You must create an initial dataset before the app can run. This script creates the first batch of patients, registers them with Feast, and populates the databases. |
| 99 | + |
| 100 | + ```bash |
| 101 | + python generate_data.py |
| 102 | + ``` |
| 103 | + |
| 104 | +6. **Train the First Model:** |
| 105 | + Now that you have data, you must train the first version of the risk model. |
| 106 | + |
| 107 | + ```bash |
| 108 | + python train_model.py |
| 109 | + ``` |
| 110 | + |
| 111 | +7. **Run the Web App:** |
| 112 | + You're all set. Start the Flask server: |
| 113 | + |
| 114 | + ```bash |
| 115 | + python app.py |
| 116 | + ``` |
| 117 | + |
| 118 | + Now, open `http://127.0.0.1:5000` in your web browser. |
| 119 | + |
| 120 | +----- |
| 121 | + |
| 122 | +## Using the Application |
| 123 | + |
| 124 | +The web app provides an interactive way to simulate the MLOps lifecycle. |
| 125 | + |
| 126 | +### 1. Calculate Patient Risk (The "GP View") |
| 127 | + |
| 128 | +This is the main function of the app. |
| 129 | + |
| 130 | + * **Action:** Enter a Patient ID (e.g., 1-500) and click "Calculate Risk". |
| 131 | + * **What it does:** The app queries the **online store (SQLite)** for the *latest* features for that patient, feeds them into the loaded **model (.pkl file)**, and displays the resulting risk score (LOW/MEDIUM/HIGH). |
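The last step, turning a model probability into a LOW/MEDIUM/HIGH label, can be sketched as below. The cut-offs (`medium_from=0.20`, `high_from=0.50`) and the function names are assumptions chosen only so that the README's example score of 28.5% lands in MEDIUM; the thresholds actually used in `app.py` may differ.

```python
def risk_band(probability, medium_from=0.20, high_from=0.50):
    """Map a model probability in [0, 1] to a LOW/MEDIUM/HIGH label.

    The cut-offs are illustrative assumptions, not the app's real thresholds.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_from:
        return "HIGH"
    if probability >= medium_from:
        return "MEDIUM"
    return "LOW"


def format_risk(probability):
    """Render a score in the 'NN.N% - BAND RISK' style used in this README."""
    return f"{probability:.1%} - {risk_band(probability)} RISK"


example = format_risk(0.285)
```

Keeping the banding in one small function means the web page and a population report can share the same thresholds.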
| 132 | + |
| 133 | +### 2. Add New Patients (The "Data Engineering" View) |
| 134 | + |
| 135 | +This simulates new patient data arriving in the health system. |
| 136 | + |
| 137 | + * **Action:** Click the "Add 500 New Patients" button. |
| 138 | + * **What it does:** |
| 139 | + 1. Generates 500 new synthetic patients and appends them to the **offline store (Parquet file)**. |
| 140 | + 2. Runs `feast materialize` to scan the offline store and update the **online store (SQLite)** with the latest features for all patients (including the new ones). |
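Conceptually, materialization copies the most recent feature row per patient from the offline history into a key-value style online table. The sketch below imitates that with two in-memory SQLite tables. It is not Feast's implementation or its on-disk schema, and the table names, column names, and values are hypothetical.

```python
import sqlite3

# Hypothetical offline history: one row per patient per observation.
offline_rows = [
    (10, "2024-01-01", 26.8),
    (10, "2024-06-01", 27.8),
    (11, "2024-03-01", 23.4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offline (patient_id INTEGER, event_ts TEXT, bmi REAL)")
conn.execute("CREATE TABLE online (patient_id INTEGER PRIMARY KEY, event_ts TEXT, bmi REAL)")
conn.executemany("INSERT INTO offline VALUES (?, ?, ?)", offline_rows)


def materialize(conn):
    """Copy the most recent offline row per patient into the online table."""
    conn.execute(
        """
        INSERT OR REPLACE INTO online (patient_id, event_ts, bmi)
        SELECT o.patient_id, o.event_ts, o.bmi
        FROM offline AS o
        JOIN (
            SELECT patient_id, MAX(event_ts) AS latest_ts
            FROM offline
            GROUP BY patient_id
        ) AS m ON m.patient_id = o.patient_id AND m.latest_ts = o.event_ts
        """
    )
    conn.commit()


materialize(conn)

# Simulate "Add New Patients": append to the offline table, then re-materialize.
conn.execute("INSERT INTO offline VALUES (12, '2024-07-01', 31.2)")
materialize(conn)

online = dict(conn.execute("SELECT patient_id, bmi FROM online").fetchall())
```

After the second call, the online table holds exactly one row per patient, including the newly added one, which is what a real-time lookup needs.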
| 141 | + |
| 142 | +### 3. Retrain Risk Model (The "Data Science" View) |
| 143 | + |
| 144 | +This simulates a data scientist updating the risk model with new data. |
| 145 | + |
| 146 | + * **Action:** Click the "Retrain Risk Model" button. |
| 147 | + * **What it does:** |
| 148 | + 1. Fetches the *entire patient history* from the **offline store (Parquet file)**. |
| 149 | + 2. Generates synthetic `True`/`False` diabetes labels for each patient from their risk factors. |
| 150 | + 3. Trains a *new* `LogisticRegression` model on this fresh, complete dataset. |
| 151 | + 4. Saves the new model over the old `diabetes_model.pkl` and reloads it into the app's memory. |
| 152 | + |
| 153 | +----- |
| 154 | + |
| 155 | +## Test the Full Loop (A "What If" Scenario) |
| 156 | + |
| 157 | +This is the best way to see the whole system in action. The predictions you see depend on two things: the **Patient's Data** and the **Model's "Brain"**. You can change both. |
| 158 | + |
| 159 | +### The Experiment |
| 160 | + |
| 161 | +Follow these steps to see how a prediction can change for the *same patient*. |
| 162 | + |
| 163 | +1. **Get a Baseline:** |
| 164 | + |
| 165 | + * Run the app (`python app.py`). |
| 166 | + * Enter Patient ID **10**. |
| 167 | + * Note their features (e.g., BMI, Age) and their risk (e.g., **28.5% - MEDIUM RISK**). |
| 168 | + |
| 169 | +2. **Check the Population:** |
| 170 | + |
| 171 | + * In your terminal (while the app is still running), run the `check_population.py` script: |
| 172 | + ```bash |
| 173 | + python check_population.py |
| 174 | + ``` |
| 175 | + * Note the total number of HIGH risk patients (e.g., `HIGH RISK: 212`). |
| 176 | + |
| 177 | +3. **Add More Data:** |
| 178 | + |
| 179 | + * Go back to the browser and click the **"Add 500 New Patients"** button. |
| 180 | + * After it reloads, click it **again**. You have now added 1,000 new patients to the database. |
| 181 | + * **Test Patient 10 again:** Enter Patient ID **10**. Their risk score will be **identical (28.5% - MEDIUM RISK)**. |
| 182 | + * **Why?** Because their personal data hasn't changed, and the *model is still the same old one*. |
| 183 | + |
| 184 | +4. **Retrain the Model:** |
| 185 | + |
| 186 | + * Now, click the **"Retrain Risk Model"** button. |
| 187 | + * The app will now "learn" from the *entire* database, including the 1,000 new patients you added. This new data will slightly change the model's understanding of how features (like BMI) correlate with risk. |
| 188 | + |
| 189 | +5. **See the Change:** |
| 190 | + |
| 191 | + * **Test Patient 10 one last time:** Enter Patient ID **10**. |
| 192 | + * You should see that their risk score has changed (e.g., it might now be **30.1% - MEDIUM RISK**). |
| 193 | + * **Why?** Their personal data is the same, but the **model's "brain" has been updated**. It has a more refined understanding of risk, so its prediction for the *same patient* is now different. |
| 194 | + |
| 195 | +6. **Check the Population Again:** |
| 196 | + |
| 197 | + * Run `python check_population.py` one more time. |
| 198 | + * You will see that the total number of LOW/MEDIUM/HIGH risk patients has changed, reflecting the new model's predictions across the entire population. |
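The experiment above comes down to one observation: a prediction is a function of both the patient's features and the model's coefficients. The sketch below makes that concrete with a hand-written logistic function. The feature values, coefficients, and intercepts are invented for illustration and are not taken from the trained model.

```python
import math


def predict_risk(features, coef, intercept):
    """Logistic-regression style probability: sigmoid(intercept + coef . features)."""
    z = intercept + sum(coef[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature values for one patient; unchanged throughout the experiment.
patient_10 = {"age": 52, "bmi": 29.4}

# Hypothetical coefficients before and after retraining.
old_model = {"coef": {"age": 0.030, "bmi": 0.080}, "intercept": -4.85}
new_model = {"coef": {"age": 0.031, "bmi": 0.082}, "intercept": -4.90}

baseline = predict_risk(patient_10, **old_model)         # step 1: baseline
after_more_data = predict_risk(patient_10, **old_model)  # step 3: data added, same model
after_retrain = predict_risk(patient_10, **new_model)    # step 5: retrained model
```

Here `baseline` and `after_more_data` are identical because neither input changed, while `after_retrain` differs because only the coefficients moved.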