[Add] Initial HealthTable setup and support #30
Merged
33 commits
6a5e645
enabled extension on DataFrames + OMOP CDM load
kosuri-indu 414e423
added omop stub and updated HealthBase accordingly
kosuri-indu 654d9b4
updated omop cdm extension
kosuri-indu 4312f9d
added sketch notes on the interface
kosuri-indu 4bd0f7e
Merge branch 'master' into health-table
kosuri-indu 6a1c61a
refactored and updated in docs
kosuri-indu 7c83da5
added sample tests
kosuri-indu 431344a
Merge branch 'master' into health-table
kosuri-indu 72374b7
added struct healthtable in comments
kosuri-indu 3b24436
Merge branch 'health-table' of https://github.com/kosuri-indu/HealthB…
kosuri-indu 6e1bfce
Merge branch 'master' into health-table
kosuri-indu fc8e0b7
addons and working test code
kosuri-indu 8111e25
made it Tables.jl compatible
kosuri-indu 688d47c
resolved review changes and added preprocessing utilities
kosuri-indu fae3026
added other preprocessing functions
kosuri-indu 452ebb1
added new strategies and tests
kosuri-indu a35ff7a
Merge branch 'master' into health-table
kosuri-indu 9884f6a
updated verification metadata and onehotencoding function
kosuri-indu 5983bfd
updated map_concepts function
kosuri-indu 876bdc9
update functions and review changes
kosuri-indu 572e190
updated docs
kosuri-indu 7c33d3b
updated all docs
kosuri-indu edec226
updated all test and docs
kosuri-indu 094f126
final changes
kosuri-indu 7c69a8b
updated docs, removed unnecessary dependency
kosuri-indu 5f56640
julia-actions errors fix
kosuri-indu 6e96078
removed stats from runtests
kosuri-indu dd1a2db
updated tests for code coverage
kosuri-indu c22c3db
updated omopcdm ext tests for code coverage
kosuri-indu 142ec04
removed redundant tests
kosuri-indu 6a83ab4
updated ext edge case tests for coverage
kosuri-indu beac389
validation tests to cover remaining lines
kosuri-indu db2caec
Re-run CI
kosuri-indu
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| # HealthTable: Tables.jl Interface (General) | ||
|
|
||
| ## The `HealthTable` Struct | ||
|
|
||
| The core of the interface is the `HealthTable` struct. | ||
|
|
||
| ```julia | ||
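| # Tables.jl does not require (or provide) an abstract table supertype, so none is declared. | ||
| @kwdef struct HealthTable | ||
|     source::DataFrame | ||
|     omopcdm_version::String | ||
|     function HealthTable(source::DataFrame, omopcdm_version::String) | ||
|         # validation code goes here | ||
|         return new(source, omopcdm_version) | ||
|     end | ||
| end | ||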
| ``` | ||
|
|
||
| ## `Tables.jl` API Implementation | ||
|
||
|
|
||
| The `HealthTable` wrapper types will implement key `Tables.jl` methods: | ||
|
|
||
| - `Tables.istable` | ||
| - `Tables.rowaccess` | ||
| - `Tables.rows` | ||
| - `Tables.columnaccess` | ||
| - `Tables.columns` | ||
| - `Tables.schema` | ||
| - `Tables.materializer` | ||
|
|
||
| Source: https://tables.juliadata.org/stable/implementing-the-interface/ | ||
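The methods listed above can be sketched for a thin wrapper around a `DataFrame`. This is only an illustration under stated assumptions: `SketchHealthTable` is a hypothetical stand-in type, not the struct defined in this PR, and every method simply forwards to the wrapped `DataFrame`.

```julia
using DataFrames, Tables

# Hypothetical wrapper used only for this sketch; not the package's definition.
struct SketchHealthTable
    source::DataFrame
    omopcdm_version::String
end

# Declare that the wrapper is a table and supports both access patterns.
Tables.istable(::Type{SketchHealthTable}) = true
Tables.rowaccess(::Type{SketchHealthTable}) = true
Tables.columnaccess(::Type{SketchHealthTable}) = true

# Forward the data-access methods to the wrapped DataFrame.
Tables.rows(t::SketchHealthTable) = Tables.rows(t.source)
Tables.columns(t::SketchHealthTable) = Tables.columns(t.source)
Tables.schema(t::SketchHealthTable) = Tables.schema(t.source)

# Sinks that ask how to rebuild this table type are handed the DataFrame constructor.
Tables.materializer(::Type{SketchHealthTable}) = DataFrame

t = SketchHealthTable(DataFrame(person_id = [1, 2], year_of_birth = [1985, 1992]), "v5.4.0")
sch = Tables.schema(t)
println(sch.names)
```

Forwarding keeps the wrapper small: any consumer that accepts a `Tables.jl` source can read `t` directly, while the version string travels alongside the data.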
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| # OMOP CDM Support for HealthTable | ||
|
|
||
| ## Core Goals & Features | ||
|
|
||
| The proposed interface aims to provide: | ||
|
|
||
| - Schema-Aware Validation: Instead of just wrapping your data, `HealthBase.jl` actively validates it against the official OMOP Common Data Model specification. "Schema-aware" means it understands the expected structure for a given OMOP CDM version (e.g., "v5.4.0") by using `OMOPCommonDataModel.jl`. This includes: | ||
| - Column Type Enforcement: It checks that the data types in your table (e.g., a `DataFrame`) match the official types. For example, it ensures that a `person_id` is an integer and a `condition_start_date` is a `Date`. | ||
| - Error Reporting: If there are mismatches, it provides clear, actionable error messages listing all columns that do not conform to the schema, helping you fix your data quickly. | ||
| - Preprocessing Utilities: Built-in or easily integrable support for common preprocessing tasks, including: | ||
| - One-hot encoding. | ||
| - Normalization. | ||
| - Handling missing values. | ||
| - Vocabulary compression for high-cardinality categorical variables. | ||
| - JuliaHealth Integration: Seamless interoperability with existing and future JuliaHealth tools, such as: | ||
| - `OMOPCDMCohortCreator.jl` | ||
| - `MLJ.jl` (for machine learning pipelines) | ||
| - `OHDSICohortExpressions.jl` | ||
| - Foundation for Interoperability: Serve as a foundational layer for broader interoperability across the JuliaHealth ecosystem, supporting researchers working with OMOP CDM-styled data. | ||
|
|
||
| ## Proposed `Tables.jl` Interface Sketch | ||
|
|
||
| Before data is wrapped by the `Tables.jl` interface described below, it's generally expected to undergo initial validation and preparation. This is typically handled by the `HealthBase.HealthTable` function (itself an extension within `HealthBase.jl` that uses `OMOPCommonDataModel.jl`). `HealthTable` takes a source (like a `DataFrame`), validates its structure and column types against the specific OMOP CDM table schema, attaches relevant metadata, and returns a conformed `DataFrame`. | ||
|
|
||
| The `HealthTable` wrappers discussed next would then ideally consume this validated `DataFrame` (the output of `HealthTable`) to provide a standardized, schema-aware `Tables.jl` view for further operations and interoperability. | ||
|
|
||
| The core idea is to define wrapper types around OMOP CDM data sources. Initially, we can focus on in-memory `DataFrame`s, but the design should be extensible to database connections or other `Tables.jl`-compatible sources. These wrapper types will implement the `Tables.jl` interface. |
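The column type enforcement and error reporting described under "Core Goals & Features" could look roughly like the following. It is a minimal sketch in plain Julia: `EXPECTED_TYPES` and `schema_mismatches` are hypothetical names, and the expected types are written out by hand here, whereas the proposal derives them from `OMOPCommonDataModel.jl`.

```julia
using Dates

# Hypothetical expected types for a few condition_occurrence columns.
# A real implementation would derive these from OMOPCommonDataModel.jl.
const EXPECTED_TYPES = Dict{Symbol,DataType}(
    :condition_occurrence_id => Int,
    :person_id => Int,
    :condition_start_date => Date,
)

# Return one message per column whose element type does not match the expectation.
function schema_mismatches(table::NamedTuple, expected::AbstractDict{Symbol,DataType})
    problems = String[]
    for (name, column) in pairs(table)
        haskey(expected, name) || continue        # unknown columns are ignored in this sketch
        actual = nonmissingtype(eltype(column))   # tolerate `missing` entries
        if !(actual <: expected[name])
            push!(problems, "$name: expected $(expected[name]), got $actual")
        end
    end
    return problems
end

bad = (
    condition_occurrence_id = [1, 2],
    person_id = [101, 102],
    condition_start_date = ["2010-01-01", "2012-05-10"],  # strings, not Dates
)
problems = schema_mismatches(bad, EXPECTED_TYPES)
foreach(println, problems)
```

Collecting every mismatch before reporting, rather than stopping at the first one, matches the stated goal of listing all non-conforming columns at once.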
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| # OMOP CDM Workflow with HealthTable | ||
|
||
|
|
||
| ## Typical Workflow | ||
|
||
|
|
||
| The envisioned process for working with OMOP CDM data using these `HealthBase.jl` components typically follows these steps: | ||
|
|
||
| 1. **Data Loading**: | ||
| Raw data is loaded into a suitable tabular structure, most commonly a `DataFrame`. | ||
|
|
||
| 2. **Validation and Conformance with `HealthTable`:** | ||
| The raw `DataFrame` is then processed by the `HealthBase.HealthTable` function. This function takes the `DataFrame` and an OMOP CDM version string (for example, `"v5.4.0"`) as arguments and validates the table's structure and column types against the general OMOP CDM schema for that version. | ||
| * It checks if the data types in those columns are compatible with the official OMOP CDM types (as defined in `OMOPCommonDataModel.jl`). | ||
| * It can warn about discrepancies or, if `disable_type_enforcement=false`, potentially error or attempt safe conversions. | ||
| * Crucially, it attaches metadata to the columns, indicating their official OMOP CDM types. | ||
| * The output is a `DataFrame` that is now validated and conformed to the specified OMOP CDM table structure. | ||
|
|
||
| 3. **Wrapping with `HealthTable`:** | ||
| The validated and conformed `DataFrame` (output from `HealthTable`) is then wrapped in a `HealthTable` to provide a schema-aware `Tables.jl` interface. This wrapper uses the same `OMOPCommonDataModel.jl` types to ensure consistency. | ||
|
|
||
| 4. **Interacting via `Tables.jl`:** | ||
| Once wrapped, the `HealthTable` instance can be used with any `Tables.jl`-compatible tool and with the standard `Tables.jl` functions. | ||
|
|
||
| 5. **Applying Preprocessing Utilities:** | ||
| Once the data is a `HealthTable`, common preprocessing steps essential for analysis or predictive modeling can be applied. These methods, built upon the `Tables.jl` interface, include: | ||
| * One-hot encoding. | ||
| * Handling of high-cardinality categorical variables. | ||
| * Concept mapping utilities to group related codes (example: SNOMED conditions). | ||
| * Normalization, missing value imputation, etc. | ||
| These utilities would typically return a new (or modified) `HealthTable` or a materialized `DataFrame`, ready for further use. | ||
|
|
||
|
|
||
| ## Example Usage (Conceptual) | ||
|
|
||
| ```julia | ||
| using HealthBase # (once the OMOP Tables interface is part of it) | ||
| using OMOPCommonDataModel | ||
| using DataFrames # an example source | ||
| using Dates, Tables # for `Date` and for `Tables.schema` / `Tables.rows` | ||
|
|
||
| # Assume 'condition_occurrence_df' is a DataFrame loaded from a CSV/database | ||
| condition_occurrence_df = DataFrame( | ||
| condition_occurrence_id = [1, 2, 3], | ||
| person_id = [101, 102, 101], | ||
| condition_concept_id = [201826, 433736, 317009], | ||
| condition_start_date = [Date(2010,1,1), Date(2012,5,10), Date(2011,3,15)] | ||
| # ... other fields | ||
| ) | ||
|
|
||
| # Validate and wrap the DataFrame with HealthTable | ||
| ht_conditions = HealthTable(condition_occurrence_df; omop_cdm_version="v5.4.0") | ||
|
|
||
|
|
||
| # 1. Schema Inspection | ||
| sch = Tables.schema(ht_conditions) | ||
| println("Schema Names: ", sch.names) | ||
| println("Schema Types: ", sch.types) | ||
| # This should output the names and types from the validated DataFrame | ||
|
|
||
| # 2. Iteration (Rows) | ||
| for row in Tables.rows(ht_conditions) | ||
| # 'row' is a Tables.Row, with fields matching the OMOP schema | ||
| println("Person ID: $(row.person_id), Condition: $(row.condition_concept_id)") | ||
| end | ||
|
|
||
| # 3. Integration with other packages (example: MLJ.jl) | ||
| # 4. Materialization | ||
| # ... and so on | ||
| ``` | ||
|
|
||
| ## Preprocessing and Utilities Sketch | ||
|
|
||
| Preprocessing utilities can operate on `HealthTable` objects (or their materialized versions), leveraging the `Tables.jl` interface and schema awareness derived via `Tables.schema`. Examples include: | ||
|
|
||
| - `one_hot_encode(ht::HealthTable, column_symbol::Symbol; drop_original=true)` | ||
| - `normalize_column(ht::HealthTable, column_symbol::Symbol; method=:z_score)` | ||
| - `apply_vocabulary_compression(ht::HealthTable, column_symbol::Symbol, mapping_dict::Dict)` | ||
| - `map_concepts(ht::HealthTable, column_symbol::Symbol, concept_map::AbstractDict)` | ||
|
|
||
| These functions would align with the principle of optional, user triggered transformations, possibly controlled by keyword arguments. | ||
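To make the intended behaviour of two of these utilities concrete, here is a minimal sketch that works on a plain column table (a `NamedTuple` of vectors) instead of a `HealthTable`. The `_sketch` function names and the group codes `1` and `2` are assumptions made for illustration; the signatures proposed above may differ in the final package.

```julia
# Replace codes in one column using `concept_map`; unmapped codes are kept unchanged.
function map_concepts_sketch(table::NamedTuple, col::Symbol, concept_map::AbstractDict)
    mapped = [get(concept_map, v, v) for v in table[col]]
    return merge(table, NamedTuple{(col,)}((mapped,)))
end

# Add one indicator column per distinct value of `col`, optionally dropping the original.
# Columns containing `missing` are out of scope for this sketch.
function one_hot_encode_sketch(table::NamedTuple, col::Symbol; drop_original::Bool = true)
    vals = table[col]
    levels = sort(unique(vals))
    newnames = Tuple(Symbol(col, "_", level) for level in levels)
    indicators = Tuple(vals .== level for level in levels)
    encoded = merge(table, NamedTuple{newnames}(indicators))
    drop_original || return encoded
    kept = filter(!=(col), keys(encoded))
    return NamedTuple{kept}(encoded)
end

conditions = (
    person_id = [101, 102, 101],
    condition_concept_id = [201826, 433736, 317009],
)
# Hypothetical grouping: two concept IDs are mapped to group codes 1 and 2.
grouped = map_concepts_sketch(conditions, :condition_concept_id, Dict(201826 => 1, 433736 => 2))
encoded = one_hot_encode_sketch(grouped, :condition_concept_id)
println(keys(encoded))
```

Both functions return a new table and leave their input untouched, which fits the principle of optional, user-triggered transformations.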
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| # Quickstart: Preprocessing OMOP Data | ||
|
|
||
| This guide demonstrates a practical, end-to-end workflow for cleaning and transforming raw patient data into a format suitable for machine learning using the `HealthBase.jl` preprocessing utilities. | ||
|
|
||
| ### 1. Load Packages | ||
|
|
||
| First, start a Julia session in your project environment and load the necessary packages. | ||
|
|
||
| NOTE: For the workflow to work, we need to load the trigger packages `DataFrames`, `OMOPCommonDataModel`, `InlineStrings`, `Serialization`, `Statistics`, and `Dates` before loading `HealthBase.jl`. See the "For Developers" section below for more information. | ||
|
|
||
| ```julia | ||
| # First, load the trigger packages | ||
| using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Statistics, Dates | ||
|
|
||
| # Then, load HealthBase | ||
| using HealthBase | ||
| ``` | ||
|
|
||
| ### 2. Create a Sample Dataset | ||
|
|
||
| We'll start with a sample `DataFrame` that mimics raw data from a clinical database. It includes missing values, categorical data, and different data types. | ||
|
|
||
| ```julia | ||
| raw_df = DataFrame( | ||
| person_id = 101:108, | ||
| gender_concept_id = [8507, 8532, 8507, 8532, 8507, 8532, 8507, 8507], | ||
| year_of_birth = [1985, 1992, 1985, 1978, 2000, 2001, 1992, 1988], | ||
| race_concept_id = [8527, 8515, 8527, 8516, 8527, 8515, 8516, 8527], | ||
| cholesterol = [189, 210, 240, missing, 195, 220, missing, 205.0] | ||
| ) | ||
|
|
||
| ht = HealthTable(source=raw_df, omop_cdm_version="v5.4.1") | ||
| ``` | ||
|
|
||
| ### 3. Preprocessing Pipeline | ||
|
|
||
| Now, we'll apply a series of transformations to clean and prepare the data. | ||
|
|
||
| #### Step A: Impute Missing Values | ||
|
|
||
| First, we'll fill in the `missing` values in `cholesterol` with the mean of that column's observed values. | ||
|
|
||
| ```julia | ||
| ht_imputed = impute_missing(ht; cols=[:cholesterol], strategy=:mean) | ||
| ``` | ||
|
|
||
| #### Step B: One-Hot Encode Categorical Features | ||
|
|
||
| Next, we convert the categorical `gender_concept_id` and `race_concept_id` columns into numerical, binary columns. | ||
|
|
||
| ```julia | ||
| ht_onehot = one_hot_encode(ht_imputed; cols=[:gender_concept_id, :race_concept_id]) | ||
| ``` | ||
|
|
||
| #### Step C: Normalize Numerical Features | ||
|
|
||
| Finally, we scale the `cholesterol` column to have a mean of 0 and a standard deviation of 1. This helps many machine learning algorithms perform better. | ||
|
|
||
| ```julia | ||
| ht_final = normalize_column(ht_onehot; cols=[:cholesterol]) | ||
|
|
||
| println("--- Final model-ready data ---") | ||
| println(ht_final.source) | ||
| ``` | ||
|
|
||
| After these steps, `ht_final` contains a fully preprocessed, numerical dataset that is ready to be used for model training. | ||
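For readers who want to see what Steps A and C compute, the arithmetic can be reproduced in plain Julia on the `cholesterol` column from the sample dataset. This is a sketch of the underlying calculations, not the `HealthBase.jl` implementation: `impute_mean` and `zscore_sketch` are hypothetical helper names, and the use of the sample standard deviation (Julia's default for `std`) is an assumption.

```julia
using Statistics

# Replace `missing` entries with the mean of the observed values.
function impute_mean(column::AbstractVector)
    observed_mean = mean(skipmissing(column))
    return [ismissing(x) ? observed_mean : x for x in column]
end

# Rescale to mean 0 and standard deviation 1 (z-score),
# using the sample standard deviation that `std` computes by default.
function zscore_sketch(column::AbstractVector{<:Real})
    return (column .- mean(column)) ./ std(column)
end

cholesterol = [189, 210, 240, missing, 195, 220, missing, 205.0]
imputed = impute_mean(cholesterol)
scaled = zscore_sketch(imputed)
println(round.(scaled; digits = 3))
```

Imputing before normalizing matters here: `mean` and `std` return `missing` if any entry is `missing`, so the scaling step needs a fully observed column.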
|
|
||
| ### For Developers: Interactive Use in the REPL | ||
|
|
||
| When working with `HealthBase.jl` interactively in the Julia REPL, especially during development, it's important to load packages in the correct order to ensure that package extensions are activated. | ||
|
|
||
| If you try to call a function from an extension (like `impute_missing`) and get a `MethodError`, it's likely because the extension was not loaded. An extension is activated only once all of its "trigger packages" are loaded in the session; loading them **before** `HealthBase`, as in the block below, is the simplest way to guarantee that. | ||
|
|
||
| For the OMOP CDM extension, the trigger packages are `DataFrames`, `OMOPCommonDataModel`, `InlineStrings`, `Serialization`, `Statistics`, and `Dates`. | ||
|
|
||
| **Correct Loading Order:** | ||
| ```julia | ||
| # First, load the trigger packages | ||
| using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Statistics, Dates | ||
|
|
||
| # Then, load HealthBase | ||
| using HealthBase | ||
|
|
||
| # Now, functions from the extension will be available | ||
| # ht_imputed = impute_missing(ht; cols=[:cholesterol], strategy=:mean) # This will now work | ||
| ``` |