[Add] Initial HealthTable setup and support #30
Merged
Changes from 4 commits
33 commits
- 6a5e645 enabled extension on DataFrames + OMOP CDM load (kosuri-indu)
- 414e423 added omop stub and updated HealthBase accordingly (kosuri-indu)
- 654d9b4 updated omop cdm extension (kosuri-indu)
- 4312f9d added sketch notes on the interface (kosuri-indu)
- 4bd0f7e Merge branch 'master' into health-table (kosuri-indu)
- 6a1c61a refactored and updated in docs (kosuri-indu)
- 7c83da5 added sample tests (kosuri-indu)
- 431344a Merge branch 'master' into health-table (kosuri-indu)
- 72374b7 added struct healthtable in comments (kosuri-indu)
- 3b24436 Merge branch 'health-table' of https://github.com/kosuri-indu/HealthB… (kosuri-indu)
- 6e1bfce Merge branch 'master' into health-table (kosuri-indu)
- fc8e0b7 addons and working test code (kosuri-indu)
- 8111e25 made it Tables.jl compatible (kosuri-indu)
- 688d47c resolved review changes and added preprocessing utilities (kosuri-indu)
- fae3026 added other preprocessing functions (kosuri-indu)
- 452ebb1 added new strategies and tests (kosuri-indu)
- a35ff7a Merge branch 'master' into health-table (kosuri-indu)
- 9884f6a updated verification metadata and onehotencoding function (kosuri-indu)
- 5983bfd updated map_concepts function (kosuri-indu)
- 876bdc9 update functions and review changes (kosuri-indu)
- 572e190 updated docs (kosuri-indu)
- 7c33d3b updated all docs (kosuri-indu)
- edec226 updated all test and docs (kosuri-indu)
- 094f126 final changes (kosuri-indu)
- 7c69a8b updated docs, removed unnecessary dependency (kosuri-indu)
- 5f56640 julia-actions errors fix (kosuri-indu)
- 6e96078 removed stats from runtests (kosuri-indu)
- dd1a2db updated tests for code coverage (kosuri-indu)
- c22c3db updated omopcdm ext tests for code coverage (kosuri-indu)
- 142ec04 removed redundant tests (kosuri-indu)
- 6a83ab4 updated ext edge case tests for coverage (kosuri-indu)
- beac389 validation tests to cover remaining lines (kosuri-indu)
- db2caec Re-run CI (kosuri-indu)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,148 @@ | ||
| # Sketch: Tables.jl Interface for OMOP Common Data Model in HealthBase.jl | ||
|
|
||
| ## Core Goals & Features | ||
|
|
||
| The proposed interface aims to provide: | ||
|
|
||
| - Schema-Aware Access: Standardized, schema-aware access to core OMOP CDM tables (for example, `PERSON`, `CONDITION_OCCURRENCE`, `DRUG_EXPOSURE`, and `OBSERVATION_PERIOD`). Schema awareness will be derived from `OMOPCommonDataModel.jl`. | ||
| - Preprocessing Utilities: Built-in or easily integrable support for common preprocessing tasks, including: | ||
| - One-hot encoding. | ||
| - Normalization. | ||
| - Handling missing values. | ||
| - Vocabulary compression for high-cardinality categorical variables. | ||
| - Concept Mapping: Utilities to aggregate or map related medical codes (example: grouping SNOMED conditions). | ||
| - JuliaHealth Integration: Seamless interoperability with existing and future JuliaHealth tools, such as: | ||
| - `OMOPCDMCohortCreator.jl` | ||
| - `MLJ.jl` (for machine learning pipelines) | ||
| - `OHDSICohortExpressions.jl` | ||
| - Foundation for Interoperability: Serve as a foundational layer for broader interoperability across the JuliaHealth ecosystem, supporting researchers working with OMOP CDM-styled data. | ||
|
|
||
| ## Proposed `Tables.jl` Interface Sketch | ||
|
|
||
| Before data is wrapped by the `Tables.jl` interface described below, it's generally expected to undergo initial validation and preparation. This is typically handled by the `HealthBase.HealthTable` function (itself an extension within `HealthBase.jl` that uses `OMOPCommonDataModel.jl`). `HealthTable` takes a source (like a `DataFrame`), validates its structure and column types against the specific OMOP CDM table schema, attaches relevant metadata, and returns a conformed `DataFrame`. | ||
|
|
||
| The `OMOPCDMTable` wrappers discussed next would then ideally consume this validated `DataFrame` (the output of `HealthTable`) to provide a standardized, schema-aware `Tables.jl` view for further operations and interoperability. | ||
|
|
||
| The core idea is to define wrapper types around OMOP CDM data sources. Initially, we can focus on in-memory `DataFrame`s, but the design should be extensible to database connections or other `Tables.jl`-compatible sources. These wrapper types will implement the `Tables.jl` interface. | ||
|
|
||
| ## Typical Workflow | ||
|
|
||
| The envisioned process for working with OMOP CDM data using these `HealthBase.jl` components typically follows these steps: | ||
|
|
||
| 1. **Data Loading**: | ||
| Raw data is loaded into a suitable tabular structure, most commonly a `DataFrame`. | ||
|
|
||
| 2. **Validation and Conformance with `HealthTable`:** | ||
| The raw `DataFrame` is then processed by the `HealthBase.HealthTable` function. This function takes the `DataFrame` and an OMOP CDM version string (example: "5.4") as arguments, validating its structure and column types against the general OMOP CDM schema for that version. | ||
| * It checks whether the column data types are compatible with the official OMOP CDM types (as defined in `OMOPCommonDataModel.jl`). | ||
| * On a type mismatch it raises an error by default (`disable_type_enforcement=false`); with `disable_type_enforcement=true` it emits a warning instead. Attempting safe conversions is a possible later addition. | ||
| * Crucially, it attaches metadata to the columns, indicating their official OMOP CDM types. | ||
| * The output is a `DataFrame` that is now validated and conformed to the specified OMOP CDM table structure. | ||
|
|
||
| 3. **Wrapping with `OMOPCDMTable`:** | ||
| The validated and conformed `DataFrame` (output from `HealthTable`) is then wrapped in an `OMOPCDMTable` to provide a schema-aware `Tables.jl` interface. This wrapper uses the same `OMOPCommonDataModel.jl` type to ensure consistency. | ||
|
|
||
| 4. **Interacting via `Tables.jl`:** | ||
| Once wrapped, the `OMOPCDMTable` instance can be used with any `Tables.jl`-compatible tools and standard `Tables.jl` functions. | ||
|
|
||
| 5. **Applying Preprocessing Utilities:** | ||
| Once the data is an `OMOPCDMTable`, common preprocessing steps essential for analysis or predictive modeling can be applied. These methods, built upon the `Tables.jl` interface, include: | ||
| * One-hot encoding. | ||
| * Handling of high-cardinality categorical variables. | ||
| * Concept mapping utilities to group related codes (example: SNOMED conditions). | ||
| * Normalization, missing value imputation, etc. | ||
| These utilities would typically return a new (or modified) `OMOPCDMTable` or a materialized `DataFrame`, ready for further use. | ||
|
|
||
|
|
||
| ## OMOP CDM Table Wrapper Types | ||
|
|
||
| We could define a generic wrapper or specific types for each OMOP CDM table: | ||
|
|
||
| ```julia | ||
| # A possible generic wrapper: | ||
| # T_CDM is the ::Type object from OMOPCommonDataModel, example: OMOPCommonDataModel.Person | ||
| # S is the type of the data source, example: a DataFrame | ||
| # Note: Tables.jl defines no abstract table supertype; a type opts in by implementing the interface methods. | ||
| struct OMOPCDMTable{T_CDM <: OMOPCommonDataModel.CDMType, S} | ||
|
||
| source::S | ||
| end | ||
|
|
||
| # Example of how it might be constructed: | ||
| # person_df = DataFrame(...) # Data loaded into a DataFrame | ||
| # omop_person_table = OMOPCDMTable{OMOPCommonDataModel.Person}(person_df) | ||
|
|
||
| # Alternatively, specific types might be more discoverable for users: | ||
| struct PersonTable{S} # no supertype needed; Tables.jl support comes from the interface methods | ||
| source::S | ||
| end | ||
| # Constructor: PersonTable(source_df) | ||
| # Internally, PersonTable would know it corresponds to OMOPCommonDataModel.Person. | ||
|
|
||
| # Similar structs could be defined for ConditionOccurrenceTable, DrugExposureTable, etc. | ||
| ``` | ||
|
|
||
| ## `Tables.jl` API Implementation | ||
|
|
||
| The `OMOPCDMTable` wrapper types will implement key `Tables.jl` methods: | ||
|
|
||
| - `Tables.istable` | ||
| - `Tables.rowaccess` | ||
| - `Tables.rows` | ||
| - `Tables.columnaccess` | ||
| - `Tables.columns` | ||
| - `Tables.schema` | ||
| - `Tables.materializer` | ||
|
|
||
| Source: https://tables.juliadata.org/stable/implementing-the-interface/ | ||
|
|
||
| ## Example Usage (Conceptual) | ||
|
|
||
| ```julia | ||
| using HealthBase # (once the OMOP Tables interface is part of it) | ||
| using OMOPCommonDataModel | ||
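| using DataFrames # an example source | ||
| using Dates # provides the Date constructor used below | ||
| using Tables # provides Tables.schema and Tables.rows | ||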
|
|
||
| # Assume 'condition_occurrence_df' is a DataFrame loaded from a CSV/database | ||
| condition_occurrence_df = DataFrame( | ||
| condition_occurrence_id = [1, 2, 3], | ||
| person_id = [101, 102, 101], | ||
| condition_concept_id = [201826, 433736, 317009], | ||
| condition_start_date = [Date(2010,1,1), Date(2012,5,10), Date(2011,3,15)] | ||
| # ... other fields | ||
| ) | ||
|
|
||
| # Wrap it with the schema-aware OMOPCDMTable | ||
| # Here, OMOPCommonDataModel.ConditionOccurrence is the specific OMOP CDM type | ||
| omop_conditions = OMOPCDMTable{OMOPCommonDataModel.ConditionOccurrence}(condition_occurrence_df) | ||
| # Or, if using specific types: | ||
| # omop_conditions = ConditionOccurrenceTable(condition_occurrence_df) | ||
|
||
|
|
||
|
|
||
| # 1. Schema Inspection | ||
| sch = Tables.schema(omop_conditions) | ||
| println("Schema Names: ", sch.names) | ||
| println("Schema Types: ", sch.types) | ||
| # This should output names and types corresponding to OMOPCommonDataModel.ConditionOccurrence | ||
|
|
||
| # 2. Iteration (Rows) | ||
| for row in Tables.rows(omop_conditions) | ||
| # 'row' would be a NamedTuple or similar, with fields matching the OMOP schema | ||
| println("Person ID: $(row.person_id), Condition: $(row.condition_concept_id)") | ||
| end | ||
|
|
||
| # 3. Integration with other packages (example: MLJ.jl) | ||
| # 4. Materialization | ||
| ... | ||
| # and so on | ||
| ``` | ||
|
||
|
|
||
| ## Preprocessing and Utilities Sketch | ||
|
|
||
| Preprocessing utilities can operate on `OMOPCDMTable` objects (or their materialized versions), leveraging the `Tables.jl` interface and schema awareness derived via `Tables.schema`. Examples include: | ||
|
|
||
| - `one_hot_encode(table::OMOPCDMTable, column_symbol::Symbol; drop_original=true)` | ||
| - `normalize_column(table::OMOPCDMTable, column_symbol::Symbol; method=:z_score)` | ||
| - `apply_vocabulary_compression(table::OMOPCDMTable, column_symbol::Symbol, mapping_dict::Dict)` | ||
| - `map_concepts(table::OMOPCDMTable, column_symbol::Symbol, concept_map::AbstractDict)` | ||
|
|
||
| These functions would align with the principle of optional, user-triggered transformations, possibly controlled by keyword arguments. | ||
|
|
||
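
The `Tables.jl` methods listed in the sketch can be illustrated with a small, self-contained example. This is only a sketch under stated assumptions, not the PR's implementation: `ConditionOccurrenceStub` is a hypothetical stand-in for `OMOPCommonDataModel.ConditionOccurrence`, the source is a plain `NamedTuple` of vectors instead of a `DataFrame`, and every interface method simply delegates to the wrapped source.

```julia
using Tables

# Hypothetical stand-in for an OMOPCommonDataModel type (assumption for this sketch).
struct ConditionOccurrenceStub end

# The wrapper has no Tables.jl supertype; it opts in through the methods below.
struct OMOPCDMTable{T_CDM, S}
    source::S
end

# Convenience constructor so that only the CDM type has to be spelled out.
OMOPCDMTable{T}(source::S) where {T, S} = OMOPCDMTable{T, S}(source)

# Declare the wrapper a table that supports both column and row access.
Tables.istable(::Type{<:OMOPCDMTable}) = true
Tables.columnaccess(::Type{<:OMOPCDMTable}) = true
Tables.rowaccess(::Type{<:OMOPCDMTable}) = true

# Delegate data access and schema to the wrapped source.
Tables.columns(t::OMOPCDMTable) = Tables.columns(t.source)
Tables.rows(t::OMOPCDMTable) = Tables.rows(t.source)
Tables.schema(t::OMOPCDMTable) = Tables.schema(Tables.columns(t.source))

# A NamedTuple of vectors is itself a valid Tables.jl column table.
conditions = (
    condition_occurrence_id = [1, 2, 3],
    person_id = [101, 102, 101],
    condition_concept_id = [201826, 433736, 317009],
)

omop_conditions = OMOPCDMTable{ConditionOccurrenceStub}(conditions)

sch = Tables.schema(omop_conditions)
person_ids = [row.person_id for row in Tables.rows(omop_conditions)]
materialized = Tables.columntable(omop_conditions)
```

Delegating to the source keeps the wrapper thin; a schema-aware version would presumably derive `Tables.schema` from the CDM type parameter instead of from the data.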
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| module HealthBaseOMOPExt | ||
|
|
||
| using HealthBase | ||
| using DataFrames | ||
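| using OMOPCommonDataModel | ||
| using InteractiveUtils: subtypes # subtypes is not in scope inside package modules by default | ||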
|
|
||
| __init__() = @info "OMOP CDM extension for HealthBase has been loaded!" | ||
|
|
||
| """ | ||
| HealthTable(df::DataFrame, omop_cdm_version="5.4"; disable_type_enforcement=false) | ||
|
|
||
| Validate a DataFrame against the OMOP CDM specification for the given version. | ||
|
|
||
| Checks column names/types, attaches OMOP metadata to columns, and returns the DataFrame. | ||
|
|
||
| If `disable_type_enforcement` is true, type mismatches emit warnings instead of errors. | ||
| """ | ||
| function HealthBase.HealthTable(df::DataFrame, omop_cdm_version="5.4"; disable_type_enforcement=false) | ||
| # TODO: add logic for version-specific field types | ||
| omop_fields = Dict{String, Dict{Symbol, Any}}() | ||
|
|
||
| for t in subtypes(OMOPCommonDataModel.CDMType) | ||
| for f in fieldnames(t) | ||
| actual_field_type = fieldtype(t, f) | ||
| omop_fields[string(f)] = Dict(:type => actual_field_type) | ||
| end | ||
| end | ||
|
|
||
| for col in names(df) | ||
| if haskey(omop_fields, col) | ||
| fieldinfo = omop_fields[col] | ||
| expected_type = get(fieldinfo, :type, Any) | ||
| actual_type = eltype(df[!, col]) | ||
|
|
||
| if !(actual_type <: expected_type) | ||
| msg = "Column '$(col)' has type $(actual_type), expected $(expected_type)" | ||
| if disable_type_enforcement | ||
| @warn msg | ||
| else | ||
| throw(ArgumentError(msg)) | ||
| end | ||
| end | ||
|
|
||
| for (key, val) in fieldinfo | ||
| colmetadata!(df, col, string(key), string(val), style=:note) | ||
| end | ||
| end | ||
| end | ||
|
|
||
| return df | ||
| end |
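
The type check at the heart of the extension can be tried in isolation. The following is a minimal sketch using only Base Julia, assuming a hypothetical `PersonStub` struct in place of a real `OMOPCommonDataModel` type and a `NamedTuple` of vectors in place of a `DataFrame`; `expected_types` and `check_columns` are illustrative names, not part of the PR.

```julia
# Hypothetical stand-in for an OMOPCommonDataModel struct; field types are illustrative.
struct PersonStub
    person_id::Int64
    year_of_birth::Union{Missing, Int64}
end

# Collect the expected field types of one CDM type by reflection.
expected_types(::Type{T}) where {T} =
    Dict(string(f) => fieldtype(T, f) for f in fieldnames(T))

# Compare each column's element type with the expected type.
# Returns the names of mismatched columns (only reachable when enforcement is disabled).
function check_columns(columns::NamedTuple, ::Type{T}; disable_type_enforcement=false) where {T}
    expected = expected_types(T)
    mismatches = String[]
    for (name, col) in pairs(columns)
        key = string(name)
        haskey(expected, key) || continue  # columns outside the schema are skipped
        actual = eltype(col)
        if !(actual <: expected[key])
            msg = "Column '$key' has type $actual, expected $(expected[key])"
            if disable_type_enforcement
                @warn msg
            else
                throw(ArgumentError(msg))
            end
            push!(mismatches, key)
        end
    end
    return mismatches
end

good = (person_id = [1, 2], year_of_birth = [1980, missing])
bad = (person_id = ["1", "2"], year_of_birth = [1980, 1990])

good_result = check_columns(good, PersonStub)
bad_result = check_columns(bad, PersonStub; disable_type_enforcement=true)
```

One design point worth noting: the extension keys its dictionary by field name across all CDM types, so a name shared by several types keeps only the last type seen. Checking against a single table type, as in this sketch, sidesteps that ambiguity.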
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| function HealthTable end | ||
|
||
|
|
||
| export HealthTable | ||
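
The stub above declares `HealthTable` as a generic function with no methods, which lets the extension module attach one later. A minimal sketch of that pattern, using two hypothetical modules (`BaseStub` and `ExtStub`) rather than the real package and extension:

```julia
module BaseStub
    # Declare the generic function without any methods.
    function HealthTable end
    export HealthTable
end

# No methods exist until another module adds one.
n_before = length(methods(BaseStub.HealthTable))

module ExtStub
    using ..BaseStub
    # A method is added by qualifying the function with its owning module.
    BaseStub.HealthTable(x::Vector{Int}) = (data = x, validated = true)
end

using .BaseStub

n_after = length(methods(HealthTable))
result = HealthTable([1, 2, 3])
```

Calling `HealthTable` before any method has been added raises a `MethodError`, which is presumably the behavior users would see when the extension's trigger packages have not been loaded.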