Merged
Changes from 10 commits
Commits
33 commits
6a5e645
enabled extension on DataFrames + OMOP CDM load
kosuri-indu Jun 3, 2025
414e423
added omop stub and updated HealthBase accordingly
kosuri-indu Jun 3, 2025
654d9b4
updated omop cdm extension
kosuri-indu Jun 4, 2025
4312f9d
added sketch notes on the interface
kosuri-indu Jun 5, 2025
4bd0f7e
Merge branch 'master' into health-table
kosuri-indu Jun 10, 2025
6a1c61a
refactored and updated in docs
kosuri-indu Jun 10, 2025
7c83da5
added sample tests
kosuri-indu Jun 11, 2025
431344a
Merge branch 'master' into health-table
kosuri-indu Jun 11, 2025
72374b7
added struct healthtable in comments
kosuri-indu Jun 11, 2025
3b24436
Merge branch 'health-table' of https://github.com/kosuri-indu/HealthB…
kosuri-indu Jun 11, 2025
6e1bfce
Merge branch 'master' into health-table
kosuri-indu Jun 20, 2025
fc8e0b7
addons and working test code
kosuri-indu Jun 22, 2025
8111e25
made it Tables.jl compatible
kosuri-indu Jun 22, 2025
688d47c
resolved review changes and added preprocessing utilities
kosuri-indu Jun 25, 2025
fae3026
added other preprocessing functions
kosuri-indu Jun 27, 2025
452ebb1
added new strategies and tests
kosuri-indu Jun 27, 2025
a35ff7a
Merge branch 'master' into health-table
kosuri-indu Jun 30, 2025
9884f6a
updated verification metadata and onehotencoding function
kosuri-indu Jul 1, 2025
5983bfd
updated map_concepts function
kosuri-indu Jul 1, 2025
876bdc9
update functions and review changes
kosuri-indu Jul 5, 2025
572e190
updated docs
kosuri-indu Jul 6, 2025
7c33d3b
updated all docs
kosuri-indu Jul 6, 2025
edec226
updated all test and docs
kosuri-indu Jul 7, 2025
094f126
final changes
kosuri-indu Jul 11, 2025
7c69a8b
updated docs, removed unnecessary dependency
kosuri-indu Jul 11, 2025
5f56640
julia-actions errors fix
kosuri-indu Jul 11, 2025
6e96078
removed stats from runtests
kosuri-indu Jul 11, 2025
dd1a2db
updated tests for code coverage
kosuri-indu Aug 23, 2025
c22c3db
updated omopcdm ext tests for code coverage
kosuri-indu Aug 23, 2025
142ec04
removed redundant tests
kosuri-indu Aug 24, 2025
6a83ab4
updated ext edge case tests for coverage
kosuri-indu Aug 24, 2025
beac389
validation tests to cover remaining lines
kosuri-indu Aug 24, 2025
db2caec
Re-run CI
kosuri-indu Aug 24, 2025
3 changes: 3 additions & 0 deletions Project.toml
@@ -4,10 +4,13 @@ authors = ["Dilum Aluthge", "contributors"]
version = "2.0.0"

[weakdeps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
OMOPCommonDataModel = "ba65db9e-6590-4054-ab8a-101ed9124986"

[extensions]
HealthBaseDrWatsonExt = "DrWatson"
HealthBaseOMOPCDMExt = ["DataFrames", "OMOPCommonDataModel"]

[compat]
julia = "1.10"
55 changes: 55 additions & 0 deletions docs/src/HealthTable.md
@@ -0,0 +1,55 @@
# Tables.jl Interface for OMOP Common Data Model in HealthBase.jl

## Core Goals & Features

The proposed interface aims to provide:

- Schema-Aware Access: Standardized, schema-aware access to core OMOP CDM tables (for example, `PERSON`, `CONDITION_OCCURRENCE`, `DRUG_EXPOSURE`, and `OBSERVATION_PERIOD`). Schema awareness will be derived from `OMOPCommonDataModel.jl`.
- Preprocessing Utilities: Built-in or easily integrable support for common preprocessing tasks, including:
  - One-hot encoding.
  - Normalization.
  - Handling missing values.
  - Vocabulary compression for high-cardinality categorical variables.
- Concept Mapping: Utilities to aggregate or map related medical codes (for example, grouping SNOMED conditions).
- JuliaHealth Integration: Interoperability with existing and future JuliaHealth tools, such as:
  - `OMOPCDMCohortCreator.jl`
  - `MLJ.jl` (for machine learning pipelines)
  - `OHDSICohortExpressions.jl`
- Foundation for Interoperability: Serve as a foundational layer for broader interoperability across the JuliaHealth ecosystem, supporting researchers working with OMOP CDM-styled data.
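
As a rough illustration of the vocabulary-compression goal listed above, the sketch below keeps the most frequent codes of a categorical column and collapses the rest into one placeholder value. The function name `compress_vocabulary`, the `max_levels` keyword, and the `other` placeholder are hypothetical and are not part of `HealthBase.jl`.

```julia
# Hypothetical helper, not part of HealthBase.jl: keep the `max_levels` most
# frequent values of a categorical column and replace all others with `other`.
function compress_vocabulary(values::AbstractVector; max_levels::Int=2, other=0)
    # Count how often each distinct value occurs.
    counts = Dict{eltype(values),Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    # Rank distinct values by descending frequency; ties are broken by value
    # so that the result is deterministic.
    ranked = sort(collect(keys(counts)); by = v -> (-counts[v], v))
    keep = Set(ranked[1:min(max_levels, length(ranked))])
    return [v in keep ? v : other for v in values]
end

# Illustrative concept IDs only.
concept_ids = [201826, 201826, 433736, 317009, 433736, 201826, 4329847]
compressed = compress_vocabulary(concept_ids; max_levels=2, other=0)
```

Here the two most frequent IDs survive and the two rare ones are replaced by `0`. A real utility would probably need a placeholder that cannot collide with a valid concept ID.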

## Proposed `Tables.jl` Interface Sketch

Before data is wrapped by the `Tables.jl` interface described below, it's generally expected to undergo initial validation and preparation. This is typically handled by the `HealthBase.HealthTable` function (itself an extension within `HealthBase.jl` that uses `OMOPCommonDataModel.jl`). `HealthTable` takes a source (like a `DataFrame`), validates its structure and column types against the specific OMOP CDM table schema, attaches relevant metadata, and returns a conformed `DataFrame`.

The `OMOPCDMTable` wrappers discussed next would then ideally consume this validated `DataFrame` (the output of `HealthTable`) to provide a standardized, schema-aware `Tables.jl` view for further operations and interoperability.

The core idea is to define wrapper types around OMOP CDM data sources. Initially, we can focus on in-memory `DataFrame`s, but the design should be extensible to database connections or other `Tables.jl`-compatible sources. These wrapper types will implement the `Tables.jl` interface.

## The `HealthTable` Struct

The core of the interface is the `HealthTable` struct.

```julia
@kwdef struct HealthTable  # Tables.jl has no `AbstractTable` supertype to subtype
    source::DataFrame
    omopcdm_version::String
    function HealthTable(source, omopcdm_version)
        # validation code goes here
        return new(source, omopcdm_version)
    end
end
```

## `Tables.jl` API Implementation

The `OMOPCDMTable` wrapper types will implement key `Tables.jl` methods:

- `Tables.istable`
- `Tables.rowaccess`
- `Tables.rows`
- `Tables.columnaccess`
- `Tables.columns`
- `Tables.schema`
- `Tables.materializer`

Source: https://tables.juliadata.org/stable/implementing-the-interface/
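
To make the method list above more concrete, here is a minimal sketch of a wrapper type that forwards the `Tables.jl` interface to its wrapped source. `DemoHealthTable` is a hypothetical stand-in, not the type from this PR, and it stores a `NamedTuple` of vectors (which `Tables.jl` already treats as a column table) instead of a `DataFrame` so that the example only needs `Tables.jl`. `Tables.materializer` is left out for brevity.

```julia
using Tables

# Hypothetical stand-in for the wrapper type; the column data is held as a
# NamedTuple of vectors, which Tables.jl already accepts as a column table.
struct DemoHealthTable
    source::NamedTuple
    omopcdm_version::String
end

# Opt in to the interface (there is no abstract supertype to inherit from).
Tables.istable(::Type{DemoHealthTable}) = true

# Column access: delegate to the wrapped source.
Tables.columnaccess(::Type{DemoHealthTable}) = true
Tables.columns(t::DemoHealthTable) = Tables.columns(t.source)

# Row access: delegate as well.
Tables.rowaccess(::Type{DemoHealthTable}) = true
Tables.rows(t::DemoHealthTable) = Tables.rows(t.source)

# The schema is whatever the wrapped source reports.
Tables.schema(t::DemoHealthTable) = Tables.schema(t.source)

demo = DemoHealthTable((person_id = [101, 102], year_of_birth = [1990, 1985]), "5.4")
sch = Tables.schema(demo)
person_ids = [row.person_id for row in Tables.rows(demo)]
```

Because every method forwards to the source, any `Tables.jl`-compatible consumer sees the wrapper exactly as it would see the underlying data, while the wrapper remains free to carry extra fields such as the CDM version.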
79 changes: 79 additions & 0 deletions docs/src/OMOPCDMWorkflow.md
@@ -0,0 +1,79 @@
# OMOP CDM Workflow with HealthTable

## Typical Workflow

The envisioned process for working with OMOP CDM data using these `HealthBase.jl` components typically follows these steps:

1. **Data Loading:**
   Raw data is loaded into a suitable tabular structure, most commonly a `DataFrame`.

2. **Validation and Conformance with `HealthTable`:**
   The raw `DataFrame` is then processed by the `HealthBase.HealthTable` function. This function takes the `DataFrame` and an OMOP CDM version string (for example, `"5.4"`) and validates the structure and column types against the OMOP CDM schema for that version.
   * It checks whether the data type of each recognized column is compatible with the official OMOP CDM type (as defined in `OMOPCommonDataModel.jl`).
   * With the default `disable_type_enforcement=false`, a type mismatch raises an error; with `disable_type_enforcement=true`, it emits a warning instead.
   * Crucially, it attaches metadata to the columns, indicating their official OMOP CDM types.
   * The output is a `DataFrame` that is now validated and conformed to the specified OMOP CDM table structure.

3. **Wrapping with `OMOPCDMTable`:**
   The validated and conformed `DataFrame` (output from `HealthTable`) is then wrapped in an `OMOPCDMTable` to provide a schema-aware `Tables.jl` interface. This wrapper uses the same `OMOPCommonDataModel.jl` types to ensure consistency.

4. **Interacting via `Tables.jl`:**
   Once wrapped, the `OMOPCDMTable` instance can be used with any `Tables.jl`-compatible tool and with the standard `Tables.jl` functions.

5. **Applying Preprocessing Utilities:**
   Once the data is an `OMOPCDMTable`, common preprocessing steps needed for analysis or predictive modeling can be applied. These methods, built upon the `Tables.jl` interface, include:
   * One-hot encoding.
   * Handling of high-cardinality categorical variables.
   * Concept mapping utilities to group related codes (for example, SNOMED conditions).
   * Normalization, missing value imputation, etc.
   These utilities would typically return a new (or modified) `OMOPCDMTable` or a materialized `DataFrame`, ready for further use.
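
The type check in step 2 can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: the `expected_types` dictionary is hand-written here, whereas the extension reads field definitions from `OMOPCommonDataModel.jl`, and `check_columns` is a hypothetical helper that collects problems instead of warning or throwing.

```julia
using Dates

# Hypothetical expected schema; the real field definitions would come from
# OMOPCommonDataModel.jl rather than a hand-written Dict.
expected_types = Dict(
    :person_id => Integer,
    :condition_concept_id => Integer,
    :condition_start_date => Date,
)

# Return a list of human-readable problems instead of throwing, so that the
# caller decides whether to warn or to error.
function check_columns(table::NamedTuple, expected::AbstractDict)
    problems = String[]
    for (name, column) in pairs(table)
        haskey(expected, name) || continue  # unknown columns are skipped here
        actual = eltype(column)
        wanted = expected[name]
        if !(actual <: wanted)
            push!(problems, "Column '$(name)' has type $(actual), but expected a subtype of $(wanted)")
        end
    end
    return problems
end

good = (person_id = [101, 102], condition_start_date = [Date(2010, 1, 1), Date(2012, 5, 10)])
bad = (person_id = ["101", "102"], condition_concept_id = [201826, 433736])

good_problems = check_columns(good, expected_types)
bad_problems = check_columns(bad, expected_types)
```

Collecting the messages first keeps the policy decision (warn versus error) in one place, which is roughly the role `disable_type_enforcement` plays in the workflow above.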


## Example Usage (Conceptual)

```julia
using HealthBase # (once the OMOP Tables interface is part of it)
using OMOPCommonDataModel
using DataFrames, Dates, Tables # DataFrames is an example source; Dates and Tables provide `Date` and `Tables.schema`

# Assume 'condition_occurrence_df' is a DataFrame loaded from a CSV/database
condition_occurrence_df = DataFrame(
    condition_occurrence_id = [1, 2, 3],
    person_id = [101, 102, 101],
    condition_concept_id = [201826, 433736, 317009],
    condition_start_date = [Date(2010,1,1), Date(2012,5,10), Date(2011,3,15)]
    # ... other fields
)

# Validate and wrap the DataFrame with HealthTable
ht_conditions = HealthTable(condition_occurrence_df; omop_cdm_version="5.4")


# 1. Schema Inspection
sch = Tables.schema(ht_conditions)
println("Schema Names: ", sch.names)
println("Schema Types: ", sch.types)
# This should output the names and types from the validated DataFrame

# 2. Iteration (Rows)
for row in Tables.rows(ht_conditions)
    # 'row' is a Tables.Row, with fields matching the OMOP schema
    println("Person ID: $(row.person_id), Condition: $(row.condition_concept_id)")
end

# 3. Integration with other packages (example: MLJ.jl)
# 4. Materialization
# ...
# and so on
```

## Preprocessing and Utilities Sketch

Preprocessing utilities can operate on `OMOPCDMTable` objects (or their materialized versions), leveraging the `Tables.jl` interface and schema awareness derived via `Tables.schema`. Examples include:

- `one_hot_encode(table::OMOPCDMTable, column_symbol::Symbol; drop_original=true)`
- `normalize_column(table::OMOPCDMTable, column_symbol::Symbol; method=:z_score)`
- `apply_vocabulary_compression(table::OMOPCDMTable, column_symbol::Symbol, mapping_dict::Dict)`
- `map_concepts(table::OMOPCDMTable, column_symbol::Symbol, concept_map::AbstractDict)`

These functions would follow the principle of optional, user-triggered transformations, possibly controlled by keyword arguments.
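
The following sketch shows how two of the utilities above might behave. It is an assumption-laden illustration, not the PR's implementation: the `_sketch` function names are hypothetical, the table is a plain `NamedTuple` of column vectors rather than an `OMOPCDMTable`, and the concept grouping is made up.

```julia
# Hypothetical helpers operating on a NamedTuple of column vectors; they are
# not the functions shipped by HealthBase.jl.

# Replace each value found in `concept_map`; unmapped values pass through.
function map_concepts_sketch(table::NamedTuple, column::Symbol, concept_map::AbstractDict)
    mapped = [get(concept_map, v, v) for v in table[column]]
    return merge(table, NamedTuple{(column,)}((mapped,)))
end

# Add one indicator column per distinct value of `column`.
function one_hot_encode_sketch(table::NamedTuple, column::Symbol; drop_original::Bool=true)
    levels = sort(unique(table[column]))
    names = Tuple(Symbol(column, :_, level) for level in levels)
    indicators = Tuple(table[column] .== level for level in levels)
    encoded = merge(table, NamedTuple{names}(indicators))
    drop_original || return encoded
    # Keep every column except the original one.
    kept = filter(!=(column), keys(encoded))
    return NamedTuple{kept}(encoded)
end

conditions = (person_id = [101, 102, 101], condition_concept_id = [201826, 433736, 317009])

# Made-up grouping: fold two concept IDs into one group ID (1).
grouped = map_concepts_sketch(conditions, :condition_concept_id, Dict(201826 => 1, 433736 => 1))
encoded = one_hot_encode_sketch(grouped, :condition_concept_id)
```

Both helpers return a new table and leave the input untouched, which matches the idea that transformations are optional and triggered explicitly by the user.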
58 changes: 58 additions & 0 deletions ext/HealthBaseOMOPCDMExt.jl
@@ -0,0 +1,58 @@
module HealthBaseOMOPCDMExt

using HealthBase
using DataFrames
using OMOPCommonDataModel

function __init__()
    @info "OMOP CDM extension for HealthBase has been loaded!"
end

"""
    HealthTable(df::DataFrame; omop_cdm_version="5.4", disable_type_enforcement=false)
Constructs a `HealthTable` for an OMOP CDM dataset by validating the given `DataFrame`.
This constructor validates the `DataFrame` against the specified OMOP CDM version. It checks that
column names are valid OMOP CDM fields and that their data types are compatible. It then
attaches all available metadata from the OMOP CDM specification to the DataFrame's columns.
If `disable_type_enforcement` is true, type mismatches will emit warnings instead of errors.
Returns a `HealthTable` object wrapping the validated DataFrame.
"""
function HealthBase.HealthTable(df::DataFrame; omop_cdm_version::String="5.4", disable_type_enforcement=false)
    if !haskey(OMOPCommonDataModel.OMOPCDM_VERSIONS, omop_cdm_version)
        throw(ArgumentError("OMOP CDM version '$(omop_cdm_version)' is not supported. Available versions: $(keys(OMOPCommonDataModel.OMOPCDM_VERSIONS))"))
    end

    omop_fields = OMOPCommonDataModel.OMOPCDM_VERSIONS[omop_cdm_version][:fields]

    for col in names(df)
        col_symbol = Symbol(col)
        if haskey(omop_fields, col_symbol)
            fieldinfo = omop_fields[col_symbol]
            expected_type = get(fieldinfo, :type, Any)
            actual_type = eltype(df[!, col_symbol])

            if !(actual_type <: expected_type)
                msg = "Column '$(col)' has type $(actual_type), but expected a subtype of $(expected_type)"
                if disable_type_enforcement
                    @warn msg
                else
                    throw(ArgumentError(msg))
                end
            end

            for (key, val) in fieldinfo
                if !isnothing(val)
                    colmetadata!(df, col_symbol, String(key), string(val); style=:note)
                end
            end
        end
    end

    return HealthBase.HealthTable(source=df, omopcdm_version=omop_cdm_version)
end

end
16 changes: 16 additions & 0 deletions src/HealthBase.jl
@@ -2,8 +2,19 @@ module HealthBase

using Base: get_extension

# Loading Tables here gives a precompilation error,
# probably an issue with the dependencies.

# using DataFrames
# using Tables
# using Base: @kwdef
using Base.Experimental: register_error_hint

# @kwdef struct HealthTable <: Tables.AbstractTable
# source::DataFrame
# omopcdm_version::String
# end

include("exceptions.jl")

function __init__()
@@ -12,6 +23,10 @@ function __init__()
if isnothing(get_extension(HealthBase, :HealthBaseDrWatsonExt))
_extension_message("DrWatson", cohortsdir, io)
end
elseif exc.f == HealthTable
if isnothing(get_extension(HealthBase, :HealthBaseOMOPCDMExt))
_extension_message("OMOPCommonDataModel and DataFrames", HealthTable, io)
end
elseif exc.f == initialize_study
if isnothing(get_extension(HealthBase, :HealthBaseDrWatsonExt))
_extension_message("DrWatson", initialize_study, io)
@@ -25,5 +40,6 @@ function __init__()
end

include("drwatson_stub.jl")
include("omopcdm_stub.jl")

end
3 changes: 3 additions & 0 deletions src/omopcdm_stub.jl
@@ -0,0 +1,3 @@
struct HealthTable end

export HealthTable
3 changes: 3 additions & 0 deletions test/Project.toml
@@ -1,4 +1,7 @@
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
OMOPCommonDataModel = "ba65db9e-6590-4054-ab8a-101ed9124986"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
17 changes: 17 additions & 0 deletions test/omopcdmext.jl
@@ -0,0 +1,17 @@
@testset "Simple HealthTable OMOP CDM Extension Test" begin
    person_df = DataFrame(
        person_id=1,
        gender_concept_id=8507,
        year_of_birth=1990,
        month_of_birth=1,
        day_of_birth=1,
        birth_datetime=DateTime(1990, 1, 1),
        race_concept_id=0,
        ethnicity_concept_id=0
    )

    ht = HealthTable(person_df; omop_cdm_version="5.4")

    @test ht isa HealthBase.HealthTable
    @test ht.omopcdm_version == "5.4"
end
8 changes: 7 additions & 1 deletion test/runtests.jl
@@ -1,7 +1,9 @@
using DrWatson
using HealthBase
using Pkg
using Test
using DataFrames
using OMOPCommonDataModel
using Dates

@testset "Exceptions" begin
include("exceptions.jl")
@@ -10,3 +12,7 @@ end
@testset "HealthBaseDrWatsonExt" begin
include("drwatsonext.jl")
end

@testset "HealthBaseOMOPCDMExt" begin
include("omopcdmext.jl")
end