JuliaHealth · TheCedarPrince · Aug 27, 2025 · Jun 3, 2025 · Jun 3, 2025 · Jun 4, 2025
diff --git a/Project.toml b/Project.toml
@@ -3,13 +3,25 @@ uuid = "94e1309d-ccf4-42de-905f-515f1d7b1cae"
 authors = ["Dilum Aluthge", "contributors"]
 version = "2.0.0"
 
+[deps]
+Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
+
 [weakdeps]
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
+Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
+Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
+InlineStrings = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
+OMOPCommonDataModel = "ba65db9e-6590-4054-ab8a-101ed9124986"
 
 [extensions]
 HealthBaseDrWatsonExt = "DrWatson"
+HealthBaseOMOPCDMExt = ["DataFrames", "OMOPCommonDataModel", "InlineStrings", "Serialization", "Statistics", "Dates"]
 
 [compat]
+Dates = "1.11.0"
+Tables = "1.12.1"
 julia = "1.10"
 
 [extras]

diff --git a/assets/version_info b/assets/version_info
diff --git a/docs/src/HealthTableGeneral.md b/docs/src/HealthTableGeneral.md
@@ -0,0 +1,30 @@
+# HealthTable: Tables.jl Interface (General)
+
+## The `HealthTable` Struct
+
+The core of the interface is the `HealthTable` struct. 
+
+```julia
+@kwdef struct HealthTable
+    source::DataFrame
+    omop_cdm_version::String
+    # Inner constructor: one positional argument per field, as `@kwdef` requires
+    function HealthTable(source, omop_cdm_version)
+        # validation code goes here
+        return new(source, omop_cdm_version)
+    end
+end
+```
+
+## `Tables.jl` API Implementation
+
+The `HealthTable` wrapper types will implement key `Tables.jl` methods:
+
+- `Tables.istable`
+- `Tables.rowaccess`
+- `Tables.rows`
+- `Tables.columnaccess`
+- `Tables.columns`
+- `Tables.schema`
+- `Tables.materializer`
+
+Source: https://tables.juliadata.org/stable/implementing-the-interface/
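+
+As a rough illustration of the list above, a wrapper can satisfy the interface by forwarding each method to the table it wraps. This is a minimal sketch, not the `HealthBase.jl` implementation: `WrappedTable` is a hypothetical stand-in for `HealthTable`, and a `NamedTuple` of vectors is used as the source so that only `Tables.jl` is needed.
+
+```julia
+using Tables
+
+# Hypothetical stand-in for `HealthTable`; the real type may differ.
+struct WrappedTable{T}
+    source::T
+end
+
+# Declare the wrapper as a table and forward access to the wrapped source.
+Tables.istable(::Type{<:WrappedTable}) = true
+Tables.rowaccess(::Type{<:WrappedTable}) = true
+Tables.rows(t::WrappedTable) = Tables.rows(t.source)
+Tables.columnaccess(::Type{<:WrappedTable}) = true
+Tables.columns(t::WrappedTable) = Tables.columns(t.source)
+Tables.schema(t::WrappedTable) = Tables.schema(t.source)
+# In this sketch, results are materialized as a column table (NamedTuple of vectors).
+Tables.materializer(::Type{<:WrappedTable}) = Tables.columntable
+
+wt = WrappedTable((person_id = [101, 102], year_of_birth = [1985, 1992]))
+sch = Tables.schema(wt)
+println(sch.names)
+```
+
+Forwarding keeps the wrapper thin: any source that is already a `Tables.jl` table works without conversion.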
diff --git a/docs/src/HealthTableOMOPCDM.md b/docs/src/HealthTableOMOPCDM.md
@@ -0,0 +1,27 @@
+# OMOP CDM Support for HealthTable
+
+## Core Goals & Features
+
+The proposed interface aims to provide:
+
+- Schema-Aware Validation: Instead of just wrapping your data, `HealthBase.jl` actively validates it against the official OMOP Common Data Model specification. "Schema-aware" means it understands the expected structure for a given OMOP CDM version (e.g., "v5.4.0") by using `OMOPCommonDataModel.jl`. This includes:
+    - Column Type Enforcement: It checks that the data types in your table (e.g., a `DataFrame`) match the official types. For example, it ensures that a `person_id` is an integer and a `condition_start_date` is a `Date`.
+    - Error Reporting: If there are mismatches, it provides clear, actionable error messages listing all columns that do not conform to the schema, helping you fix your data quickly.
+- Preprocessing Utilities: Built-in or easily integrable support for common preprocessing tasks, including:
+    - One-hot encoding.
+    - Normalization.
+    - Handling missing values.
+    - Vocabulary compression for high-cardinality categorical variables.
+- JuliaHealth Integration: Seamless interoperability with existing and future JuliaHealth tools, such as:
+    - `OMOPCDMCohortCreator.jl`
+    - `MLJ.jl` (for machine learning pipelines)
+    - `OHDSICohortExpressions.jl`
+- Foundation for Interoperability: Serve as a foundational layer for broader interoperability across the JuliaHealth ecosystem, supporting researchers working with OMOP CDM-styled data.
+
+## Proposed `Tables.jl` Interface Sketch
+
+Before data is exposed through the `Tables.jl` interface described below, it is expected to undergo initial validation and preparation. This is handled by the `HealthTable` constructor (implemented in a `HealthBase.jl` package extension that uses `OMOPCommonDataModel.jl`). The constructor takes a source (like a `DataFrame`), validates its structure and column types against the specific OMOP CDM table schema, attaches relevant metadata, and wraps the conformed `DataFrame`.
+
+The resulting `HealthTable` then provides a standardized, schema-aware `Tables.jl` view of this validated `DataFrame` for further operations and interoperability.
+
+The core idea is to define wrapper types around OMOP CDM data sources. Initially, we can focus on in-memory `DataFrame`s, but the design should be extensible to database connections or other `Tables.jl`-compatible sources. These wrapper types will implement the `Tables.jl` interface.
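+
+The column type enforcement and error reporting described under "Core Goals & Features" could look roughly like the sketch below. It is illustrative only: `EXPECTED_TYPES` and `schema_mismatches` are hypothetical names, and the expected types are hard-coded here, whereas the proposal derives them from `OMOPCommonDataModel.jl`.
+
+```julia
+using Dates
+
+# Hypothetical expected types for a few condition_occurrence columns.
+# In the proposal these would come from OMOPCommonDataModel.jl instead.
+const EXPECTED_TYPES = Dict(
+    :person_id => Integer,
+    :condition_concept_id => Integer,
+    :condition_start_date => Date,
+)
+
+# Collect every mismatch instead of stopping at the first one.
+function schema_mismatches(columns::NamedTuple, expected::AbstractDict)
+    problems = String[]
+    for (name, col) in pairs(columns)
+        haskey(expected, name) || continue      # skip columns not in the schema
+        actual = nonmissingtype(eltype(col))    # ignore `Missing` in the element type
+        if !(actual <: expected[name])
+            push!(problems, "$name: expected $(expected[name]), got $actual")
+        end
+    end
+    return problems
+end
+
+table = (
+    person_id = [101, 102],
+    condition_concept_id = ["201826", "433736"],    # wrong: strings, not integers
+    condition_start_date = [Date(2010, 1, 1), Date(2012, 5, 10)],
+)
+problems = schema_mismatches(table, EXPECTED_TYPES)
+foreach(println, problems)
+```
+
+Gathering all problems before reporting lets a user fix every non-conforming column in one pass.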
diff --git a/docs/src/OMOPCDMWorkflow.md b/docs/src/OMOPCDMWorkflow.md
@@ -0,0 +1,79 @@
+# OMOP CDM Workflow with HealthTable
+
+## Typical Workflow
+
+The envisioned process for working with OMOP CDM data using these `HealthBase.jl` components typically follows these steps:
+
+1.  **Data Loading**:
+    Raw data is loaded into a suitable tabular structure, most commonly a `DataFrame`.
+
+2.  **Validation and Conformance with `HealthTable`:**
+    The raw `DataFrame` is passed to the `HealthTable` constructor together with an OMOP CDM version string (example: "v5.4.0"). The constructor validates the table's structure and column types against the OMOP CDM schema for that version.
+    *   It checks whether the data types of the columns are compatible with the official OMOP CDM types (as defined in `OMOPCommonDataModel.jl`).
+    *   It can warn about discrepancies or, if `disable_type_enforcement=false`, potentially error or attempt safe conversions.
+    *   Crucially, it attaches metadata to the columns, indicating their official OMOP CDM types.
+    *   The result is a `DataFrame` that is validated and conformed to the specified OMOP CDM table structure.
+
+3.  **Wrapping with `HealthTable`:**
+    The validated and conformed `DataFrame` is stored as the `source` of the resulting `HealthTable`, which provides a schema-aware `Tables.jl` interface over it. The wrapper relies on the same `OMOPCommonDataModel.jl` type definitions to ensure consistency.
+
+4.  **Interacting via `Tables.jl`:**
+    Once wrapped, the `HealthTable` instance can be used with any `Tables.jl`-compatible tools and standard `Tables.jl` functions.
+
+5.  **Applying Preprocessing Utilities:**
+    Once the data is a `HealthTable`, common preprocessing steps essential for analysis or predictive modeling can be applied. These methods, built upon the `Tables.jl` interface, include:
+    *   One-hot encoding.
+    *   Handling of high-cardinality categorical variables.
+    *   Concept mapping utilities to group related codes (example: SNOMED conditions).
+    *   Normalization, missing value imputation, etc.
+    These utilities would typically return a new (or modified) `HealthTable` or a materialized `DataFrame`, ready for further use.
+
+
+## Example Usage (Conceptual)
+
+```julia
+using HealthBase # (once the OMOP Tables interface is part of it)
+using OMOPCommonDataModel
+using DataFrames # an example source
+using Dates      # provides `Date`
+using Tables     # provides `Tables.schema` and `Tables.rows`
+
+# Assume 'condition_occurrence_df' is a DataFrame loaded from a CSV/database
+condition_occurrence_df = DataFrame(
+    condition_occurrence_id = [1, 2, 3],
+    person_id = [101, 102, 101],
+    condition_concept_id = [201826, 433736, 317009],
+    condition_start_date = [Date(2010,1,1), Date(2012,5,10), Date(2011,3,15)]
+    # ... other fields
+)
+
+# Validate and wrap the DataFrame with HealthTable
+ht_conditions = HealthTable(source=condition_occurrence_df, omop_cdm_version="v5.4.0")
+
+
+# 1. Schema Inspection
+sch = Tables.schema(ht_conditions)
+println("Schema Names: ", sch.names)
+println("Schema Types: ", sch.types)
+# This should output the names and types from the validated DataFrame
+
+# 2. Iteration (Rows)
+for row in Tables.rows(ht_conditions)
+    # 'row' is a Tables.Row, with fields matching the OMOP schema
+    println("Person ID: $(row.person_id), Condition: $(row.condition_concept_id)")
+end
+
+# 3. Integration with other packages (example: MLJ.jl)
+# 4. Materialization
+# ...
+# and so on
+```
+
+## Preprocessing and Utilities Sketch
+
+Preprocessing utilities can operate on `HealthTable` objects (or their materialized versions), leveraging the `Tables.jl` interface and schema awareness derived via `Tables.schema`. Examples include:
+
+- `one_hot_encode(ht::HealthTable, column_symbol::Symbol; drop_original=true)`
+- `normalize_column(ht::HealthTable, column_symbol::Symbol; method=:z_score)`
+- `apply_vocabulary_compression(ht::HealthTable, column_symbol::Symbol, mapping_dict::Dict)`
+- `map_concepts(ht::HealthTable, column_symbol::Symbol, concept_map::AbstractDict)`
+
+These functions would align with the principle of optional, user-triggered transformations, possibly controlled by keyword arguments.
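+
+To make the one-hot encoding utility above concrete, here is a minimal sketch that works on a column table (a `NamedTuple` of equal-length vectors) rather than on a `HealthTable`. The name `one_hot_sketch` and the `<column>_<level>` naming scheme are assumptions made for illustration, not the package's API.
+
+```julia
+# Hypothetical helper: one-hot encode one column of a column table
+# (a NamedTuple of equal-length vectors).
+function one_hot_sketch(cols::NamedTuple, column::Symbol; drop_original::Bool = true)
+    vals = cols[column]
+    levels = sort(unique(vals))
+    # One Bool indicator column per distinct level, named `<column>_<level>`.
+    new_names = Tuple(Symbol(column, "_", level) for level in levels)
+    new_cols = Tuple(vals .== level for level in levels)
+    indicators = NamedTuple{new_names}(new_cols)
+    # Keep every original column, minus the encoded one when requested.
+    keep = Tuple(n for n in keys(cols) if !(drop_original && n == column))
+    return merge(NamedTuple{keep}(cols), indicators)
+end
+
+people = (person_id = [101, 102, 103], gender_concept_id = [8507, 8532, 8507])
+encoded = one_hot_sketch(people, :gender_concept_id)
+println(keys(encoded))
+```
+
+Returning a new table instead of mutating the input matches the idea that these transformations are optional and user-triggered.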
diff --git a/docs/src/quickstart.md b/docs/src/quickstart.md
@@ -0,0 +1,86 @@
+# Quickstart: Preprocessing OMOP Data
+
+This guide demonstrates a practical, end-to-end workflow for cleaning and transforming raw patient data into a format suitable for machine learning using the `HealthBase.jl` preprocessing utilities.
+
+### 1. Load Packages
+
+First, start a Julia session in your project environment and load the necessary packages.
+
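+NOTE: The preprocessing functions live in a package extension, which Julia activates only once `HealthBase` and all of the trigger packages (`DataFrames`, `OMOPCommonDataModel`, `InlineStrings`, `Serialization`, `Statistics`, and `Dates`) are loaded in the session. Loading the trigger packages first, as shown here, is a simple way to make sure none is forgotten. See the "For Developers" section below for more information.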
+
+```julia
+# First, load the trigger packages
+using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Statistics, Dates
+
+# Then, load HealthBase
+using HealthBase
+```
+
+### 2. Create a Sample Dataset
+
+We'll start with a sample `DataFrame` that mimics raw data from a clinical database. It includes missing values, categorical data, and different data types.
+
+```julia
+raw_df = DataFrame(
+    person_id = 101:108,
+    gender_concept_id = [8507, 8532, 8507, 8532, 8507, 8532, 8507, 8507],
+    year_of_birth = [1985, 1992, 1985, 1978, 2000, 2001, 1992, 1988],
+    race_concept_id = [8527, 8515, 8527, 8516, 8527, 8515, 8516, 8527],
+    cholesterol = [189, 210, 240, missing, 195, 220, missing, 205.0]
+)
+
+ht = HealthTable(source=raw_df, omop_cdm_version="v5.4.1")
+```
+
+### 3. Preprocessing Pipeline
+
+Now, we'll apply a series of transformations to clean and prepare the data.
+
+#### Step A: Impute Missing Values
+
+First, we'll fill in the `missing` values in the `cholesterol` column with the mean of that column's observed values.
+
+```julia
+ht_imputed = impute_missing(ht; cols=[:cholesterol], strategy=:mean)
+```
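+
+For intuition, mean imputation on a single column can be sketched in plain Julia as follows. `impute_mean_sketch` is a hypothetical helper written for this guide, and the actual `impute_missing` implementation may differ. With the sample data, the six observed values sum to 1259, so each `missing` entry becomes 1259 / 6 ≈ 209.83.
+
+```julia
+using Statistics
+
+# Hypothetical helper: replace `missing` entries with the mean of the observed values.
+function impute_mean_sketch(col::AbstractVector)
+    col_mean = mean(skipmissing(col))
+    return coalesce.(col, col_mean)
+end
+
+cholesterol = [189, 210, 240, missing, 195, 220, missing, 205.0]
+imputed = impute_mean_sketch(cholesterol)
+println(imputed)
+```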
+
+#### Step B: One-Hot Encode Categorical Features
+
+Next, we convert the categorical `gender_concept_id` and `race_concept_id` columns into numerical, binary columns.
+
+```julia
+ht_onehot = one_hot_encode(ht_imputed; cols=[:gender_concept_id, :race_concept_id])
+```
+
+#### Step C: Normalize Numerical Features
+
+Finally, we scale the `cholesterol` column to have a mean of 0 and a standard deviation of 1. This helps many machine learning algorithms perform better.
+
+```julia
+ht_final = normalize_column(ht_onehot; cols=[:cholesterol])
+
+println("--- Final model-ready data ---")
+println(ht_final.source)
+```
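+
+The scaling in this step is a z-score: each value has the column mean subtracted and is then divided by the standard deviation. The sketch below shows the arithmetic on a plain vector. `zscore_sketch` is a hypothetical helper, and using the sample standard deviation is an assumption, since `normalize_column` may compute it differently.
+
+```julia
+using Statistics
+
+# Hypothetical helper: z-score scaling, (x - mean) / standard deviation.
+function zscore_sketch(col::AbstractVector{<:Real})
+    mu = mean(col)
+    sigma = std(col)    # sample standard deviation (n - 1 in the denominator)
+    return (col .- mu) ./ sigma
+end
+
+scaled = zscore_sketch([189.0, 210.0, 240.0, 195.0, 220.0, 205.0])
+println(scaled)
+```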
+
+After these steps, `ht_final` contains a fully preprocessed, numerical dataset that is ready to be used for model training.
+
+### For Developers: Interactive Use in the REPL
+
+When working with `HealthBase.jl` interactively in the Julia REPL, especially during development, make sure every trigger package is loaded so that the package extension is activated. Julia loads an extension once its parent package and all of its trigger packages are loaded.
+
+If you call a function from an extension (like `impute_missing`) and get a `MethodError`, the extension was most likely not loaded. To fix this, load any missing "trigger packages"; loading them **before** `HealthBase` makes it easy to confirm that none is missing.
+
+For the OMOP CDM extension, the trigger packages are `DataFrames`, `OMOPCommonDataModel`, `InlineStrings`, `Serialization`, `Statistics`, and `Dates`.
+
+**Recommended Loading Order:**
+```julia
+# First, load the trigger packages
+using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Statistics, Dates
+
+# Then, load HealthBase
+using HealthBase
+
+# Now, functions from the extension will be available
+# ht_imputed = impute_missing(ht; cols=[:cholesterol], strategy=:mean) # This will now work
+```