Use narwhals as dataframe-agnostic backend #7964
base: main
Conversation
Pull Request Overview
This PR introduces narwhals as a dataframe-agnostic backend for PyMC, enabling support for multiple dataframe libraries (pandas, polars, dask.dataframe) through a unified interface. The implementation uses a registration mechanism to handle different dataframe backends dynamically.
Key changes:
- Replaces direct pandas usage with narwhals compatibility layer
- Implements singledispatch-based coordinate determination for different data types
- Adds support for polars and dask.dataframe in addition to pandas
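As a rough illustration of what such a registration mechanism can look like (the names `as_tensor_like` and `register_dataframe_backend` below are illustrative, not the functions actually added in this PR), a singledispatch-based converter might be wired up like this:

```python
# Minimal sketch of singledispatch-based backend registration via narwhals.
# All names here are illustrative, not the PR's actual API.
from functools import singledispatch

import narwhals as nw
import numpy as np


@singledispatch
def as_tensor_like(value):
    # Fallback: plain arrays, lists, and scalars.
    return np.asarray(value)


def register_dataframe_backend(native_type):
    """Route a native dataframe/series type through narwhals before conversion."""

    def _convert(value):
        obj = nw.from_native(value, allow_series=True)
        if isinstance(obj, nw.LazyFrame):
            obj = obj.collect()  # make lazy inputs eager before conversion
        return obj.to_numpy()

    as_tensor_like.register(native_type, _convert)
    return _convert


# Backends are registered only if the library is installed, keeping them optional.
try:
    import polars as pl

    register_dataframe_backend(pl.DataFrame)
    register_dataframe_backend(pl.Series)
except ImportError:
    pass
```

Registering per library keeps optional backends truly optional, which matches the note above that dask.dataframe support is added alongside pandas and polars.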
Reviewed Changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| pymc/pytensorf.py | Implements dataframe backend registration and tensor conversion using narwhals |
| pymc/data.py | Refactors coordinate determination with singledispatch and narwhals integration |
| tests/test_pytensorf.py | Extends tests to cover pandas, polars, and dask.dataframe |
| tests/test_data.py | Adds polars-specific tests for series and dataframe coordinate inference |
| requirements.txt | Adds narwhals>=2.11.0 as a core dependency |
| requirements-dev.txt | Adds narwhals>=2.11.0 to development dependencies |
| conda-envs/*.yml | Adds narwhals>=2.11.0 to all conda environment files |
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7964      +/-   ##
==========================================
- Coverage   91.49%   91.45%   -0.04%
==========================================
  Files         116      116
  Lines       18962    19020      +58
==========================================
+ Hits        17349    17395      +46
- Misses       1613     1625      +12
I love this! I will be reviewing this soon :)
    return coords, _handle_none_dims(dims, value.ndim)

def _dataframe_agnostic_coords(
Maybe we can add a simple test for this function?
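For example, something along these lines could work (purely a sketch: it assumes `_dataframe_agnostic_coords` is importable from `pymc.data`, takes `(value, dims)`, and returns the inferred coords first; adjust to the real signature):

```python
# Hypothetical test sketch; the import path, signature, and return order of
# _dataframe_agnostic_coords are assumptions and may need adjusting.
import numpy as np
import polars as pl

from pymc.data import _dataframe_agnostic_coords


def test_dataframe_agnostic_coords_from_column():
    df = pl.DataFrame({"city": ["a", "b", "c"], "y": [1.0, 2.0, 3.0]})
    coords, *_ = _dataframe_agnostic_coords(df, ("city",))
    # The column matching the dim name should be used as labels.
    np.testing.assert_array_equal(np.asarray(coords["city"]), ["a", "b", "c"])
```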
juanitorduz left a comment
LGTM. Just a small comment / request
Description
Continues/completes #7462. I didn't have permission to push into that PR, so I'm opening this one.
The purpose of this PR is to use narwhals as a one-stop-shop dataframe backend. Currently, we use pandas in `data.py` and `pytensorf.py` to allow users to pass dataframe objects into `pm.Data` and `pt.as_tensor`, respectively. I add a narwhals compatibility layer between the input and the PyMC model so that users can bring their data in any form that narwhals supports, provided we register the libraries. (If we could eliminate registration that would also be great, but I wasn't clever enough to figure out the multiple dispatch using only narwhals as a dependency. Maybe @MarcoGorelli could help 👉 👈 )
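As an illustrative-only example of what that compatibility layer buys us (not the exact code in this PR), narwhals exposes the same API regardless of which native library produced the object:

```python
# Illustrative only: the same narwhals calls work for pandas and polars inputs.
import narwhals as nw
import numpy as np
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
pl_df = pl.DataFrame({"x": [1.0, 2.0, 3.0]})

for native in (pd_df, pl_df):
    df = nw.from_native(native)   # uniform wrapper over both backends
    values = df.to_numpy()        # plain ndarray that pytensor can ingest
    assert isinstance(values, np.ndarray)
```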
Some notes:
- The dataframe index is no longer used to determine coords in `pm.Data`. Instead, we look for a column matching that dimension name. If it is found, it is used as labels, and excluded from the values (see the sketch after these notes).
- Narwhals also exposes `nw.LazyFrame` and `nw.LazySeries`. I don't think we can do anything with those at the modeling level (maybe in the future with minibatching?). For now, I'm just calling `.collect()` on them to make them eager.
- I added a `dask.dataframe` backend as an example of how we could extend things.
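A hypothetical usage sketch of the column-matching and lazy-collection behaviour described in these notes (the exact semantics may differ from the final implementation):

```python
# Hypothetical usage sketch; exact semantics of the new pm.Data behaviour
# (column-as-labels, lazy collection) may differ in the final implementation.
import polars as pl
import pymc as pm

df = pl.DataFrame(
    {
        "city": ["Bonn", "Kiel", "Ulm"],   # matches the dim name -> used as labels
        "temperature": [12.3, 9.8, 14.1],  # remaining column becomes the values
    }
)

with pm.Model() as model:
    temperature = pm.Data("temperature", df, dims=("city",))
    # A lazy input such as df.lazy() would be .collect()-ed before conversion.
```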
As of this PR, `pandas` could be made an optional or dev-only dependency for us. I didn't do it right away because I wanted to take people's temperature on the idea.

Related Issue
Checklist
Type of change