Conversation

@jessegrabowski
Member

Description

Continues/completes #7462. I didn't have permission to push into that PR, so I'm opening this one.

The purpose of this PR is to use narwhals as a one-stop-shop dataframe backend. Currently, we use pandas in data.py and pytensorf.py to allow users to pass dataframe objects into pm.Data and pt.as_tensor, respectively. I add a narwhals compatibility layer between the input and the PyMC model so that users can bring their data in any form that narwhals supports, provided we register the libraries.

(If we could eliminate registration that would also be great, but I wasn't clever enough to figure out the multiple dispatch using only narwhals as a dependency. Maybe @MarcoGorelli could help 👉 👈 )

Some notes:

  1. Since generalized dataframes don't have a notion of an index, we don't look at the index to find the labels for the left-most dimension provided to pm.Data. Instead, we look for a column matching that dimension name. If it is found, it is used as labels, and excluded from the values.
  2. Narwhals has a lazy API via nw.LazyFrame and nw.LazySeries. I don't think we can do anything with those at the modeling level (maybe in the future with minibatching?). For now, I'm just calling .collect() on them to make them eager.
  3. As mentioned above, we don't get all of narwhals for free yet: the PR as it currently stands forces us to register each backend library we want to support. I don't think this is so bad, because it forces us to write tests for, say, DuckDB if someone comes along and really wants that. But it's a bit ugly. I added a dask.dataframe backend as an example of how we could extend things.
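
To make note 1 concrete, here is a minimal sketch of the column-as-labels rule. It is not the code in this PR: a plain dict of columns stands in for a dataframe, and `split_dim_column` is a hypothetical helper name used only for illustration.

```python
# Illustrative sketch only. A dict of columns stands in for a dataframe, and
# `split_dim_column` is a hypothetical helper, not a function from this PR.


def split_dim_column(columns, dims):
    """Return (coords, value_columns) for a column-oriented table.

    If a column is named like the left-most dimension, its entries become
    that dimension's labels and the column is dropped from the values.
    """
    coords = {}
    value_columns = dict(columns)  # shallow copy, so the input is not mutated
    row_dim = dims[0] if dims else None
    if row_dim is not None and row_dim in value_columns:
        coords[row_dim] = tuple(value_columns.pop(row_dim))
    return coords, value_columns


table = {
    "city": ["Berlin", "Paris", "Rome"],
    "temp": [1.0, 2.0, 3.0],
    "rain": [0.1, 0.2, 0.3],
}
coords, values = split_dim_column(table, dims=("city", "variable"))
```

Here `coords` holds the `city` labels and `values` keeps only `temp` and `rain`. When no column matches the dimension name, the sketch returns no labels and leaves every column in the values.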

As of this PR, pandas could be made an optional or dev-only dependency for us. I didn't do it right away because I wanted to take people's temperature on the idea.

Related Issue

Checklist

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

Copilot AI left a comment

Pull Request Overview

This PR introduces narwhals as a dataframe-agnostic backend for PyMC, enabling support for multiple dataframe libraries (pandas, polars, dask.dataframe) through a unified interface. The implementation uses a registration mechanism to handle different dataframe backends dynamically.

Key changes:

  • Replaces direct pandas usage with narwhals compatibility layer
  • Implements singledispatch-based coordinate determination for different data types
  • Adds support for polars and dask.dataframe in addition to pandas
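
The registration mechanism described above can be sketched with `functools.singledispatch` from the standard library. This is a simplified illustration under stated assumptions, not the PR's implementation: `ToyFrame`, `determine_labels`, and `register_backend` are hypothetical names, and a dict of columns again stands in for a real dataframe.

```python
from functools import singledispatch


class ToyFrame:
    """Hypothetical stand-in for a third-party dataframe class."""

    def __init__(self, columns):
        self.columns = columns


@singledispatch
def determine_labels(value, dim_name):
    # Fallback for unregistered types: no labels can be inferred.
    return None


@determine_labels.register(dict)
def _(value, dim_name):
    # Dict-of-columns case: use the column named like the dimension, if any.
    labels = value.get(dim_name)
    return None if labels is None else tuple(labels)


def register_backend(frame_type, to_columns):
    """Register `frame_type` by converting it to a dict of columns first."""

    @determine_labels.register(frame_type)
    def _(value, dim_name):
        return determine_labels(to_columns(value), dim_name)


# Each supported backend has to be registered explicitly.
register_backend(ToyFrame, lambda frame: frame.columns)

labels = determine_labels(ToyFrame({"city": ["a", "b"]}), "city")
```

Because dispatch is keyed on the concrete class, a type that was never registered falls through to the default branch, which is why every supported library needs its own registration and its own tests.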

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| pymc/pytensorf.py | Implements dataframe backend registration and tensor conversion using narwhals |
| pymc/data.py | Refactors coordinate determination with singledispatch and narwhals integration |
| tests/test_pytensorf.py | Extends tests to cover pandas, polars, and dask.dataframe |
| tests/test_data.py | Adds polars-specific tests for series and dataframe coordinate inference |
| requirements.txt | Adds narwhals>=2.11.0 as a core dependency |
| requirements-dev.txt | Adds narwhals>=2.11.0 to development dependencies |
| conda-envs/*.yml | Adds narwhals>=2.11.0 to all conda environment files |

@codecov

codecov bot commented Nov 17, 2025

Codecov Report

❌ Patch coverage is 80.20833% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.45%. Comparing base (869503b) to head (5b7b38a).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pymc/data.py | 77.02% | 17 Missing ⚠️ |
| pymc/pytensorf.py | 90.90% | 2 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7964      +/-   ##
==========================================
- Coverage   91.49%   91.45%   -0.04%     
==========================================
  Files         116      116              
  Lines       18962    19020      +58     
==========================================
+ Hits        17349    17395      +46     
- Misses       1613     1625      +12     
| Files with missing lines | Coverage Δ |
| --- | --- |
| pymc/pytensorf.py | 88.11% <90.90%> (+<0.01%) ⬆️ |
| pymc/data.py | 82.85% <77.02%> (-2.22%) ⬇️ |

@juanitorduz
Contributor

juanitorduz commented Nov 19, 2025

I love this! I will be reviewing this soon :)

return coords, _handle_none_dims(dims, value.ndim)


def _dataframe_agnostic_coords(
Contributor

Maybe we can add a simple test for this function?

Contributor

@juanitorduz juanitorduz left a comment

LGTM. Just a small comment / request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: Replace pandas dependence/use with narwhals

3 participants