Conversation

@jessegrabowski (Member) commented May 15, 2025

Description

A pain point for me when testing different algorithms (e.g. MCMC vs VI) is that I don't want to write a second version of the model with pm.Minibatch on the data.

This PR adds a model transformation that applies pm.Minibatch to the data for the user. It's the reverse of the remove_minibatched_nodes transformer that @zaxtax implemented recently.

This is a WIP; it doesn't actually work yet, because I can't figure out how to rebuild the observed variable with total_size set correctly. Help wanted.
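
For illustration, here is the duplication this transform is meant to avoid. minibatch_model below is a placeholder name for the proposed transform, not a final API:

import numpy as np
import pymc as pm

x_data = np.random.normal(size=(1000,))

# Full-data version, e.g. for MCMC
with pm.Model() as model:
    mu = pm.Normal("mu")
    pm.Normal("obs", mu=mu, sigma=1, observed=x_data)

# Today a second, hand-written version is needed for minibatch VI...
mb_x = pm.Minibatch(x_data, batch_size=100)
with pm.Model() as mb_model:
    mu = pm.Normal("mu")
    pm.Normal("obs", mu=mu, sigma=1, observed=mb_x, total_size=1000)

# ...whereas the transform would derive it automatically (placeholder):
# mb_model = minibatch_model(model, batch_size=100)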

Related Issue

  • Closes #
  • Related to #

Checklist

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

📚 Documentation preview 📚: https://pymc--7785.org.readthedocs.build/en/7785/

@jessegrabowski requested a review from zaxtax on May 15, 2025 12:28
@ricardoV94 (Member)

> This is a WIP; it doesn't actually work yet, because I can't figure out how to rebuild the observed variable with total_size set correctly. Help wanted.

You can use the lower-level utility:

def create_minibatch_rv(

and then make that a vanilla observed RV.
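
A minimal sketch of that suggestion; the variables rv and total_size here are assumptions standing in for the PR's actual graph objects:

from pymc.variational.minibatch_rv import create_minibatch_rv

# `rv` is the raw random variable to be observed and `total_size` the
# length of the full data along the minibatched dimension; the returned
# variable has its logp rescaled by total_size / batch_size.
minibatched_rv = create_minibatch_rv(rv, total_size)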

@ricardoV94 (Member)

Ah, you already did that, so your question is how to get total_size? Grab the batch shape of the variable and constant-fold it, without raising if it can't be fully folded.
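
Something along these lines, presumably (a sketch using the constant_fold helper from pymc.pytensorf; rv stands in for the observed RV):

from pymc.pytensorf import constant_fold

# Fold the batch dim of the observed RV to a constant when possible;
# with raise_not_constant=False a symbolic value is returned instead
# of raising when the shape can't be fully folded.
[total_size] = constant_fold([rv.shape[0]], raise_not_constant=False)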

@jessegrabowski (Member, Author)

My real issue was not understanding which of these needs to be the key and which the value in the replacements:

  1. The model variable
  2. The memo variable
  3. The fgraph variable
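
For future readers, roughly how the three relate in the fgraph round-trip (a sketch; model is an existing pm.Model and the data variable name "x" is hypothetical):

from pymc.model.fgraph import fgraph_from_model, model_from_fgraph

fgraph, memo = fgraph_from_model(model)

# memo maps the original model variables (1) to their fgraph clones (3),
# so replacements are expressed between fgraph-side variables: the key
# is memo[model_var], not the model variable itself.
old_data = memo[model["x"]]
new_data = old_data[:100]  # stand-in for the Minibatch slice
fgraph.replace(old_data, new_data, import_missing=True)

new_model = model_from_fgraph(fgraph)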

@ricardoV94 (Member) commented May 15, 2025

The best approach is usually to replace the whole fgraph ModelObservedRV by a new one. You probably have to discard any dims on the batch dimension, which are an input to that op.

@jessegrabowski (Member, Author)

I don't really understand what that answer means

@ricardoV94 (Member)

dprint the fgraph and it will perhaps be more obvious what I am mumbling about

@jessegrabowski (Member, Author)

The problem I was running into was that I ended up with two beta RVs after doing the replace. Beta was the only RV implicated in the ModelObservedRV sub-graph.

@zaxtax (Contributor) commented May 15, 2025 via email

@zaxtax force-pushed the model-to-minibatch branch from c1168de to 8d1b479 on June 9, 2025 12:52
minibatch_vars = Minibatch(*data_vars, batch_size=batch_size)
replacements = {datum: minibatch_vars[i] for i, datum in enumerate(data_vars)}
assert 0
# Add total_size to all observed RVs
Member

Shouldn't this only add total_size to the RVs that depend on the minibatched data?

Member

The correct thing would be a dim analysis like we do for MarginalModel, to confirm that the first dim of the data maps to the first dim of the observed RVs, which is when the rewrite is valid. We may not want to do that, but we should be clear about the assumptions in the docstrings.

An example where the minibatch rewrite will fail / do the wrong thing is if you transpose the data before using it in the observations.
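
For instance (a hypothetical model showing the failure mode):

import numpy as np
import pymc as pm

with pm.Model() as model:
    data = pm.Data("data", np.zeros((5, 100)))
    # The observations run along the data's *second* dimension, so
    # minibatching the first dimension of `data` would slice features
    # rather than observations, and total_size would rescale the wrong dim.
    pm.Normal("obs", mu=0, sigma=1, observed=data.T)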

replacements = {datum: minibatch_vars[i] for i, datum in enumerate(data_vars)}
assert 0
# Add total_size to all observed RVs
total_size = data_vars[0].get_value().shape[0]
Member

total_size can be symbolic, I think?
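
e.g., instead of reading the shape eagerly via get_value(), a sketch of the symbolic alternative:

# Symbolic length of the full data along the minibatched dimension;
# this stays correct if the user later resizes the data with pm.set_data.
total_size = data_vars[0].shape[0]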


data_vars = [
memo[datum].owner.inputs[0]
for datum in (model.named_vars[datum_name] for datum_name in model.named_vars)
Member

There's a model.data_vars. You should, however, allow users to specify which data vars should be minibatched (defaulting to all is fine). Alternatively, we could restrict this to models with dims, and have the user tell us which dim is being minibatched? That would make the graph analysis easier, as sketched below.
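
A possible signature along those lines (names are illustrative, not a final API):

from collections.abc import Sequence

from pymc import Model

def minibatch_model(
    model: Model,
    batch_size: int,
    data_vars: Sequence[str] | None = None,  # None -> minibatch all data vars
) -> Model:
    """Return a copy of `model` with the given data variables minibatched."""
    ...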

@zaxtax (Contributor) commented Jun 11, 2025 via email

@codecov (bot) commented Nov 16, 2025

Codecov Report

❌ Patch coverage is 1.81818% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.22%. Comparing base (869503b) to head (0fbc7d9).
⚠️ Report is 1 commit behind head on main.

Files with missing lines          | Patch % | Lines
pymc/model/transform/minibatch.py |   0.00% | 54 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #7785      +/-   ##
==========================================
- Coverage   91.49%   91.22%   -0.27%     
==========================================
  Files         116      117       +1     
  Lines       18962    18999      +37     
==========================================
- Hits        17349    17332      -17     
- Misses       1613     1667      +54     
Files with missing lines          | Coverage Δ
pymc/model/transform/basic.py     | 95.00% <100.00%> (-2.30%) ⬇️
pymc/model/transform/minibatch.py |  0.00% <0.00%> (ø)

pymc/data.py Outdated
Comment on lines 95 to 96
# FIXME: __props__ should not be empty
__props__ = ()
Member

The underlying issue is that OpFromGraph doesn't have equality implemented: pymc-devs/pytensor#1606
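
For reference, PyTensor Ops derive __eq__ and __hash__ from __props__, so an empty tuple makes any two instances of the same Op class compare equal. A toy illustration, not the MinibatchRV code:

from pytensor.graph.op import Op

class ScaleOp(Op):
    __props__ = ("factor",)  # equality/hash are derived from these attributes

    def __init__(self, factor):
        self.factor = factor

    def make_node(self, x): ...
    def perform(self, node, inputs, output_storage): ...

assert ScaleOp(2) == ScaleOp(2)  # same props -> equal
assert ScaleOp(2) != ScaleOp(3)
# With __props__ = (), all instances of the class would compare equal.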

Member Author

You know this, but for future readers: the reason these lines were added in this PR was to let the assert_model_equality check pass on models with MinibatchRV, which is an OpFromGraph.

Member

Yeah, but we should remove this from the PR and test differently.

Member Author

sure

@ricardoV94 (Member) commented Nov 17, 2025

I pushed intermediate changes that I'll clean up later; probably broken atm.

@ricardoV94 (Member)

I pulled changes from pymc-devs/pymc-extras#211

In general this rewrite is type-unsafe: if the variable you're trying to minibatch has a static shape, you can't apply the rewrite directly. Instead of failing outright, I added the functionality from that draft PR to rebuild the graph when types change. This is very useful functionality that, together with toposort_replace, we probably want to upstream to PyTensor later.
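
A rough sketch of that rebuild idea (assumed names; the actual helper lives in this PR):

from pytensor.graph.basic import Variable
from pytensor.graph.fg import FunctionGraph

def rebuild_with_replacements(
    fgraph: FunctionGraph, replacements: dict[Variable, Variable]
) -> list[Variable]:
    """Rebuild the graph in topological order, tolerating type changes."""
    for node in fgraph.toposort():
        new_inputs = [replacements.get(inp, inp) for inp in node.inputs]
        if new_inputs != node.inputs:
            # strict=False allows output types to change, e.g. a static
            # shape becoming unknown once an input is minibatched
            new_node = node.clone_with_new_inputs(new_inputs, strict=False)
            for old, new in zip(node.outputs, new_node.outputs):
                replacements[old] = new
    return [replacements.get(out, out) for out in fgraph.outputs]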

This can also be used in the pre-existing remove_minibatch transform, which was doing clone_replace. That rewrite wasn't complete: it didn't remove minibatch RVs, which it should do as well. To do that properly we need the new implementation, because we also need to substitute the non-minibatched variables into the minibatch RV graph.

I'll clean everything up. One small thing is still failing in the new code.
