
Conversation

@gabe-l-hart
Contributor

Branch: GraniteMoeAsDenseFix

What does this PR do?

With the introduction of modular_granitemoe.py in #40132, the conditional that allowed GraniteMoe to also encapsulate dense models as a degenerate case was accidentally removed. This is never actually needed for the GraniteMoe architecture directly, but GraniteMoe is reused in GraniteMoeShared and then GraniteMoeHybrid, which do need the ability to encapsulate dense FFN blocks in place of the MoE block.
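
For readers less familiar with the pattern, here is a minimal, self-contained sketch of what "dense as a degenerate case of MoE" means at the layer level. It uses made-up toy names (ToyConfig, ToyMoE, ToyDecoderLayer) rather than the actual transformers classes, and it assumes the expert count is controlled by a num_local_experts-style config field:

```python
import torch
import torch.nn as nn
from dataclasses import dataclass


@dataclass
class ToyConfig:
    hidden_size: int = 16
    num_local_experts: int = 0  # 0 => dense FFN, >0 => MoE block


class ToyMoE(nn.Module):
    """Stand-in for a sparse MoE block (router and expert selection omitted)."""

    def __init__(self, config: ToyConfig):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(config.hidden_size, config.hidden_size)
            for _ in range(config.num_local_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Real routing replaced by a plain average over experts.
        return torch.stack([expert(x) for expert in self.experts]).mean(dim=0)


class ToyDecoderLayer(nn.Module):
    def __init__(self, config: ToyConfig):
        super().__init__()
        self.dense_ffn = nn.Linear(config.hidden_size, config.hidden_size)
        # The conditional this PR restores, conceptually: only build the MoE
        # block when experts are configured, so the same layer class can also
        # serve dense checkpoints.
        if config.num_local_experts > 0:
            self.block_sparse_moe = ToyMoE(config)
        else:
            self.block_sparse_moe = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.block_sparse_moe is not None:
            return self.block_sparse_moe(x)
        return self.dense_ffn(x)


dense_layer = ToyDecoderLayer(ToyConfig(num_local_experts=0))
moe_layer = ToyDecoderLayer(ToyConfig(num_local_experts=4))
x = torch.randn(2, 16)
print(dense_layer(x).shape, moe_layer(x).shape)  # both torch.Size([2, 16])
```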

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker I believe this came in with your PR for MoE in vLLM, so I'd love your sanity check on this fix.

@gabe-l-hart
Contributor Author

Looks like I need to regenerate the modeling_* code everywhere.

@gabe-l-hart
Contributor Author

Interestingly, when I run make fix-copies, the python utils/check_modular_conversion.py --fix_and_overwrite script makes far more changes than I would expect from adding this conditional. In particular, for each architecture in the GraniteMoe* chain, it adds a GraniteMoe<qualifier>SparseMoeBlock class, and then in the __init__ for GraniteMoe<qualifier>DecoderLayer it adds both self.block_sparse_moe = GraniteMoe<qualifier>SparseMoeBlock(config) (without any conditional) AND the guarded conditional assignment self.block_sparse_moe = GraniteMoeHybridMoE(config).

Looking a little deeper, this seems to be caused by the inheritance from modular_mixtral.py. I've added a clause that explicitly uses delattr to remove self.block_sparse_moe when not used, but that seems a bit backwards. Alternatively, we could not inherit from MixtralDecoderLayer or we could move the conditional up to MixtralDecoderLayer.__init__.
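
For concreteness, the delattr workaround amounts to something like the following (an illustrative sketch with stand-in names, not the actual diff): the parent class builds the attribute unconditionally, and the subclass tears it back down for the dense configuration.

```python
# Stand-in classes only; in the real code the parent would be the
# Mixtral-derived decoder layer and the attribute a full MoE module.
class ParentDecoderLayer:
    def __init__(self, num_local_experts: int):
        self.block_sparse_moe = object()  # always created by the parent


class ChildDecoderLayer(ParentDecoderLayer):
    def __init__(self, num_local_experts: int):
        super().__init__(num_local_experts)
        if num_local_experts == 0:
            # Feels backwards: undo what the parent __init__ just built.
            delattr(self, "block_sparse_moe")
```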

The part that I'm still confused about is why regenerating the modeling_* files adds these SparseMoeBlock implementations at all. The inheritance from Mixtral was already there prior to my change, so I would expect them to have been added already, unless making the creation of self.block_sparse_moe conditional somehow triggers logic in the generation that requires them to be added?

@gabe-l-hart
Contributor Author

It appears that putting the creation of self.block_sparse_moe behind a conditional does, in fact, trigger the inclusion of those SparseMoeBlock pieces in the generation. I've switched to an inline conditional, which seems to prevent this.
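
In other words, only the form of the guarded assignment changed; both spellings are equivalent at runtime. A rough illustration with stand-in names (the real attribute would hold the MoE module and the check would be on the config):

```python
class LayerWithBlockConditional:
    def __init__(self, num_local_experts: int):
        # Block form of the guard.
        if num_local_experts > 0:
            self.block_sparse_moe = object()  # MoE module stand-in
        else:
            self.block_sparse_moe = None


class LayerWithInlineConditional:
    def __init__(self, num_local_experts: int):
        # Inline (ternary) form: same behavior, written as a single assignment.
        self.block_sparse_moe = object() if num_local_experts > 0 else None
```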

@gabe-l-hart
Contributor Author

🤦 Ok, all of that was because of a bad copy-paste somewhere that had me creating a completely incorrect block for self.block_sparse_moe. Fixed and cleaned up the history.

…erts > 0

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Contributor Author

One more redo: based on advice from @ArthurZucker, since there are no models using either GraniteMoe or GraniteMoeShared with the degenerate dense configuration, it's preferable to only have this conditional override in GraniteMoeHybrid, where it is needed for the various flavors of granite-4.0-* models.
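
For the final shape of the change, a hedged sketch of the class layout (toy stand-ins only; the actual class and attribute names in the modular files may differ): the GraniteMoe and GraniteMoeShared layers keep their unconditional MoE block, and only the Hybrid layer handles the zero-experts dense configuration.

```python
import torch.nn as nn


class ToyMoeDecoderLayer(nn.Module):
    """Stand-in for the GraniteMoe / GraniteMoeShared decoder layers:
    the MoE block remains unconditional, as before this PR."""

    def __init__(self, hidden_size: int, num_local_experts: int):
        super().__init__()
        self.block_sparse_moe = nn.Linear(hidden_size, hidden_size)  # MoE stand-in


class ToyHybridDecoderLayer(ToyMoeDecoderLayer):
    """Stand-in for the GraniteMoeHybrid decoder layer: only here does the
    degenerate dense configuration (zero experts) need to be supported, for
    the granite-4.0-* flavors."""

    def __init__(self, hidden_size: int, num_local_experts: int):
        super().__init__(hidden_size, num_local_experts)
        self.shared_mlp = nn.Linear(hidden_size, hidden_size)  # dense MLP stand-in
        if num_local_experts == 0:
            self.block_sparse_moe = None  # dense flavor: only the MLP path runs
```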

return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


class GraniteFlashAttentionKwargs(TypedDict, total=False):
@gabe-l-hart
Contributor Author

It looks like these just got moved during the regeneration. I'm not sure if they should be included (to enforce consistency with the generation script) or excluded (to minimize the size of the change).

@Rocketknight1
Member

Hi @gabe-l-hart, thanks for the PR! You can get the code style tests to pass with pip install -e .[quality] followed by make fixup or make style.

Overall it looks good to me, and it does seem like the zero-experts case was accidentally deleted. Will wait for @ArthurZucker to confirm before merging!

@gabe-l-hart
Contributor Author

@Rocketknight1 Thanks! I'll get it cleaned up and hopefully green today

@github-actions
Contributor

github-actions bot commented Nov 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: granitemoehybrid
