
Conversation

@gabe-l-hart
Contributor

Branch: GraniteMoeAsDenseFix

What does this PR do?

With the introduction of modular_granitemoe.py in #40132, the conditional that allowed GraniteMoe to also encapsulate dense models as a degenerate case was accidentally removed. This is never actually needed for the GraniteMoe architecture directly, but GraniteMoe is reused in GraniteMoeShared and then GraniteMoeHybrid, which do need the ability to encapsulate dense FFN blocks in place of the MoE block.
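
For readers less familiar with the pattern, here is a minimal, self-contained sketch of what "dense as a degenerate case of MoE" means at the layer level. It uses made-up toy names (ToyConfig, ToyMoE, ToyDecoderLayer) rather than the actual transformers classes, and it assumes the expert count is controlled by a num_local_experts-style config field:

```python
import torch
import torch.nn as nn
from dataclasses import dataclass


@dataclass
class ToyConfig:
    hidden_size: int = 16
    num_local_experts: int = 0  # 0 => dense FFN, >0 => MoE block


class ToyMoE(nn.Module):
    """Stand-in for a sparse MoE block (router and expert selection omitted)."""

    def __init__(self, config: ToyConfig):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(config.hidden_size, config.hidden_size)
            for _ in range(config.num_local_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Real routing replaced by a plain average over experts.
        return torch.stack([expert(x) for expert in self.experts]).mean(dim=0)


class ToyDecoderLayer(nn.Module):
    def __init__(self, config: ToyConfig):
        super().__init__()
        self.dense_ffn = nn.Linear(config.hidden_size, config.hidden_size)
        # The conditional this PR restores, conceptually: only build the MoE
        # block when experts are configured, so the same layer class can also
        # serve dense checkpoints.
        if config.num_local_experts > 0:
            self.block_sparse_moe = ToyMoE(config)
        else:
            self.block_sparse_moe = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.block_sparse_moe is not None:
            return self.block_sparse_moe(x)
        return self.dense_ffn(x)


dense_layer = ToyDecoderLayer(ToyConfig(num_local_experts=0))
moe_layer = ToyDecoderLayer(ToyConfig(num_local_experts=4))
x = torch.randn(2, 16)
print(dense_layer(x).shape, moe_layer(x).shape)  # both torch.Size([2, 16])
```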

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker I believe this came in with your PR for MoE in vLLM, so I'd love your sanity check on this fix.

@gabe-l-hart
Contributor Author

Looks like I need to regenerate the modeling_* code everywhere.

@gabe-l-hart
Contributor Author

Interestingly, when I run make fix-copies, the python utils/check_modular_conversion.py --fix_and_overwrite script makes far more changes than I would expect from adding this conditional. In particular, for each architecture in the GraniteMoe* chain, it adds a GraniteMoe<qualifier>SparseMoeBlock class, and then in the __init__ for GraniteMoe<qualifier>DecoderLayer it adds both self.block_sparse_moe = GraniteMoe<qualifier>SparseMoeBlock(config) (without any conditional) AND the guarded conditional assignment self.block_sparse_moe = GraniteMoeHybridMoE(config).

Looking a little deeper, this seems to be caused by the inheritance from modular_mixtral.py. I've added a clause that explicitly uses delattr to remove self.block_sparse_moe when not used, but that seems a bit backwards. Alternatively, we could not inherit from MixtralDecoderLayer or we could move the conditional up to MixtralDecoderLayer.__init__.
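
For concreteness, the delattr workaround amounts to something like the following (an illustrative sketch with stand-in names, not the actual diff): the parent class builds the attribute unconditionally, and the subclass tears it back down for the dense configuration.

```python
# Stand-in classes only; in the real code the parent would be the
# Mixtral-derived decoder layer and the attribute a full MoE module.
class ParentDecoderLayer:
    def __init__(self, num_local_experts: int):
        self.block_sparse_moe = object()  # always created by the parent


class ChildDecoderLayer(ParentDecoderLayer):
    def __init__(self, num_local_experts: int):
        super().__init__(num_local_experts)
        if num_local_experts == 0:
            # Feels backwards: undo what the parent __init__ just built.
            delattr(self, "block_sparse_moe")
```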

The part that I'm still confused about is why regenerating the modeling_* files adds these SparseMoeBlock implementations at all. The inheritance from Mixtral was already there prior to my change, so I would expect them to have been added already, unless making the creation of self.block_sparse_moe conditional somehow triggers logic in the generation that requires them to be added?

@gabe-l-hart
Contributor Author

It appears that putting the creation of self.block_sparse_moe behind a conditional does, in fact, trigger the inclusion of those SparseMoeBlock pieces in the generation. I've switched to an inline conditional, which seems to prevent this.
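
In other words, only the form of the guarded assignment changed; both spellings are equivalent at runtime. A rough illustration with stand-in names (the real attribute would hold the MoE module and the check would be on the config):

```python
class LayerWithBlockConditional:
    def __init__(self, num_local_experts: int):
        # Block form of the guard.
        if num_local_experts > 0:
            self.block_sparse_moe = object()  # MoE module stand-in
        else:
            self.block_sparse_moe = None


class LayerWithInlineConditional:
    def __init__(self, num_local_experts: int):
        # Inline (ternary) form: same behavior, written as a single assignment.
        self.block_sparse_moe = object() if num_local_experts > 0 else None
```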

@gabe-l-hart
Contributor Author

🤦 Ok, all of that was because of a bad copy-paste somewhere that had me creating a completely incorrect block for self.block_sparse_moe. Fixed and cleaned up the history.

…erts > 0

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Contributor Author

One more redo: based on advice from @ArthurZucker, since there are no models using either GraniteMoe or GraniteMoeShared with the degenerate dense configuration, it's preferable to only have this conditional override in GraniteMoeHybrid, where it is needed for the various flavors of granite-4.0-* models.
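
For the final shape of the change, a hedged sketch of the class layout (toy stand-ins only; the actual class and attribute names in the modular files may differ): the GraniteMoe and GraniteMoeShared layers keep their unconditional MoE block, and only the Hybrid layer handles the zero-experts dense configuration.

```python
import torch.nn as nn


class ToyMoeDecoderLayer(nn.Module):
    """Stand-in for the GraniteMoe / GraniteMoeShared decoder layers:
    the MoE block remains unconditional, as before this PR."""

    def __init__(self, hidden_size: int, num_local_experts: int):
        super().__init__()
        self.block_sparse_moe = nn.Linear(hidden_size, hidden_size)  # MoE stand-in


class ToyHybridDecoderLayer(ToyMoeDecoderLayer):
    """Stand-in for the GraniteMoeHybrid decoder layer: only here does the
    degenerate dense configuration (zero experts) need to be supported, for
    the granite-4.0-* flavors."""

    def __init__(self, hidden_size: int, num_local_experts: int):
        super().__init__(hidden_size, num_local_experts)
        self.shared_mlp = nn.Linear(hidden_size, hidden_size)  # dense MLP stand-in
        if num_local_experts == 0:
            self.block_sparse_moe = None  # dense flavor: only the MLP path runs
```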

return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


class GraniteFlashAttentionKwargs(TypedDict, total=False):
@gabe-l-hart
Contributor Author

It looks like these just got moved during the regeneration. I'm not sure if they should be included (to enforce consistency with the generation script) or excluded (to minimize the size of the change).

@Rocketknight1
Member

Hi @gabe-l-hart, thanks for the PR! You can get the code style tests to pass with pip install -e .[quality] followed by make fixup or make style.

Overall it looks good to me, and it does seem like the zero-experts case was accidentally deleted. Will wait for @ArthurZucker to confirm before merging!

@gabe-l-hart
Contributor Author

@Rocketknight1 Thanks! I'll get it cleaned up and hopefully green today

@github-actions
Contributor

github-actions bot commented Nov 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: granitemoehybrid
