-
Notifications
You must be signed in to change notification settings - Fork 13.6k
# Add Megrez-MoE Architecture Support ggml-org#16724 #17052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pwilkin @filippide |
src/llama-graph.cpp
Outdated
| // For Megrez: sigmoid THEN add bias (not the other way around!) | ||
| normalized_logits = ggml_sigmoid(ctx0, logits); // [n_expert, n_tokens] | ||
| cb(normalized_logits, "ffn_moe_logits_normalize", il); | ||
| probs = ggml_add(ctx0, normalized_logits, exp_probs_b); // Add bias AFTER sigmoid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about ffn_moe_probs_biased in build_moe_ffn ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question!
The key difference is that Megrez-MoE uses sigmoid + bias instead of softmax, with a different computation order.
In standard build_moe_ffn, the biased probabilities are: probs_biased = softmax(logits + bias) with bias BEFORE softmax.
But Megrez-MoE requires: probs = sigmoid(logits) + bias with bias AFTER sigmoid.
The bias is added after the activation function (not before), and we use sigmoid (per-expert independent scores) instead of softmax (normalized distribution). This is specific to the Megrez-MoE architecture design.
That's why I couldn't reuse ffn_moe_probs_biased directly - both the activation function and computation order are fundamentally different.
Thank's!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what ffn_moe_probs_biased does though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In standard build_moe_ffn, the biased probabilities are: probs_biased = softmax(logits + bias) with bias BEFORE softmax.
This is ffn_moe_logits_biased, not ffn_moe_probs_biased
|
Seems this has been badly merged. |
Implements complete support for Megrez-MoE (Mixture of Experts) models: - Add LLM_ARCH_MEGREZ_MOE architecture enum and mappings - Implement build_mergez_moe_ffn() with sigmoid+bias gating - Add llm_build_megrez_moe class for full model graph construction - Support 31-layer architecture (layer 0: dense FFN, layers 1-30: MoE) - Implement expert sharing pattern with 64 experts, 6 used per token, 4 shared - Load all model hyperparameters and 372 tensors correctly - Configure NEOX RoPE type for proper positional encoding Tested with Megrez2-3x7B-A3B_Q4_K_M.gguf model. All 39 llama.cpp tests pass successfully. Output verified to match infinigence/llama.cpp reference implementation. Note: Use --no-warmup flag to avoid warmup memory allocation issue.
Megrez-MoE creates many intermediate tensors during MoE FFN construction: - sigmoid, add, reshape (3x), get_rows, sum_rows, div, view_2d, mul_mat operations - ggml_top_k internally calls ggml_argsort + ggml_view_4d (2 more tensors per layer) - Each of 30 MoE layers creates ~35 intermediate tensors during graph construction During warmup, the graph is built 3 times with different batch sizes, requiring sufficient memory pool space for all intermediate tensors. Add 4096 node overhead for LLM_ARCH_MEGREZ_MOE to accommodate these intermediate tensors (30 layers × 35 tensors/layer ≈ 1050 nodes, doubled for safety margin). This fixes the 'not enough space in the context's memory pool' error during warmup, allowing Megrez-MoE to work without the --no-warmup flag. Tested: - All 39 tests pass - Megrez-MoE works with warmup enabled (no crashes) - Other models (e.g., Gemma-2) are unaffected - Verified with outputs up to 100 tokens
- Move llm_build_megrez_moe from llama-model.cpp to src/models/megrez-moe.cpp - Add declaration to src/models/models.h - Update CMakeLists.txt to include megrez-moe.cpp in build - Resolve merge conflicts in llama-arch.cpp and llama-model.cpp - Fix PANGU_EMBED case statement closing braces The model loads successfully, all tests pass (40/40), and inference works correctly.
…oe_ffn - Remove custom build_mergez_moe_ffn implementation (100+ lines) - Use existing build_moe_ffn with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID - Pre-compute gate logits from pre_gate_hidden (Megrez-MoE's unique gating) - Pass pre-computed logits via probs_in parameter - Maintain exact same behavior and output quality This addresses review feedback to reuse existing MoE infrastructure instead of duplicating code. The sigmoid gating + bias after activation is already supported by build_moe_ffn.
0989d06 to
053234f
Compare
Better, but there's still some fixes to be made, check the diffs in |
- Restore PANGU_EMBED and COGVLM tensor mappings in llama-arch.cpp - Remove extra blank line in llama-context.cpp
|
Hi @CISC, Thanks for the review! I've fixed the merge issues in commit. |
Still missing conversion code. :) |
|
Hi @CISC, Thank you for your feedback! Thank you for your consideration! |
No, conversion code must be included in the PR. |
|
| // Use standard build_moe_ffn but with pre-computed gate logits | ||
| ggml_tensor * moe_out = build_moe_ffn(cur, | ||
| model.layers[il].ffn_gate_inp, | ||
| model.layers[((il - 1) / (3) * (3)) + 1].ffn_up_exps, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is ((il - 1) / (3) * (3)) + 1 different from il ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because il is an integer, the division /3 performs integer division — so this expression effectively groups layers in blocks of 3 and returns the first layer index of the block.
|
You are making what appears to be complete random changes that have nothing to do with this model, you can open this PR again once you've sorted it out properly. |
|
Got it, thanks for the feedback. I’ll clean up the unrelated changes and make sure the PR only includes the model-specific parts for MegrezMoE. Once it’s properly scoped, I’ll reopen it. |
|
Hi @CISC, Fixed the conversion code and tested with a real model (Infinigence/Megrez2-3x7B-A3B).
Could you please reopen this PR? Thanks! |
Sorry, GitHub gives this error, guess you'll have to resubmit after all: |
Thanks for trying to reopen! Would you prefer: I open a new PR with the fixed code? Thanks! |
Yes. |
Summary
This PR adds full support for the Megrez-MoE (Mixture of Experts) architecture to llama.cpp, enabling inference on Megrez2-3x7B models and similar MoE variants.
Architecture Details
Megrez-MoE is a Mixture of Experts architecture with:
Changes Made
1. Architecture Registration
LLM_ARCH_MEGREZ_MOEarchitecture enum2. MoE FFN Implementation
Implemented
build_mergez_moe_ffn()with:ggml_top_k()3. Model Loading
llm_build_megrez_moeclass4. Graph Memory Fix
graph_max_nodes()for Megrez-MoE (30 layers × 35 tensors ≈ 1050 nodes, doubled for safety)Testing
Comparison