glm4.5 support mtp with piecewise graph #4258
base: v0.11.0-dev
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for glm4.5 with multi-token prediction (MTP). The changes primarily involve configuration updates and model loading adjustments. I've identified a critical issue in mtp_proposer.py where the model loading is hardcoded, which would break functionality for other MTP models. I've also noted a high-severity maintainability concern in patch_config.py due to significant code duplication and have recommended a refactoring to a more data-driven approach.
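For context, an offline invocation that exercises this feature might look roughly like the sketch below. The checkpoint name, the glm4_moe_mtp method string, and num_speculative_tokens=1 are assumptions inferred from this PR and from how existing MTP models are enabled in vLLM; it is not a verified command from the PR itself.

from vllm import LLM, SamplingParams

# Hypothetical setup: "zai-org/GLM-4.5" is a placeholder checkpoint, and
# "glm4_moe_mtp" mirrors the method string added by this PR. GLM4 MTP
# exposes a single draft layer, hence num_speculative_tokens=1. Parallelism
# settings needed for a model of this size are omitted.
llm = LLM(
    model="zai-org/GLM-4.5",
    speculative_config={
        "method": "glm4_moe_mtp",
        "num_speculative_tokens": 1,
    },
)
outputs = llm.generate(["What is multi-token prediction?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)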
 else:
-    self.model = DeepSeekMTP(
+    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
+    self.model = Glm4MoeMTP(
         vllm_config=self.vllm_config).to(target_device)
This change hardcodes Glm4MoeMTP as the model for the MtpProposer. However, MtpProposer is now used for multiple methods (deepseek_mtp, qwen3_next_mtp, glm4_moe_mtp). This will cause incorrect model loading for methods other than glm4_moe_mtp.
The model should be selected dynamically based on the speculative decoding method. You can get the method from self.vllm_config.speculative_config.method.
else:
    method = self.vllm_config.speculative_config.method
    if method == "deepseek_mtp":
        self.model = DeepSeekMTP(
            vllm_config=self.vllm_config).to(target_device)
    elif method == "glm4_moe_mtp":
        from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
        self.model = Glm4MoeMTP(
            vllm_config=self.vllm_config).to(target_device)
    else:
        raise NotImplementedError(
            f"MTP method '{method}' is not supported in eager mode."
        )

elif (self.draft_model_config.hf_config.model_type ==
      "glm4_moe_mtp"):
    self.method = "glm4_moe_mtp"
    if self.num_speculative_tokens > 1:
        logger.warning(
            "All GLM4 MTP models only have " \
            "one layer. Might need some code changes " \
            "to support multiple layers."
        )
This elif block is almost identical to several others for different MTP models (e.g., deepseek_mtp, ernie_mtp, qwen3_next_mtp). This introduces significant code duplication, making the code harder to read and maintain.
Consider refactoring this if/elif chain into a data-driven approach. You could use a dictionary to map model types to their corresponding method and warning message components. This would eliminate the repeated logic and make it easier to add or modify support for MTP models in the future.
For example:
MTP_CONFIGS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek"),
    "ernie_mtp": ("ernie_mtp", "Ernie"),
    "glm4_moe_mtp": ("glm4_moe_mtp", "GLM4"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next"),
    "longcat_flash_mtp": ("longcat_flash_mtp", "LongCat"),
}

model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_CONFIGS:
    method, model_name = MTP_CONFIGS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} MTP models only have one layer. "
            "Might need some code changes to support multiple layers."
        )
# ... then handle other non-MTP cases

While a full refactoring is a larger change, applying this pattern would greatly improve code quality.
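The same table-driven idea could also resolve the hardcoded model loading flagged in mtp_proposer.py above. A minimal sketch follows, assuming each draft model class accepts a vllm_config keyword argument and can be moved with .to(target_device) as in the diff; only DeepSeekMTP and Glm4MoeMTP appear in this PR, the deepseek_mtp import path is assumed from vLLM's module layout, and the registry and helper names are illustrative:

from vllm.model_executor.models.deepseek_mtp import DeepSeekMTP
from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP

# Illustrative registry: speculative_config.method -> draft model class.
# Extend with further entries as additional MTP models gain support.
MTP_MODEL_CLASSES = {
    "deepseek_mtp": DeepSeekMTP,
    "glm4_moe_mtp": Glm4MoeMTP,
}

def build_mtp_model(vllm_config, target_device):
    method = vllm_config.speculative_config.method
    model_cls = MTP_MODEL_CLASSES.get(method)
    if model_cls is None:
        raise NotImplementedError(
            f"MTP method '{method}' is not supported in eager mode.")
    return model_cls(vllm_config=vllm_config).to(target_device)

Registering new MTP models in one place for both configuration and loading would keep the two code paths from drifting apart.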
won't merge for dev branch.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?