glm4.5 support mtp with piecewise graph #4258
base: v0.11.0-dev
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for glm4.5 with multi-token prediction (MTP). The changes primarily involve configuration updates and model loading adjustments. I've identified a critical issue in mtp_proposer.py where the model loading is hardcoded, which would break functionality for other MTP models. I've also noted a high-severity maintainability concern in patch_config.py due to significant code duplication and have recommended a refactoring to a more data-driven approach.
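For context, an offline invocation that exercises this feature might look roughly like the sketch below. The checkpoint name, the glm4_moe_mtp method string, and num_speculative_tokens=1 are assumptions inferred from this PR and from how existing MTP models are enabled in vLLM; it is not a verified command from the PR itself.

from vllm import LLM, SamplingParams

# Hypothetical setup: "zai-org/GLM-4.5" is a placeholder checkpoint, and
# "glm4_moe_mtp" mirrors the method string added by this PR. GLM4 MTP
# exposes a single draft layer, hence num_speculative_tokens=1. Parallelism
# settings needed for a model of this size are omitted.
llm = LLM(
    model="zai-org/GLM-4.5",
    speculative_config={
        "method": "glm4_moe_mtp",
        "num_speculative_tokens": 1,
    },
)
outputs = llm.generate(["What is multi-token prediction?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)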
 else:
-    self.model = DeepSeekMTP(
+    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
+    self.model = Glm4MoeMTP(
         vllm_config=self.vllm_config).to(target_device)
This change hardcodes Glm4MoeMTP as the model for the MtpProposer. However, MtpProposer is now used for multiple methods (deepseek_mtp, qwen3_next_mtp, glm4_moe_mtp). This will cause incorrect model loading for methods other than glm4_moe_mtp.
The model should be selected dynamically based on the speculative decoding method. You can get the method from self.vllm_config.speculative_config.method.
else:
    method = self.vllm_config.speculative_config.method
    if method == "deepseek_mtp":
        self.model = DeepSeekMTP(
            vllm_config=self.vllm_config).to(target_device)
    elif method == "glm4_moe_mtp":
        from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
        self.model = Glm4MoeMTP(
            vllm_config=self.vllm_config).to(target_device)
    else:
        raise NotImplementedError(
            f"MTP method '{method}' is not supported in eager mode."
        )

elif (self.draft_model_config.hf_config.model_type ==
      "glm4_moe_mtp"):
    self.method = "glm4_moe_mtp"
    if self.num_speculative_tokens > 1:
        logger.warning(
            "All GLM4 MTP models only have " \
            "one layer. Might need some code changes " \
            "to support multiple layers."
        )
This elif block is almost identical to several others for different MTP models (e.g., deepseek_mtp, ernie_mtp, qwen3_next_mtp). This introduces significant code duplication, making the code harder to read and maintain.
Consider refactoring this if/elif chain into a data-driven approach. You could use a dictionary to map model types to their corresponding method and warning message components. This would eliminate the repeated logic and make it easier to add or modify support for MTP models in the future.
For example:
MTP_CONFIGS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek"),
    "ernie_mtp": ("ernie_mtp", "Ernie"),
    "glm4_moe_mtp": ("glm4_moe_mtp", "GLM4"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next"),
    "longcat_flash_mtp": ("longcat_flash_mtp", "LongCat"),
}

model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_CONFIGS:
    method, model_name = MTP_CONFIGS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} MTP models only have one layer. "
            "Might need some code changes to support multiple layers."
        )
# ... then handle other non-MTP cases

While a full refactoring is a larger change, applying this pattern would greatly improve code quality.
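The same table-driven idea could also resolve the hardcoded model loading flagged in mtp_proposer.py above. A minimal sketch follows, assuming each draft model class accepts a vllm_config keyword argument and can be moved with .to(target_device) as in the diff; only DeepSeekMTP and Glm4MoeMTP appear in this PR, the deepseek_mtp import path is assumed from vLLM's module layout, and the registry and helper names are illustrative:

from vllm.model_executor.models.deepseek_mtp import DeepSeekMTP
from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP

# Illustrative registry: speculative_config.method -> draft model class.
# Extend with further entries as additional MTP models gain support.
MTP_MODEL_CLASSES = {
    "deepseek_mtp": DeepSeekMTP,
    "glm4_moe_mtp": Glm4MoeMTP,
}

def build_mtp_model(vllm_config, target_device):
    method = vllm_config.speculative_config.method
    model_cls = MTP_MODEL_CLASSES.get(method)
    if model_cls is None:
        raise NotImplementedError(
            f"MTP method '{method}' is not supported in eager mode.")
    return model_cls(vllm_config=vllm_config).to(target_device)

Registering new MTP models in one place for both configuration and loading would keep the two code paths from drifting apart.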
won't merge for dev branch.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?