
Conversation

@1092626063 (Contributor)

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request adds support for glm4.5 with multi-token prediction (MTP). The changes primarily involve configuration updates and model loading adjustments. I've identified a critical issue in mtp_proposer.py where the model loading is hardcoded, which would break functionality for other MTP models. I've also noted a high-severity maintainability concern in patch_config.py due to significant code duplication and have recommended a refactoring to a more data-driven approach.

Comment on lines 91 to 94
 else:
-    self.model = DeepSeekMTP(
+    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
+    self.model = Glm4MoeMTP(
         vllm_config=self.vllm_config).to(target_device)

critical

This change hardcodes Glm4MoeMTP as the model for the MtpProposer. However, MtpProposer is now used for multiple methods (deepseek_mtp, qwen3_next_mtp, glm4_moe_mtp). This will cause incorrect model loading for methods other than glm4_moe_mtp.

The model should be selected dynamically based on the speculative decoding method. You can get the method from self.vllm_config.speculative_config.method.

            else:
                method = self.vllm_config.speculative_config.method
                if method == "deepseek_mtp":
                    self.model = DeepSeekMTP(
                        vllm_config=self.vllm_config).to(target_device)
                elif method == "glm4_moe_mtp":
                    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
                    self.model = Glm4MoeMTP(
                        vllm_config=self.vllm_config).to(target_device)
                else:
                    raise NotImplementedError(
                        f"MTP method '{method}' is not supported in eager mode."
                    )
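
If more MTP variants are expected, a small dispatch table would keep this branch from re-growing into a long if/elif chain. The sketch below is illustrative, not part of the suggested change: the registry name _MTP_MODEL_FACTORIES and the helper _glm4_moe_mtp_cls are hypothetical, and it assumes DeepSeekMTP is already imported at module scope as in the existing file.

def _glm4_moe_mtp_cls():
    # Deferred import so the GLM4 module is only loaded when actually used.
    from vllm.model_executor.models.glm4_moe_mtp import Glm4MoeMTP
    return Glm4MoeMTP

# Hypothetical registry: speculative method name -> draft-model class factory.
_MTP_MODEL_FACTORIES = {
    "deepseek_mtp": lambda: DeepSeekMTP,
    "glm4_moe_mtp": _glm4_moe_mtp_cls,
}

method = self.vllm_config.speculative_config.method
if method not in _MTP_MODEL_FACTORIES:
    raise NotImplementedError(
        f"MTP method '{method}' is not supported in eager mode.")
model_cls = _MTP_MODEL_FACTORIES[method]()
self.model = model_cls(vllm_config=self.vllm_config).to(target_device)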

Comment on lines +138 to +146
elif (self.draft_model_config.hf_config.model_type ==
        "glm4_moe_mtp"):
    self.method = "glm4_moe_mtp"
    if self.num_speculative_tokens > 1:
        logger.warning(
            "All GLM4 MTP models only have " \
            "one layer. Might need some code changes " \
            "to support multiple layers."
        )

high

This elif block is almost identical to several others for different MTP models (e.g., deepseek_mtp, ernie_mtp, qwen3_next_mtp). This introduces significant code duplication, making the code harder to read and maintain.

Consider refactoring this if/elif chain into a data-driven approach. You could use a dictionary to map model types to their corresponding method and warning message components. This would eliminate the repeated logic and make it easier to add or modify support for MTP models in the future.

For example:

MTP_CONFIGS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek"),
    "ernie_mtp": ("ernie_mtp", "Ernie"),
    "glm4_moe_mtp": ("glm4_moe_mtp", "GLM4"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next"),
    "longcat_flash_mtp": ("longcat_flash_mtp", "LongCat"),
}

model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_CONFIGS:
    method, model_name = MTP_CONFIGS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} MTP models only have one layer. "
            "Might need some code changes to support multiple layers."
        )
# ... then handle other non-MTP cases

While a full refactoring is a larger change, applying this pattern would greatly improve code quality.
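
Under this table-driven pattern, supporting a future MTP variant becomes a one-line registry entry rather than another elif block. The model type below is made up purely for illustration:

# Hypothetical: registering a new MTP model type under the pattern above.
MTP_CONFIGS["foo_mtp"] = ("foo_mtp", "Foo")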

@wangxiyuan added the hold-on label (The PR should be hold-on but no need to release) on Dec 2, 2025
@wangxiyuan (Collaborator)

won't merge for dev branch.
