
Conversation

@zhenwenqi2024 (Contributor) commented Dec 16, 2025

What this PR does / why we need it?

Currently, _dummy_run in npu_model_runner differs from gpu_model_runner in important features such as graph_dispatch and build_atten_metadata. This PR updates it to be consistent with gpu_model_runner.

Does this PR introduce any user-facing change?

NA

How was this patch tested?
vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the _dummy_run method in NPUModelRunner to align it more closely with the GPUModelRunner implementation. This is a significant and positive change for maintainability. However, I've identified a critical bug and a high-severity issue that need to be addressed.

  1. A premature return statement within a loop in the new _build_attention_metadata method will cause it to exit early, leading to incomplete metadata and likely runtime errors.
  2. In the refactored profile_run method, the in_profile_run flag is not set for a _dummy_run call, which can lead to incorrect behavior during profiling.

I've provided specific comments and suggestions to fix these issues.

spec_decode_common_attn_metadata = (
    spec_decode_common_attn_metadata.unpadded(
        num_tokens, num_reqs))
return attn_metadata, spec_decode_common_attn_metadata

critical

The return statement is inside the for loop that iterates over kv_cache_groups. This will cause the function to exit after processing only the first KV cache group, potentially leaving attn_metadata incomplete if there are multiple groups. The return statement should be moved outside the loop to ensure all KV cache groups are processed.
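A minimal sketch of the intended control flow, assuming a loop over kv_cache_groups inside _build_attention_metadata; the build_metadata_for_group helper and the loop variables are hypothetical and used only for illustration:

    attn_metadata = {}
    for group_id, kv_cache_group in enumerate(kv_cache_groups):
        # Build metadata for every KV cache group, not just the first one.
        attn_metadata[group_id] = build_metadata_for_group(kv_cache_group)

    # Return only after the loop has processed all groups.
    if spec_decode_common_attn_metadata is not None:
        spec_decode_common_attn_metadata = (
            spec_decode_common_attn_metadata.unpadded(
                num_tokens, num_reqs))
    return attn_metadata, spec_decode_common_attn_metadata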

mc2_tokens_capacity = get_mc2_tokens_capacity()
if self.max_num_tokens > mc2_tokens_capacity and \
        select_moe_comm_method(mc2_tokens_capacity, self.vllm_config) == MoECommType.MC2:
    self._dummy_run(mc2_tokens_capacity, is_profile=True)

high

The in_profile_run attribute is not set for the initial _dummy_run call within profile_run. This can lead to incorrect behavior during profiling, as some logic (e.g., _skip_all_reduce_acorss_dp_group) depends on this flag to distinguish profiling runs from regular execution. The in_profile_run flag should be set to True before this _dummy_run call and reset to False after, similar to how it's handled in the base GPUModelRunner.profile_run and the old implementation.

Suggested change
-    self._dummy_run(mc2_tokens_capacity, is_profile=True)
+    self.in_profile_run = True
+    self._dummy_run(mc2_tokens_capacity, is_profile=True)
+    self.in_profile_run = False
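A related sketch, not part of the suggestion itself: wrapping the call in try/finally guarantees the flag is reset even if the dummy run raises, while keeping the same intent as the suggested change above:

    self.in_profile_run = True
    try:
        self._dummy_run(mc2_tokens_capacity, is_profile=True)
    finally:
        # Reset even on failure so later execution is not treated as a profiling run.
        self.in_profile_run = False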

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions

This pull request has conflicts; please resolve them before we can evaluate it.
