[Feature] Refactor _dummy_run in model_runner #5090
Conversation
Signed-off-by: zhenwenqi2024 <[email protected]>
Code Review
This pull request refactors the _dummy_run method in NPUModelRunner to align it more closely with the GPUModelRunner implementation. This is a significant and positive change for maintainability. However, I've identified a critical bug and a high-severity issue that need to be addressed.
- A premature `return` statement within a loop in the new `_build_attention_metadata` method will cause it to exit early, leading to incomplete metadata and likely runtime errors.
- In the refactored `profile_run` method, the `in_profile_run` flag is not set for a `_dummy_run` call, which can lead to incorrect behavior during profiling.

I've provided specific comments and suggestions to fix these issues.
```python
spec_decode_common_attn_metadata = (
    spec_decode_common_attn_metadata.unpadded(
        num_tokens, num_reqs))
return attn_metadata, spec_decode_common_attn_metadata
```
The `return` statement is inside the `for` loop that iterates over `kv_cache_groups`. This will cause the function to exit after processing only the first KV cache group, potentially leaving `attn_metadata` incomplete if there are multiple groups. The `return` statement should be moved outside the loop to ensure all KV cache groups are processed.
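For illustration, a minimal sketch of the corrected control flow; the per-group builder `_build_group_metadata` and the group attributes used here are hypothetical stand-ins for the actual code:

```python
# Sketch (hypothetical helper names): the return sits outside the loop,
# so metadata is built for every KV cache group before the function exits.
def _build_attention_metadata(self, num_tokens: int, num_reqs: int):
    attn_metadata = {}
    spec_decode_common_attn_metadata = None
    for group in self.kv_cache_groups:
        # A `return` placed inside this loop would drop all remaining groups.
        attn_metadata[group.id] = self._build_group_metadata(
            group, num_tokens, num_reqs)
    return attn_metadata, spec_decode_common_attn_metadata
```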
```python
mc2_tokens_capacity = get_mc2_tokens_capacity()
if self.max_num_tokens > mc2_tokens_capacity and \
        select_moe_comm_method(mc2_tokens_capacity, self.vllm_config) == MoECommType.MC2:
    self._dummy_run(mc2_tokens_capacity, is_profile=True)
```
The `in_profile_run` attribute is not set for the initial `_dummy_run` call within `profile_run`. This can lead to incorrect behavior during profiling, as some logic (e.g., `_skip_all_reduce_acorss_dp_group`) depends on this flag to distinguish profiling runs from regular execution. The `in_profile_run` flag should be set to `True` before this `_dummy_run` call and reset to `False` after, similar to how it's handled in the base `GPUModelRunner.profile_run` and the old implementation.
Suggested change:

```diff
- self._dummy_run(mc2_tokens_capacity, is_profile=True)
+ self.in_profile_run = True
+ self._dummy_run(mc2_tokens_capacity, is_profile=True)
+ self.in_profile_run = False
```
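If the flag should be restored even when the dummy run raises, a `try`/`finally` variant of the suggestion above is a safer pattern; this is a sketch of that alternative, not the existing implementation:

```python
# Sketch: ensure the profiling flag is reset even if _dummy_run raises.
self.in_profile_run = True
try:
    self._dummy_run(mc2_tokens_capacity, is_profile=True)
finally:
    self.in_profile_run = False
```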
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Currently, `_dummy_run` in `npu_model_runner` differs from the `gpu_model_runner` implementation in several important features, such as graph dispatch and attention metadata building. This PR refactors it to stay consistent with `gpu_model_runner`.
Does this PR introduce any user-facing change?
NA
How was this patch tested?
vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e