
Conversation

@Ronald1995 Ronald1995 commented Nov 27, 2025

What this PR does / why we need it?

This PR adds async_scheduling support for MTP, following the upstream vLLM PR vllm-project/vllm#24799.
It also fixes some synchronization problems in vllm-ascend.
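
For reference, a minimal launch sketch of the combination this PR targets (hypothetical: the `deepseek_mtp` method key, the model name, and the `async_scheduling` engine argument are assumptions based on upstream vLLM and may differ by version):

```python
# Hypothetical sketch, not code from this PR: MTP speculative decoding
# combined with async scheduling. All names are assumptions from upstream vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # assumption: an MTP-capable model
    speculative_config={
        "method": "deepseek_mtp",      # assumption: the MTP proposer key
        "num_speculative_tokens": 1,
    },
    async_scheduling=True,             # the scheduling mode this PR enables for MTP
)
out = llm.generate(["Hello, Ascend!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```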

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for asynchronous Multi-Token Prediction (MTP), a significant feature for improving speculative-decoding performance. The changes touch the rejection sampler, the MTP proposer, and the v1 model runner to handle the asynchronous state updates this feature requires. The use of .pin_memory(), along with other optimizations, is a good performance improvement.
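
To illustrate why pinned host memory matters here (a generic PyTorch pattern under assumed torch_npu APIs, not code from this PR): a pinned CPU tensor lets the host-to-device copy run asynchronously on a side stream instead of blocking the default stream.

```python
# Generic pinned-memory async-copy sketch, not this PR's code.
# Assumes torch_npu is installed so the torch.npu namespace mirrors torch.cuda.
import torch
import torch_npu  # noqa: F401

host = torch.empty(1024, dtype=torch.int64, pin_memory=True)  # pinned staging buffer
host.copy_(torch.arange(1024))                                # fill on the CPU side

copy_stream = torch.npu.Stream()
with torch.npu.stream(copy_stream):
    # non_blocking=True only truly overlaps when the source tensor is pinned
    dev_tensor = host.to("npu", non_blocking=True)
torch.npu.current_stream().wait_stream(copy_stream)           # order before first use
```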

Additionally, this PR includes a substantial refactoring by adding a new v2 worker implementation (vllm_ascend/worker/v2/). This new implementation seems to align with the upstream vLLM v2 worker API, which is a positive direction for maintainability.

However, the new v2 worker files (aclgraph_utils.py, async_utils.py, model_runner.py) contain several critical issues. They appear to be partially copy-pasted from CUDA-specific code and use torch.cuda APIs instead of the required torch.npu APIs for Ascend devices. This will lead to runtime errors. There is also a bug in the kv_cache_dtype initialization in the new NPUModelRunner. These issues must be addressed before this PR can be merged.
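
For concreteness, the substitution being asked for might look like the following (illustrative only; torch_npu is assumed to mirror the torch.cuda stream/event API for these common cases):

```python
# Illustrative cuda -> npu substitution, not code from this PR.
import torch
import torch_npu  # noqa: F401  (registers the torch.npu namespace)

# CUDA-specific calls like those flagged in the new v2 worker files:
#   stream = torch.cuda.Stream()
#   event = torch.cuda.Event()
#   torch.cuda.current_stream().wait_event(event)

# Ascend equivalents:
stream = torch.npu.Stream()
event = torch.npu.Event()
with torch.npu.stream(stream):
    pass  # async work launched on the side stream
event.record(stream)
torch.npu.current_stream().wait_event(event)
```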

@Ronald1995 Ronald1995 force-pushed the async_mtp3 branch 2 times, most recently from c7c093f to c3ad518 on November 28, 2025 at 09:24
Signed-off-by: Ronald1995 <[email protected]>

github-actions bot commented Dec 2, 2025

This pull request has conflicts; please resolve them before we can evaluate it.

Signed-off-by: Ronald1995 <[email protected]>