support async mtp #4511
base: main
Conversation
Signed-off-by: Ronald1995 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request introduces support for asynchronous Multi-Token Prediction (MTP), a significant feature for improving speculative-decoding performance. The changes modify the rejection sampler, the MTP proposer, and the v1 model runner to handle the asynchronous state updates this feature requires. The use of .pin_memory() and the other optimizations are good performance improvements.
Additionally, this PR includes a substantial refactoring by adding a new v2 worker implementation (vllm_ascend/worker/v2/). This new implementation seems to align with the upstream vLLM v2 worker API, which is a positive direction for maintainability.
However, the new v2 worker files (aclgraph_utils.py, async_utils.py, model_runner.py) contain several critical issues. They appear to be partially copy-pasted from CUDA-specific code and use torch.cuda APIs instead of the required torch.npu APIs for Ascend devices. This will lead to runtime errors. There is also a bug in the kv_cache_dtype initialization in the new NPUModelRunner. These issues must be addressed before this PR can be merged.
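To illustrate the portability issue the review raises (hard-coded torch.cuda calls failing on Ascend), a common pattern is to select the platform device module once at startup. The sketch below is hypothetical and self-contained: it simulates the torch.cuda / torch.npu namespaces with plain objects so it runs anywhere; real code would import torch (and the torch_npu plugin on Ascend) instead.

```python
# Hypothetical sketch: pick the platform device module at startup instead of
# hard-coding torch.cuda. The "torch" object below is a stand-in built from
# SimpleNamespace so this example is runnable without PyTorch installed.
from types import SimpleNamespace

torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False, Event=lambda: "cuda-event"),
    npu=SimpleNamespace(is_available=lambda: True, Event=lambda: "npu-event"),
)

def current_device_module():
    # Prefer the NPU backend when it is present and available (as on Ascend
    # hosts with torch_npu loaded); otherwise fall back to CUDA.
    npu = getattr(torch, "npu", None)
    if npu is not None and npu.is_available():
        return npu
    return torch.cuda

dev = current_device_module()
print(dev.Event())  # "npu-event" in this simulation, since npu.is_available()
```

Routing every stream/event/allocator call through one helper like this keeps the copied CUDA code paths from breaking at runtime on NPU devices.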
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
This PR adds async scheduling support for MTP, following vLLM PR vllm-project/vllm#24799. It also fixes some synchronization problems in vllm-ascend.
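The core idea of async scheduling is to let the CPU prepare the next step's inputs while the device is still executing the current step, synchronizing only when a result is actually needed. The sketch below is a hypothetical, PyTorch-free simulation of that overlap using a worker thread as the "device" and a bounded queue as the submission stream; it is not the PR's implementation.

```python
# Hypothetical sketch of async scheduling: the scheduler enqueues work for the
# "device" without waiting for each step to finish, overlapping CPU-side input
# preparation with (simulated) device execution.
import queue
import threading
import time

work = queue.Queue(maxsize=2)  # bounded, like a limited number of in-flight steps
results = []

def device_loop():
    # Consumes steps in FIFO order, simulating kernel execution per step.
    while True:
        step = work.get()
        if step is None:  # sentinel: shut down the device loop
            break
        time.sleep(0.01)  # simulated device-side execution time
        results.append(step)

device = threading.Thread(target=device_loop)
device.start()

for step in range(3):
    # The scheduler returns immediately after enqueueing; it can now build the
    # next step's inputs while the device is still busy with this one.
    work.put(step)

work.put(None)   # signal shutdown, then synchronize on completion
device.join()
print(results)   # [0, 1, 2] — steps complete in submission order
```

The tricky part the PR description alludes to is exactly this final synchronization: state that the CPU reads back (e.g. accepted draft tokens) must not be consumed before the corresponding device step has completed.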
Does this PR introduce any user-facing change?
How was this patch tested?