support async mtp #4511
base: main
Conversation
Signed-off-by: Ronald1995 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request introduces support for asynchronous Multi-Token Prediction (MTP), a significant feature for improving speculative-decoding performance. The changes modify the rejection sampler, the MTP proposer, and the v1 model runner to handle the asynchronous state updates this feature requires. The use of .pin_memory() and the other optimizations are good performance improvements.
Additionally, this PR includes a substantial refactoring by adding a new v2 worker implementation (vllm_ascend/worker/v2/). This new implementation seems to align with the upstream vLLM v2 worker API, which is a positive direction for maintainability.
However, the new v2 worker files (aclgraph_utils.py, async_utils.py, model_runner.py) contain several critical issues. They appear to be partially copy-pasted from CUDA-specific code and use torch.cuda APIs instead of the required torch.npu APIs for Ascend devices. This will lead to runtime errors. There is also a bug in the kv_cache_dtype initialization in the new NPUModelRunner. These issues must be addressed before this PR can be merged.
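To illustrate the portability issue the review raises (hard-coded torch.cuda calls failing on Ascend), a common pattern is to select the platform device module once at startup. The sketch below is hypothetical and self-contained: it simulates the torch.cuda / torch.npu namespaces with plain objects so it runs anywhere; real code would import torch (and the torch_npu plugin on Ascend) instead.

```python
# Hypothetical sketch: pick the platform device module at startup instead of
# hard-coding torch.cuda. The "torch" object below is a stand-in built from
# SimpleNamespace so this example is runnable without PyTorch installed.
from types import SimpleNamespace

torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False, Event=lambda: "cuda-event"),
    npu=SimpleNamespace(is_available=lambda: True, Event=lambda: "npu-event"),
)

def current_device_module():
    # Prefer the NPU backend when it is present and available (as on Ascend
    # hosts with torch_npu loaded); otherwise fall back to CUDA.
    npu = getattr(torch, "npu", None)
    if npu is not None and npu.is_available():
        return npu
    return torch.cuda

dev = current_device_module()
print(dev.Event())  # "npu-event" in this simulation, since npu.is_available()
```

Routing every stream/event/allocator call through one helper like this keeps the copied CUDA code paths from breaking at runtime on NPU devices.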
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
This PR adds async scheduling support for MTP, following vLLM PR vllm-project/vllm#24799. It also fixes some synchronization problems in vllm-ascend.
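The core idea of async scheduling is to let the CPU prepare the next step's inputs while the device is still executing the current step, synchronizing only when a result is actually needed. The sketch below is a hypothetical, PyTorch-free simulation of that overlap using a worker thread as the "device" and a bounded queue as the submission stream; it is not the PR's implementation.

```python
# Hypothetical sketch of async scheduling: the scheduler enqueues work for the
# "device" without waiting for each step to finish, overlapping CPU-side input
# preparation with (simulated) device execution.
import queue
import threading
import time

work = queue.Queue(maxsize=2)  # bounded, like a limited number of in-flight steps
results = []

def device_loop():
    # Consumes steps in FIFO order, simulating kernel execution per step.
    while True:
        step = work.get()
        if step is None:  # sentinel: shut down the device loop
            break
        time.sleep(0.01)  # simulated device-side execution time
        results.append(step)

device = threading.Thread(target=device_loop)
device.start()

for step in range(3):
    # The scheduler returns immediately after enqueueing; it can now build the
    # next step's inputs while the device is still busy with this one.
    work.put(step)

work.put(None)   # signal shutdown, then synchronize on completion
device.join()
print(results)   # [0, 1, 2] — steps complete in submission order
```

The tricky part the PR description alludes to is exactly this final synchronization: state that the CPU reads back (e.g. accepted draft tokens) must not be consumed before the corresponding device step has completed.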
Does this PR introduce any user-facing change?
How was this patch tested?