
Conversation

@whx-sjtu
Collaborator

This PR moves the communication operation of the shared experts out of the extra stream, because I found that running the shared experts in a separate stream together with aclgraph can cause rtMemcpy-related errors.
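
For illustration, here is a minimal sketch (not the actual vllm-ascend code) of the pattern: the shared-expert compute stays on the extra stream, while the collective is issued on the default stream. `shared_expert_forward`, `shared_experts`, and `extra_stream` are hypothetical names; `torch.npu` is assumed to expose the same stream API as `torch.cuda` through torch_npu, and `tensor_model_parallel_all_reduce` is vLLM's tensor-parallel all-reduce helper.

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu stream API on Ascend

from vllm.distributed import tensor_model_parallel_all_reduce


def shared_expert_forward(shared_experts, hidden_states, extra_stream):
    # Run only the compute part of the shared experts on the extra stream;
    # the default stream keeps working on the routed experts in parallel.
    extra_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(extra_stream):
        shared_out = shared_experts(hidden_states)
    # Join the streams before touching the result, then issue the
    # communication on the default stream so graph capture never records a
    # copy/collective on the extra stream.
    torch.npu.current_stream().wait_stream(extra_stream)
    shared_out = tensor_model_parallel_all_reduce(shared_out)
    return shared_out
```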

Furthermore, I use a single global variable as the extra stream object, so a new stream is not allocated for every layer in full-graph mode.
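
A minimal sketch of that idea, assuming a lazily created module-level singleton (the names `_SHARED_EXPERT_STREAM` and `get_shared_expert_stream` are hypothetical, not the actual vllm-ascend identifiers):

```python
import torch
import torch_npu  # noqa: F401  # provides torch.npu on Ascend

# One process-wide extra stream, created lazily and reused by every MoE layer,
# instead of allocating a fresh stream per layer during full-graph capture.
_SHARED_EXPERT_STREAM = None


def get_shared_expert_stream():
    global _SHARED_EXPERT_STREAM
    if _SHARED_EXPERT_STREAM is None:
        _SHARED_EXPERT_STREAM = torch.npu.Stream()
    return _SHARED_EXPERT_STREAM
```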

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot
Contributor


Code Review

This pull request refactors the multi-stream handling for Mixture-of-Experts (MoE) layers, specifically for shared experts. The changes involve moving communication operations for shared experts out of the extra computation stream to prevent potential errors with aclgraph. Additionally, it introduces a global stream for shared expert calculations to avoid re-creating streams for each layer, which is a good optimization for graph mode. The implementation looks correct and aligns with the stated goals. I have one piece of feedback regarding a misleading comment in a new utility function, which should be corrected for clarity.

…llm-project#3582)

This PR moves the communication operation of the shared experts out of
the extra stream, because I found that running the shared experts in a
separate stream together with aclgraph can cause rtMemcpy-related errors.

Furthermore, I use a single global variable as the extra stream object,
so a new stream is not allocated for every layer in full-graph mode.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: whx-sjtu <[email protected]>
@whx-sjtu added the ready (read for review) label on Oct 25, 2025
@yiz-liu merged commit a58ff9e into vllm-project:v0.11.0-dev on Oct 25, 2025
29 checks passed
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 28, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025