
Conversation

@dsxsteven dsxsteven commented Nov 17, 2025

What this PR does / why we need it?

This PR optimizes EPLB heat (MoE expert load) collection and aggregation by moving them onto a dedicated asynchronous stream so they can overlap with other computation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an asynchronous stream to handle MoE expert load (heat) collection, aiming to optimize performance by overlapping it with other computations. The changes involve creating a dedicated stream and switching to it for MoE load accumulation and gathering.

My review has identified a critical race condition and a high-severity performance issue in vllm_ascend/eplb/eplb_updator.py. The race condition is due to missing synchronization between the main computation stream and the new asynchronous stream, which could lead to incorrect load balancing. The performance issue is related to a buffer being re-allocated on every call, which should be addressed for efficiency. Please see the detailed comments for suggestions on how to fix these issues.

Comment on lines 156 to 171
```python
with npu_stream_switch(moe_load_async_stream()):
    self.world_size = dist.get_world_size()
    self.device = local_load.device
    if self._gather_buffer is None:
        shape = (self.world_size, *local_load.shape)
        self._gather_buffer = torch.empty(shape,
                                          dtype=local_load.dtype,
                                          device=self.device)

    dist.all_gather_into_tensor(self._gather_buffer, local_load)

    moe_load = self._gather_buffer.permute(1, 0, 2)
    self.shared_dict["moe_load"] = moe_load.cpu()
    logger.debug(
        f"[ModelRunner] Updated shared_dict['moe_load'] shape={moe_load.shape}"
    )
```

critical

There is a race condition here. The moe_load tensors are updated asynchronously on moe_load_async_stream in fused_moe.py. However, self.adaptor.get_rank_expert_workload() is called on the default stream (on line 152, before this block) to read these tensors without any synchronization. This can lead to reading stale or incomplete data, causing incorrect load balancing. To fix this, you must synchronize the streams before reading moe_load. For example, you could add moe_load_async_stream().synchronize() before the call to self.adaptor.get_rank_expert_workload() on line 152.
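A minimal sketch of that fix, assuming the surrounding method looks roughly like the diff above (moe_load_async_stream and self.adaptor.get_rank_expert_workload are taken from the review; the method excerpt itself is illustrative):

```python
class EplbUpdator:
    def compute_and_set_moe_load(self):
        # Wait for the load-accumulation work queued on the async stream
        # (issued from fused_moe.py) before reading it on the default stream.
        moe_load_async_stream().synchronize()

        # Only now is the per-rank workload guaranteed to be up to date.
        local_load = self.adaptor.get_rank_expert_workload()
        return local_load
```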

Comment on lines 159 to 163
```python
if self._gather_buffer is None:
    shape = (self.world_size, *local_load.shape)
    self._gather_buffer = torch.empty(shape,
                                      dtype=local_load.dtype,
                                      device=self.device)
```

high

The self._gather_buffer is reset to None on every call to compute_and_set_moe_load (on line 154). This makes this condition always true, causing the buffer to be re-allocated on every invocation, which is inefficient. To avoid this performance issue, self._gather_buffer should be initialized to None in the __init__ method of the EplbUpdator class, and the line self._gather_buffer = None should be removed from this method.
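A sketch of the suggested restructuring, with a simplified, hypothetical helper signature (the real method derives world_size and local_load itself, as in the diff above):

```python
import torch
import torch.distributed as dist


class EplbUpdator:
    def __init__(self):
        # Allocate the all-gather target lazily on first use and keep it
        # across calls instead of re-creating it on every invocation.
        self._gather_buffer = None

    def _gather_moe_load(self, local_load: torch.Tensor) -> torch.Tensor:
        # Note: no `self._gather_buffer = None` reset here, so the buffer
        # allocated on the first call is reused afterwards.
        if self._gather_buffer is None:
            shape = (dist.get_world_size(), *local_load.shape)
            self._gather_buffer = torch.empty(shape,
                                              dtype=local_load.dtype,
                                              device=local_load.device)
        dist.all_gather_into_tensor(self._gather_buffer, local_load)
        return self._gather_buffer.permute(1, 0, 2)
```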

Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
@dsxsteven dsxsteven closed this Nov 18, 2025
@dsxsteven dsxsteven reopened this Nov 18, 2025
@dsxsteven dsxsteven changed the title multistream for eplb heat collection [Feat]multistream for eplb heat collection and aggregation Nov 18, 2025
@dsxsteven dsxsteven changed the title multistream for eplb heat collection [Feat] Multi-stream for eplb heat collection and aggregation Nov 18, 2025
```python
logger.debug(
    f"[ModelRunner] Updated shared_dict['moe_load'] shape={moe_load.shape}"
)
with npu_stream_switch(moe_load_async_stream()):
```

maybe better to set moe_load_async_stream as class attribute of EplbUpdator
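A possible shape for that suggestion (a sketch only: the stream is held on the EplbUpdator instance instead of being fetched from a module-level helper; npu_stream_switch is assumed imported from the project's utilities as in the diff above):

```python
import torch_npu


class EplbUpdator:
    def __init__(self):
        # Own the async stream as an attribute rather than calling the
        # module-level moe_load_async_stream() helper each time.
        self.moe_load_async_stream = torch_npu.npu.Stream()

    def compute_and_set_moe_load(self, local_load):
        with npu_stream_switch(self.moe_load_async_stream):
            ...  # all_gather and shared_dict update, as in the diff above
```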

@whx-sjtu whx-sjtu added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 19, 2025
@github-actions

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@dsxsteven dsxsteven force-pushed the main_1114_multistream-heataggr branch from 6c7c895 to e09650a Compare December 4, 2025 03:39
```python
    return _SHARED_EXPERTS_CALCULATION_STREAM


def moe_load_async_stream() -> torch_npu.npu.Stream:
```

move this function to eplb module
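One way that move could look (a sketch; the lazy-init pattern mirrors the shared-experts stream helper above, and placing it in a module under the eplb package is an assumption):

```python
# Hypothetical stream helper living in the eplb module.
import torch_npu

_MOE_LOAD_ASYNC_STREAM = None


def moe_load_async_stream() -> torch_npu.npu.Stream:
    # Create the stream once, on first use, and reuse it afterwards.
    global _MOE_LOAD_ASYNC_STREAM
    if _MOE_LOAD_ASYNC_STREAM is None:
        _MOE_LOAD_ASYNC_STREAM = torch_npu.npu.Stream()
    return _MOE_LOAD_ASYNC_STREAM
```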

@whx-sjtu whx-sjtu left a comment

LGTM

@dsxsteven dsxsteven force-pushed the main_1114_multistream-heataggr branch from cf110ae to d5b59ad Compare December 4, 2025 11:20
Labels

module:ops · ready (read for review) · ready-for-test (start test by label for PR)
