
Conversation

@Mercykid-bash (Contributor) commented Dec 10, 2025

Feature: Support Top9 Mixed Placement for Shared & Router Experts

Overview

This PR implements core modifications to enable Top9 mixed placement of shared experts and router experts in the MOE (Mixture of Experts) architecture, with the goal of maximizing performance gains from EPLB (Expert Load Balancer) by optimizing expert deployment and inference computation.

Key Changes

1. Shared Expert Weight Loading

  • Implemented dedicated weight-loading logic for shared experts to support mixed deployment alongside router experts (a sketch of the idea follows this list)
  • Ensured seamless integration with the existing weight-loading pipeline while maintaining compatibility with standard expert parallelism
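A minimal sketch of the idea, not the actual code in this PR: the tensor names, sizes, and the convention of storing the shared expert in a trailing slot are all illustrative assumptions.

```python
import torch

# Hypothetical sketch: keep shared-expert weights in the same fused tensor
# as the router experts, using one extra trailing expert slot, so a single
# weight-loading path serves both. Names and sizes are illustrative only.
NUM_ROUTER_EXPERTS = 256
NUM_SHARED_EXPERTS = 1
HIDDEN_SIZE, INTERMEDIATE_SIZE = 128, 512

# Fused gate/up projection weights for all experts hosted on this rank.
w13 = torch.empty(NUM_ROUTER_EXPERTS + NUM_SHARED_EXPERTS,
                  2 * INTERMEDIATE_SIZE, HIDDEN_SIZE)

def load_expert_weight(expert_id: int, loaded: torch.Tensor) -> None:
    """Copy one checkpoint shard into the fused buffer.

    Router experts keep their original ids [0, 256); the shared expert is
    stored at the trailing slot (id 256 in this sketch).
    """
    w13[expert_id].copy_(loaded)

# Router expert 0 and the shared expert load through the same path.
load_expert_weight(0, torch.randn(2 * INTERMEDIATE_SIZE, HIDDEN_SIZE))
load_expert_weight(NUM_ROUTER_EXPERTS, torch.randn(2 * INTERMEDIATE_SIZE, HIDDEN_SIZE))
```

Because the shared expert occupies an ordinary slot in the fused buffers, the routed-expert loading path can handle it without a separate code branch.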

2. Top9 Routing Selection Enhancement

  • Modified the Top9 routing mechanism to recognize and route tokens to both shared and router experts
  • Updated the routing score calculation to account for mixed expert placement, so expert selection stays consistent across both expert kinds (see the sketch after this list)
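A hedged sketch of what mixed Top9 selection can look like. The function name, the fixed shared-expert gate weight, and the plain softmax scoring are assumptions for illustration; the PR's actual selector may use grouped top-k and different normalization.

```python
import torch

def select_experts_mixed(router_logits: torch.Tensor,
                         top_k: int = 8,
                         shared_expert_id: int = 256,
                         shared_weight: float = 1.0):
    """Illustrative Top9 selection for mixed placement.

    Each token picks `top_k` router experts from the router logits and,
    in addition, the shared expert with a constant gate weight, so the
    downstream fused-MoE kernels see top_k + 1 = 9 experts per token.
    """
    scores = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)

    num_tokens = router_logits.shape[0]
    shared_ids = torch.full((num_tokens, 1), shared_expert_id, dtype=topk_ids.dtype)
    shared_ws = torch.full((num_tokens, 1), shared_weight, dtype=topk_weights.dtype)

    # Append the shared expert as a ninth selection for every token.
    return (torch.cat([topk_weights, shared_ws], dim=-1),
            torch.cat([topk_ids, shared_ids], dim=-1))

weights, ids = select_experts_mixed(torch.randn(4, 256))  # 4 tokens, 256 router experts
print(weights.shape, ids.shape)                           # torch.Size([4, 9]) for both
```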

3. Mixed Placement Inference Computation

  • Refactored MOE inference computation logic to support:
    • Hybrid deployment of shared and router experts across devices
    • Parallel computation of shared/router expert forward passes
    • Proper aggregation of outputs from mixed expert types (a reference sketch follows this list)
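For intuition, a reference (unfused) forward pass, under the assumption that shared and router expert weights live in one fused tensor. The kernels this PR actually touches (e.g. fused_moe.py, w8a8_dynamic.py) are fused and may be quantized, so this is only a readable approximation.

```python
import torch
import torch.nn.functional as F

def mixed_moe_forward(hidden, w13, w2, topk_ids, topk_weights):
    """Reference forward over mixed shared/router experts (not fused).

    w13: [num_experts, 2*I, H], w2: [num_experts, H, I], where num_experts
    already includes the shared expert, so both expert types run through
    the same loop and are aggregated with the per-token gate weights.
    """
    inter = w13.shape[1] // 2
    out = torch.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        for k in range(topk_ids.shape[1]):          # 9 experts per token
            e = topk_ids[t, k]
            gate_up = w13[e] @ hidden[t]            # [2*I]
            act = F.silu(gate_up[:inter]) * gate_up[inter:]
            out[t] += topk_weights[t, k] * (w2[e] @ act)
    return out

H, INTER, E = 64, 128, 257                          # 256 router + 1 shared
x = torch.randn(4, H)
w13 = torch.randn(E, 2 * INTER, H) * 0.02
w2 = torch.randn(E, H, INTER) * 0.02
ids = torch.randint(0, E, (4, 9))
gates = torch.softmax(torch.randn(4, 9), dim=-1)
print(mixed_moe_forward(x, w13, w2, ids, gates).shape)   # torch.Size([4, 64])
```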

4. EPLB Performance Optimization

  • Aligned the mixed placement logic with EPLB load balancing policies to further amplify the performance benefits
  • Ensured load balancing decisions are applied consistently across both shared and router experts (illustrated by the sketch after this list)
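A toy illustration of that uniform treatment; the identity map and tensor layout are placeholders, not the PR's actual EPLB data structures.

```python
import torch

# Toy illustration: an EPLB logical->physical expert map that simply has
# one extra entry for the shared expert (logical id 256), so rebalancing
# decisions apply to shared and router experts through the same lookup.
# The identity map stands in for whatever the EPLB policy computes.
NUM_LOGICAL_EXPERTS = 257                      # 256 router + 1 shared
logical_to_physical = torch.arange(NUM_LOGICAL_EXPERTS)

topk_ids = torch.tensor([[3, 17, 42, 256]])    # last entry is the shared expert
physical_ids = logical_to_physical[topk_ids]   # one lookup covers both kinds
print(physical_ids)                            # tensor([[  3,  17,  42, 256]])
```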

Technical Details

  • Supported expert-placement tensor shape for mixed placement: [num_layers, ep_size, 9], compatible with 257 logical experts in total (256 router + 1 shared); a sketch follows this list
  • Maintained backward compatibility with existing expert deployment configurations (non-mixed mode)
  • Optimized memory usage and computation efficiency for parallel execution of mixed expert types
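A hypothetical construction of such a placement table, assuming ep_size = 32 so that each EP rank hosts its 8 local router experts plus one replica of the shared expert; the real table is produced by the EPLB policy, and its layer count and layout may differ.

```python
import numpy as np

# Hypothetical placement table with shape [num_layers, ep_size, 9].
# Assumed example values: ep_size = 32, so 32 ranks x 8 local router
# experts cover ids 0..255, and every rank's ninth slot holds a replica
# of the shared expert (logical id 256). EPLB may rearrange this freely.
NUM_LAYERS, EP_SIZE, SLOTS_PER_RANK = 4, 32, 9
NUM_ROUTER, SHARED_ID = 256, 256

placement = np.empty((NUM_LAYERS, EP_SIZE, SLOTS_PER_RANK), dtype=np.int64)
for rank in range(EP_SIZE):
    placement[:, rank, :8] = np.arange(rank * 8, (rank + 1) * 8)  # local router experts
    placement[:, rank, 8] = SHARED_ID                             # shared-expert replica

assert placement.shape == (NUM_LAYERS, EP_SIZE, SLOTS_PER_RANK)
assert set(np.unique(placement)) == set(range(NUM_ROUTER + 1))    # 257 logical experts
```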

Performance Impact

  • Enables better load distribution across experts via EPLB by leveraging shared expert capacity
  • Improves overall inference throughput by utilizing parallel computation of mixed-deployed experts
  • Reduces expert idle time through more granular routing and load balancing

Compatibility

  • Fully compatible with existing EPLB policies and expert load balancing mechanisms
  • Supports both mixed (shared + router) and standard (router-only) expert deployment modes
  • Validated with Top9 routing scenarios and existing MOE model architectures

Signed-off-by: Che Ruan <[email protected]>

Update fused_moe.py

Signed-off-by: Che Ruan <[email protected]>

Update patch_deepseekv3.py

Signed-off-by: Che Ruan <[email protected]>

Update patch_deepseekv3.py

Signed-off-by: Che Ruan <[email protected]>

Update vllm_adaptor.py

Update moe_mlp.py

Update moe_mlp.py

Update eplb_device_transfer_loader.py

Update patch_deepseekv3.py

Update fused_moe.py

Update w8a8_dynamic.py

Update w8a8_dynamic.py

Update patch_deepseekv3.py
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a 'mix-placement' strategy for Mixture-of-Experts models, primarily targeting DeepSeek V3 on Ascend hardware. This approach treats shared experts as if they were routed experts, affecting weight loading and inference logic. The changes are spread across configuration, the MoE layer implementation, expert selection logic, and a new patch for vLLM's DeepSeek model implementations. While the overall implementation appears consistent with the goal, I've identified a critical bug in the quantization logic caused by a typo, as well as a hardcoded magic number in the expert selection logic that should be refactored for better maintainability.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
