[AMD][ROCm] MoRI EP: a high-performance all2all backend #27273
Code Review
This pull request integrates MoRI, a high-performance all-to-all communication kernel, as a new backend for vLLM, primarily targeting AMD GPUs. The changes span several files to add the necessary configuration, a manager class, and the logic to use the new backend. While the integration is mostly well-structured, I've identified a couple of areas for improvement related to code duplication and consistency, detailed in the comments.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
This PR integrates MoRI-EP, a high-performance all2all communication kernel, into vLLM as a new all2all backend (see the MoRI project at https://github.com/ROCm/mori). MoRI also supports CUDA graphs.
This PR follows the design of vLLM's Fused MoE Modular Kernel, which is composed of the following components:
[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]
For the MoRI+AITER path, which is AMD's high-performance practice, the pipeline is:
[Router] → [Quantize-Dispatch] → [Experts] → [Combine]
Two new classes are introduced:
Summary of the performance comparison between MoRI-EP and the naive backend (bs=128 per DP rank):
How to install MoRI
See https://github.com/ROCm/mori
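For reference, a minimal from-source install might look like the sketch below; the exact build steps and ROCm prerequisites are assumptions, so defer to the MoRI README:

```bash
# Hypothetical from-source install; see https://github.com/ROCm/mori
# for the authoritative steps and ROCm/GPU prerequisites.
git clone https://github.com/ROCm/mori.git
cd mori
pip install .
```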
Test Plan
Test platform: MI300X
Accuracy
Serve DeepSeek-V3/R1 (block-scale quant).
Serve DeepSeek-R1-PTPC (per-token, per-channel quant).
See here for more info about PTPC.
Evaluate with GSM8K.
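As a sketch of the evaluation step, the GSM8K run can be driven with lm-evaluation-harness against the running OpenAI-compatible server; the model name, URL, concurrency, and few-shot count below are placeholders, not the exact settings used in this PR:

```bash
# Hypothetical GSM8K evaluation against a local vLLM server;
# model name, base_url, and num_concurrent are placeholders.
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-R1,base_url=http://localhost:8000/v1/completions,num_concurrent=32 \
  --tasks gsm8k \
  --num_fewshot 5
```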
Performance
Test EP8 and EP16 performance and compare against the naive all2all backend.
EP8 with mori backend
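The exact launch command is not reproduced here; a minimal sketch of an EP8 launch, with the model path and parallelism flags as assumptions (only `--all2all-backend mori` is taken from this PR), might look like:

```bash
# Hypothetical EP8 launch; model path and parallel sizes are assumptions.
vllm serve deepseek-ai/DeepSeek-R1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --all2all-backend mori
```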
EP8 with naive backend:
replace `--all2all-backend mori` with `--all2all-backend naive`.
EP16 with mori backend
EP16 with naive backend:
replace `--all2all-backend mori` with `--all2all-backend naive`, and use `--enforce-eager`.
Benchmark:
Use `--random-input-len 1 --random-prefix-len 1023` to simulate PD disaggregation and test decode performance without prefill.
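A representative benchmark invocation under these settings might look like the following sketch; the output length, prompt count, and model name are placeholders, while the input/prefix lengths follow the rationale above:

```bash
# Hypothetical benchmark run; --random-output-len and --num-prompts are
# placeholders. Input/prefix lengths simulate decode-only (PD disagg) load.
vllm bench serve \
  --model deepseek-ai/DeepSeek-R1 \
  --dataset-name random \
  --random-input-len 1 \
  --random-prefix-len 1023 \
  --random-output-len 128 \
  --num-prompts 128
```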
Test Result
Accuracy
MoRI-EP with DeepSeek-R1-PTPC
Decode Performance
Summary
EP8 mori all2all backend
EP8 naive all2all backend
EP16 mori all2all backend
EP16 naive all2all backend
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.