
Conversation

@ganyi1996ppo (Contributor) commented Dec 13, 2025

Purpose

Replace the torch.cat in the decode path with an in-place write from aiter's bmm. In my tests this change yields roughly a 2.5% performance uplift.
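For illustration, a minimal sketch of the pattern (shapes and names here are made up, and plain torch.bmm with out= stands in for the aiter bmm kernel that performs the in-place write in the actual change):

```python
import torch

# Illustrative MLA decode shapes: H heads, B decode tokens,
# separate "nope" and "rope" head dims of the query.
H, B, D_NOPE, D_ROPE = 16, 4, 128, 64

q_nope = torch.randn(H, B, D_NOPE)
w_kc = torch.randn(H, D_NOPE, D_NOPE)   # per-head projection weight
q_pe = torch.randn(H, B, D_ROPE)

# Before: bmm into a temporary, then torch.cat allocates a fresh
# (H, B, D_NOPE + D_ROPE) tensor and copies both halves every step.
decode_q_old = torch.cat([torch.bmm(q_nope, w_kc), q_pe], dim=-1)

# After: preallocate the fused buffer once and let the bmm write its
# result straight into the nope slice, eliminating the cat copy.
decode_q = torch.empty(H, B, D_NOPE + D_ROPE)
torch.bmm(q_nope, w_kc, out=decode_q[..., :D_NOPE])
decode_q[..., D_NOPE:] = q_pe

assert torch.allclose(decode_q, decode_q_old)
```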

Test Plan

gsm8k for accuracy and vllm bench for serving performance; hedged example commands are sketched below.
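Hedged examples of what these runs could look like (the model path is a placeholder and server-side flags are omitted; only the gsm8k task, 20-shot setting, and the ISL/OSL 3584/1024, 64-concurrency, 128-prompt setup are taken from this PR):

```bash
# Accuracy: gsm8k via lm-evaluation-harness on a vLLM backend
lm_eval --model vllm \
  --model_args pretrained=<model> \
  --tasks gsm8k --num_fewshot 20

# Performance: vllm bench serve against a running server,
# matching the reported ISL/OSL 3584/1024, 64 concurrency, 128 prompts
vllm bench serve \
  --model <model> \
  --dataset-name random \
  --random-input-len 3584 --random-output-len 1024 \
  --num-prompts 128 --max-concurrency 64
```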

Test Result

Accuracy on gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9469|±  |0.0062|
|     |       |strict-match    |    20|exact_match|↑  |0.9454|±  |0.0063|

Performance was measured on an MI308 with ISL/OSL 3584/1024, 64-way request concurrency, and 128 prompts:

```
# before
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  139.47    
Total input tokens:                      458624    
Total generated tokens:                  131072    
Request throughput (req/s):              0.92      
Output token throughput (tok/s):         939.77    
Peak output token throughput (tok/s):    1472.00   
Peak concurrent requests:                83.00     
Total token throughput (tok/s):          4228.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          8347.70   
Median TTFT (ms):                        5493.79   
P99 TTFT (ms):                           20215.24  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          59.95     
Median TPOT (ms):                        61.83     
P99 TPOT (ms):                           66.79     
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.95     
Median ITL (ms):                         48.88     
P99 ITL (ms):                            51.42     
==================================================
```

```
# This PR
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  136.29    
Total input tokens:                      458624    
Total generated tokens:                  131072    
Request throughput (req/s):              0.94      
Output token throughput (tok/s):         961.69    
Peak output token throughput (tok/s):    1408.00   
Peak concurrent requests:                83.00     
Total token throughput (tok/s):          4326.66   
---------------Time to First Token----------------
Mean TTFT (ms):                          8353.45   
Median TTFT (ms):                        5495.89   
P99 TTFT (ms):                           20231.12  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.39     
Median TPOT (ms):                        60.66     
P99 TPOT (ms):                           65.29     
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.39     
Median ITL (ms):                         47.37     
P99 ITL (ms):                            49.12     
==================================================
```



mergify bot added the rocm (Related to AMD ROCm) and v1 labels on Dec 13, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request aims to improve performance on ROCm by replacing torch.cat with an in-place write in the bmm operation when aiter is enabled. The changes are logical and should yield the described performance benefits. However, I've identified a critical issue where a code path can create a tuple for decode_q, which is then passed to a function that now asserts the input is not a tuple. This will lead to a runtime error under specific configurations. I have provided a suggestion to fix this by ensuring decode_q is always a tensor for the affected implementation.
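For illustration, a schematic of the failure mode described above (hypothetical names, not the actual vLLM code): one configuration hands the decode path a (q_nope_out, q_pe) tuple, while the updated aiter path now asserts a single tensor, so that branch must produce one fused tensor instead.

```python
import torch

def aiter_mla_decode(decode_q: torch.Tensor) -> torch.Tensor:
    # Updated aiter path: the bmm already wrote into one fused
    # buffer, so a tuple input is no longer accepted here.
    assert not isinstance(decode_q, tuple)
    return decode_q  # ... attention kernel would run here

q_nope_out = torch.randn(16, 4, 128)
q_pe = torch.randn(16, 4, 64)

# Offending branch (schematic): passing this tuple downstream
# would now trip the assertion at runtime.
decode_q = (q_nope_out, q_pe)

# Shape of the suggested fix: always hand the aiter path one tensor.
decode_q = torch.cat([q_nope_out, q_pe], dim=-1)
aiter_mla_decode(decode_q)
```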

@ApostaC (Collaborator) commented Dec 16, 2025

@pavanimajety Hey, could you please take a look at this PR? Thanks!
