
Conversation

Contributor

@h1074112368 h1074112368 commented Dec 4, 2025

What this PR does / why we need it?

This PR adds support for a qdown output to the mlapo operation.

Does this PR introduce any user-facing change?

The mlapo operation gains a new enable_inner_out input flag.

How was this patch tested?

CI passed with newly added and existing tests.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new 'qdown' output, controlled by the enable_inner_out flag. The changes correctly propagate this new parameter from the PyTorch operator definition down to the kernel implementation, adding new logic paths and updating function signatures accordingly. However, my review identified a critical bug related to duplicated arguments being passed to the kernel implementation, which could lead to incorrect computations. Additionally, there is a significant code duplication issue in the new kernel logic, which impacts maintainability. Addressing these points will improve the correctness and quality of the code.

qnope_scale_ptr, q_out0_ptr, kv_cache_out0_ptr, q_out1_ptr, kv_cache_out1_ptr, inner_out_ptr, workspace_ptr,
tiling_ptr, block_dim]() -> int {
mla_preprocess_impl(stream, hidden_state_ptr, quant_scale0_ptr, quant_offset0_ptr, wdqkv_ptr, bias0_ptr,
gamma1_ptr, beta1_ptr, quant_scale1_ptr, quant_offset1_ptr, gamma2_ptr, sin_ptr, cos_ptr, sin_ptr, cos_ptr,

critical

The mla_preprocess_impl function is called with sin_ptr and cos_ptr passed for both (sin1, cos1) and (sin2, cos2) arguments. The kernel implementation seems to expect two distinct pairs of sin/cos tensors. Passing the same pointers for both pairs is likely a bug and could lead to incorrect calculations.

The mla_preprocess function signature should probably be updated to accept a second pair of sin/cos tensors; if the aliasing is intentional, the reason the same tensors are passed twice should be clearly documented.
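If the aliasing is unintentional, a cheap guard at the call site makes the intent explicit. Below is a minimal, self-contained sketch of the idea; the function and variable names are illustrative only, not this project's actual API:

#include <cassert>
#include <cstdio>

// Illustrative only: a callee that expects two distinct sin/cos tables.
// If both pairs alias the same buffers, the second rotary application is
// redundant at best and wrong at worst, so assert distinctness up front.
void apply_two_ropes(const float* sin1, const float* cos1,
                     const float* sin2, const float* cos2) {
    assert((sin1 != sin2 || cos1 != cos2) &&
           "sin/cos pairs are aliased; pass a second pair or document why");
    std::printf("tables are distinct: %d\n", static_cast<int>(sin1 != sin2));
}

int main() {
    float sin_a[4] = {0.f}, cos_a[4] = {1.f};
    float sin_b[4] = {0.f}, cos_b[4] = {1.f};
    apply_two_ropes(sin_a, cos_a, sin_b, cos_b);   // ok: two distinct pairs
    // apply_two_ropes(sin_a, cos_a, sin_a, cos_a); // would trip the assert
    return 0;
}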

Comment on lines +222 to +269
case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER: {
MLAPO_BF16_INNER::MLAOperation<__bf16, 0, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
QuantMode::PER_TENSOR_ASYMM_QUANT>
opBf16Cm0Qm0Inner(mlaTilingData, tiling);
opBf16Cm0Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
s1, s2, s3, s4, s5, innerOut);
if ASCEND_IS_AIC {
opBf16Cm0Qm0Inner.ProcessCube();
}
if ASCEND_IS_AIV {
opBf16Cm0Qm0Inner.ProcessVector();
}
break;
}
case KEY_BF16_CACHEMODE_1_QUANTMODE_0_INNER: {
MLAPO_BF16_INNER::MLAOperation<__bf16, 1, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
QuantMode::PER_TENSOR_ASYMM_QUANT>
opBf16Cm1Qm0Inner(mlaTilingData, tiling);
opBf16Cm1Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
s1, s2, s3, s4, s5, innerOut);
if ASCEND_IS_AIC {
opBf16Cm1Qm0Inner.ProcessCube();
}
if ASCEND_IS_AIV {
opBf16Cm1Qm0Inner.ProcessVector();
}
break;
}
case KEY_BF16_CACHEMODE_3_QUANTMODE_0_INNER: {
MLAPO_BF16_INNER::MLAOperation<__bf16, 3, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
QuantMode::PER_TENSOR_ASYMM_QUANT>
opBf16Cm3Qm0Inner(mlaTilingData, tiling);
opBf16Cm3Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
s1, s2, s3, s4, s5, innerOut);
if ASCEND_IS_AIC {
opBf16Cm3Qm0Inner.ProcessCube();
}
if ASCEND_IS_AIV {
opBf16Cm3Qm0Inner.ProcessVector();
}
break;
}

high

There is significant code duplication across the new case statements for _INNER keys. The logic inside case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER, case KEY_BF16_CACHEMODE_1_QUANTMODE_0_INNER, and case KEY_BF16_CACHEMODE_3_QUANTMODE_0_INNER is identical except for the CacheMode template parameter (0, 1, 3) for MLAPO_BF16_INNER::MLAOperation.

This duplication makes the code harder to read and maintain. Any future changes to this logic will need to be applied in three places, increasing the risk of errors.

To reduce duplication, you could use a helper function templated on the CacheMode to encapsulate the common logic. For example:

template <int CacheMode>
__aicore__ void ProcessInnerOp(MlaTilingData& mlaTilingData, GM_ADDR tiling, /* other params */) {
    MLAPO_BF16_INNER::MLAOperation<__bf16, CacheMode, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
                                   QuantMode::PER_TENSOR_ASYMM_QUANT>
        op(mlaTilingData, tiling);
    op.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
            quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
            bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
            s1, s2, s3, s4, s5, innerOut);
    if ASCEND_IS_AIC {
        op.ProcessCube();
    }
    if ASCEND_IS_AIV {
        op.ProcessVector();
    }
}

// Then in the switch statement:
case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER:
    ProcessInnerOp<0>(mlaTilingData, tiling, ...);
    break;
case KEY_BF16_CACHEMODE_1_QUANTMODE_0_INNER:
    ProcessInnerOp<1>(mlaTilingData, tiling, ...);
    break;
case KEY_BF16_CACHEMODE_3_QUANTMODE_0_INNER:
    ProcessInnerOp<3>(mlaTilingData, tiling, ...);
    break;

Since adding a new function might be a larger change, you could also use a macro to achieve a similar result within the current function body.
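As a sketch of that macro alternative, the argument list below is copied from the existing cases; the macro name itself is illustrative:

#define MLAPO_INNER_CASE(KEY, CACHE_MODE)                                      \
    case KEY: {                                                                \
        MLAPO_BF16_INNER::MLAOperation<__bf16, CACHE_MODE, DataFormat::NZ,     \
                                       DataFormat::NZ, DataFormat::ND,         \
                                       QuantMode::PER_TENSOR_ASYMM_QUANT>      \
            op(mlaTilingData, tiling);                                         \
        op.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2,  \
                beta2, quantScale2, quantOffset2, gamma3, sin1, cos1, sin2,    \
                cos2, keycache, slotMapping, wuq, bias2, wuk, descale1,        \
                descale2, ctkvScale, qnopeScale, q, keycacheOut, q2,           \
                keycacheOut2, s1, s2, s3, s4, s5, innerOut);                   \
        if ASCEND_IS_AIC {                                                     \
            op.ProcessCube();                                                  \
        }                                                                      \
        if ASCEND_IS_AIV {                                                     \
            op.ProcessVector();                                                \
        }                                                                      \
        break;                                                                 \
    }

// In the switch statement:
MLAPO_INNER_CASE(KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER, 0)
MLAPO_INNER_CASE(KEY_BF16_CACHEMODE_1_QUANTMODE_0_INNER, 1)
MLAPO_INNER_CASE(KEY_BF16_CACHEMODE_3_QUANTMODE_0_INNER, 3)
#undef MLAPO_INNER_CASE

The #undef keeps the macro scoped to this switch, and the comma-containing template arguments are safe here because they appear in the macro body rather than in its argument list.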

@github-actions

github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.


@h1074112368 h1074112368 force-pushed the main branch 3 times, most recently from 584bc9e to 03aed97 Compare December 5, 2025 08:45
@MengqingCao MengqingCao added the ready and ready-for-test labels Dec 5, 2025
Signed-off-by: h1074112368 <[email protected]>
Signed-off-by: h1074112368 <[email protected]>
Signed-off-by: h1074112368 <[email protected]>
@github-actions

github-actions bot commented Dec 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions github-actions bot added the documentation and module:ops labels Dec 5, 2025
@github-actions github-actions bot removed the merge-conflicts, documentation, and module:ops labels Dec 5, 2025
Signed-off-by: h1074112368 <[email protected]>
@github-actions

github-actions bot commented Dec 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: h1074112368 <[email protected]>
@wangxiyuan wangxiyuan merged commit 7403399 into vllm-project:main Dec 6, 2025
15 of 17 checks passed
realliujiaxu pushed a commit to realliujiaxu/vllm-ascend that referenced this pull request Dec 6, 2025
### What this PR does / why we need it?
This PR adds support for a qdown output to the mlapo operation.
### Does this PR introduce _any_ user-facing change?
The mlapo operation gains a new enable_inner_out input flag.
### How was this patch tested?
CI passed with newly added and existing tests.


- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

---------

Signed-off-by: h1074112368 <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>