mlapo add qdown output #4701
Conversation
Code Review
This pull request adds a 'qdown' output to the mla_preprocess operation. The changes involve updating kernel definitions, host-side tiling logic, and PyTorch bindings. My review has identified a few issues: a critical bug in torch_binding.cpp where the same sin/cos tensors are used for two different purposes, leading to incorrect calculations. Additionally, there's a typo in a new struct member and significant code duplication in the kernel implementation which should be refactored for better maintainability. The cleanup of unused code in torch_binding.cpp is a good improvement.
csrc/torch_binding.cpp
Outdated
c10::optional<c10::string_view> cache_mode, c10::optional<c10::string_view> quant_mode, bool enable_inner_out, at::Tensor &q_out0,
at::Tensor &kv_cache_out0, at::Tensor &q_out1, at::Tensor &kv_cache_out1, at::Tensor &inner_out)
This function only accepts one pair of cos and sin tensors. However, the underlying mla_preprocess_impl kernel is called with these same tensors for two different sets of arguments (sin1/cos1 and sin2/cos2). The kernel implementation uses these for distinct purposes (one for RmsNormAndRopeConvergence1 and another for ropeFp16). This will lead to incorrect rotary embedding calculations. The function signature should be updated to accept two distinct pairs of sin/cos tensors, and the call to mla_preprocess_impl should be updated accordingly.
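As a minimal sketch of that suggestion (the wrapper name mla_preprocess, the parameter names cos1/sin1/cos2/sin2, and the abridged argument list are all assumed here for illustration, not the actual signature):

#include <ATen/ATen.h>

// Illustrative only: the real binding and impl take many more arguments; this shows
// just the shape of the two-pair interface and how the pairs would be forwarded.
void mla_preprocess_impl(const at::Tensor &sin1, const at::Tensor &cos1,
                         const at::Tensor &sin2, const at::Tensor &cos2 /*, ... */);

void mla_preprocess(const at::Tensor &cos1, const at::Tensor &sin1,
                    const at::Tensor &cos2, const at::Tensor &sin2 /*, ... */) {
    // sin1/cos1 feed RmsNormAndRopeConvergence1 and sin2/cos2 feed ropeFp16,
    // instead of passing the same single pair for both purposes.
    mla_preprocess_impl(sin1, cos1, sin2, cos2 /*, ... */);
}

The op registration and any Python-side wrapper would then need the second sin/cos pair threaded through as well.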
uint32_t rmsQuantMin2{0};

// Inner
uint32_t hiddtenState{0};
There is a typo in this new struct member: hiddtenState should be hiddenState.
case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER: {
    MLAPO_BF16_INNER::MLAOperation<__bf16, 0, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
        QuantMode::PER_TENSOR_ASYMM_QUANT>
        opBf16Cm0Qm0Inner(mlaTilingData, tiling);
    opBf16Cm0Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
        quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
        bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
        s1, s2, s3, s4, s5, innerOut);
    if ASCEND_IS_AIC {
        opBf16Cm0Qm0Inner.ProcessCube();
    }
    if ASCEND_IS_AIV {
        opBf16Cm0Qm0Inner.ProcessVector();
    }
    break;
}
case KEY_BF16_CACHEMODE_1_QUANTMODE_0_INNER: {
    MLAPO_BF16_INNER::MLAOperation<__bf16, 1, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
        QuantMode::PER_TENSOR_ASYMM_QUANT>
        opBf16Cm1Qm0Inner(mlaTilingData, tiling);
    opBf16Cm1Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
        quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
        bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
        s1, s2, s3, s4, s5, innerOut);
    if ASCEND_IS_AIC {
        opBf16Cm1Qm0Inner.ProcessCube();
    }
    if ASCEND_IS_AIV {
        opBf16Cm1Qm0Inner.ProcessVector();
    }
    break;
}
case KEY_BF16_CACHEMODE_3_QUANTMODE_0_INNER: {
    MLAPO_BF16_INNER::MLAOperation<__bf16, 3, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
        QuantMode::PER_TENSOR_ASYMM_QUANT>
        opBf16Cm3Qm0Inner(mlaTilingData, tiling);
    opBf16Cm3Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
        quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
        bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
        s1, s2, s3, s4, s5, innerOut);
    if ASCEND_IS_AIC {
        opBf16Cm3Qm0Inner.ProcessCube();
    }
    if ASCEND_IS_AIV {
        opBf16Cm3Qm0Inner.ProcessVector();
    }
    break;
}
The newly added case blocks for KEY_BF16_CACHEMODE_*_INNER contain a significant amount of duplicated code. The logic within each block is almost identical, differing only by the CacheMode template parameter (0, 1, or 3) passed to MLAPO_BF16_INNER::MLAOperation. This code should be refactored into a template helper function to improve maintainability and reduce redundancy. For example:
template <int CacheMode>
void process_inner_op(MlaTilingData& mlaTilingData, GM_ADDR tiling, /* other params */) {
    MLAPO_BF16_INNER::MLAOperation<__bf16, CacheMode, ...> op(mlaTilingData, tiling);
    op.Init(...);
    if ASCEND_IS_AIC {
        op.ProcessCube();
    }
    if ASCEND_IS_AIV {
        op.ProcessVector();
    }
}

// Then in the switch:
case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER: {
    process_inner_op<0>(...);
    break;
}
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?