
Conversation


@h1074112368 h1074112368 commented Dec 4, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a 'qdown' output to the mla_preprocess operation. The changes update the kernel definitions, host-side tiling logic, and PyTorch bindings. My review identified a few issues: a critical bug in torch_binding.cpp where the same sin/cos tensors are passed for two different purposes, leading to incorrect calculations; a typo in a new struct member; and significant code duplication in the kernel implementation that should be refactored for maintainability. The cleanup of unused code in torch_binding.cpp is a good improvement.

Comment on lines 118 to 184
c10::optional<c10::string_view> cache_mode, c10::optional<c10::string_view> quant_mode, bool enable_inner_out, at::Tensor &q_out0,
at::Tensor &kv_cache_out0, at::Tensor &q_out1, at::Tensor &kv_cache_out1, at::Tensor &inner_out)

critical

This function only accepts one pair of cos and sin tensors. However, the underlying mla_preprocess_impl kernel is called with these same tensors for two different sets of arguments (sin1/cos1 and sin2/cos2). The kernel implementation uses these for distinct purposes (one for RmsNormAndRopeConvergence1 and another for ropeFp16). This will lead to incorrect rotary embedding calculations. The function signature should be updated to accept two distinct pairs of sin/cos tensors, and the call to mla_preprocess_impl should be updated accordingly.
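
For illustration, a minimal sketch of the suggested fix, assuming the host-side parameters mirror the kernel's sin1/cos1/sin2/cos2 naming (the trimmed-down signatures and parameter names below are hypothetical, not the PR's actual code):

#include <ATen/ATen.h>

// Hypothetical forward declaration: only the rope-related parameters are
// shown; the real mla_preprocess_impl takes many more arguments.
void mla_preprocess_impl(const at::Tensor &sin1, const at::Tensor &cos1,
                         const at::Tensor &sin2, const at::Tensor &cos2 /* , ... */);

// Sketch of the corrected binding: two distinct sin/cos pairs, one per rope stage.
void mla_preprocess(const at::Tensor &cos1, const at::Tensor &sin1,  // for RmsNormAndRopeConvergence1
                    const at::Tensor &cos2, const at::Tensor &sin2   // for ropeFp16
                    /* , ...remaining arguments unchanged... */)
{
    // Forward both pairs instead of reusing a single pair twice.
    mla_preprocess_impl(sin1, cos1, sin2, cos2 /* , ... */);
}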

uint32_t rmsQuantMin2{0};

// Inner
uint32_t hiddtenState{0};

high

There is a typo in the newly added member hiddtenState. It should be hiddenState to be consistent with common naming conventions and avoid potential confusion.

Suggested change
uint32_t hiddtenState{0};
uint32_t hiddenState{0};

Comment on lines 222 to 269
case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER: {
MLAPO_BF16_INNER::MLAOperation<__bf16, 0, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
QuantMode::PER_TENSOR_ASYMM_QUANT>
opBf16Cm0Qm0Inner(mlaTilingData, tiling);
opBf16Cm0Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
s1, s2, s3, s4, s5, innerOut);
if ASCEND_IS_AIC {
opBf16Cm0Qm0Inner.ProcessCube();
}
if ASCEND_IS_AIV {
opBf16Cm0Qm0Inner.ProcessVector();
}
break;
}
case KEY_BF16_CACHEMODE_1_QUANTMODE_0_INNER: {
MLAPO_BF16_INNER::MLAOperation<__bf16, 1, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
QuantMode::PER_TENSOR_ASYMM_QUANT>
opBf16Cm1Qm0Inner(mlaTilingData, tiling);
opBf16Cm1Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
s1, s2, s3, s4, s5, innerOut);
if ASCEND_IS_AIC {
opBf16Cm1Qm0Inner.ProcessCube();
}
if ASCEND_IS_AIV {
opBf16Cm1Qm0Inner.ProcessVector();
}
break;
}
case KEY_BF16_CACHEMODE_3_QUANTMODE_0_INNER: {
MLAPO_BF16_INNER::MLAOperation<__bf16, 3, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
QuantMode::PER_TENSOR_ASYMM_QUANT>
opBf16Cm3Qm0Inner(mlaTilingData, tiling);
opBf16Cm3Qm0Inner.Init(hiddenState, quantScale1, quantOffset1, wdqkv, bias1, gamma2, beta2,
quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2, keycache, slotMapping, wuq,
bias2, wuk, descale1, descale2, ctkvScale, qnopeScale, q, keycacheOut, q2, keycacheOut2,
s1, s2, s3, s4, s5, innerOut);
if ASCEND_IS_AIC {
opBf16Cm3Qm0Inner.ProcessCube();
}
if ASCEND_IS_AIV {
opBf16Cm3Qm0Inner.ProcessVector();
}
break;
}

high

The newly added case blocks for KEY_BF16_CACHEMODE_*_INNER contain a significant amount of duplicated code. The logic within each block is almost identical, differing only by the CacheMode template parameter (0, 1, or 3) passed to MLAPO_BF16_INNER::MLAOperation. This code should be refactored into a template helper function to improve maintainability and reduce redundancy. For example:

template <int CacheMode, typename... Args>
inline void process_inner_op(MlaTilingData &mlaTilingData, GM_ADDR tiling, Args &&...args)
{
    MLAPO_BF16_INNER::MLAOperation<__bf16, CacheMode, DataFormat::NZ, DataFormat::NZ, DataFormat::ND,
                                   QuantMode::PER_TENSOR_ASYMM_QUANT>
        op(mlaTilingData, tiling);
    // Forward the full Init argument list so it is spelled out only once.
    op.Init(args...);
    if ASCEND_IS_AIC {
        op.ProcessCube();
    }
    if ASCEND_IS_AIV {
        op.ProcessVector();
    }
}

// Then in the switch, each case collapses to a single call:
case KEY_BF16_CACHEMODE_0_QUANTMODE_0_INNER: {
    process_inner_op<0>(mlaTilingData, tiling, hiddenState, quantScale1, quantOffset1, wdqkv, bias1,
                        gamma2, beta2, quantScale2, quantOffset2, gamma3, sin1, cos1, sin2, cos2,
                        keycache, slotMapping, wuq, bias2, wuk, descale1, descale2, ctkvScale,
                        qnopeScale, q, keycacheOut, q2, keycacheOut2, s1, s2, s3, s4, s5, innerOut);
    break;
}
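
Forwarding the arguments variadically keeps the long Init parameter list in one place, so adding a new cache mode only requires one new one-line case. This assumes, as in the three existing blocks, that the remaining MLAOperation template arguments are identical across cache modes.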

@h1074112368 h1074112368 force-pushed the main branch 2 times, most recently from d269983 to e53341e on December 4, 2025 08:24

github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
