
@yiz-liu (Collaborator) commented Nov 28, 2025

What this PR does / why we need it?

Replaces several discrete operations in the SFA forward pass with a single call to the fused `mla_preprocess` custom operator. This operator combines Q/K/V projection, RoPE application, and KV cache updates into one kernel.
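For intuition, here is a pure-PyTorch reference of the three steps the fused operator collapses into one kernel launch. This is a simplified sketch, not the actual `mla_preprocess` implementation or signature; real MLA applies RoPE only to the rotary sub-dimensions of Q/K, which is omitted here for brevity.

```python
import torch

def rotate_half(t: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: negate and swap the two halves of the last dim.
    t1, t2 = t.chunk(2, dim=-1)
    return torch.cat((-t2, t1), dim=-1)

def apply_rope(t: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    return t * cos + rotate_half(t) * sin

def mla_preprocess_reference(x, w_q, w_kv, cos, sin, kv_cache, slots):
    # The three discrete steps the fused kernel merges:
    q = x @ w_q                         # 1a. Q projection
    kv = x @ w_kv                       # 1b. compressed KV projection
    q = apply_rope(q, cos, sin)         # 2a. RoPE on Q
    kv = apply_rope(kv, cos, sin)       # 2b. RoPE on K (rotary part)
    kv_cache.index_copy_(0, slots, kv)  # 3.  scatter into paged cache slots
    return q, kv
```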

A new weight-processing method transforms weights into the specific layout required by the fused operator. This change aims to improve performance by reducing kernel-launch overhead.
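A hedged sketch of the kind of transformation such a method performs; the function name and the concrete layout are illustrative assumptions, since the actual layout expected by `mla_preprocess` is operator-specific.

```python
import torch

def process_weights_for_fused_op(w_q: torch.Tensor, w_kv: torch.Tensor) -> torch.Tensor:
    # Illustrative only: fuse the projection weights into one contiguous
    # buffer in an [out_features, in_features] layout that a single
    # kernel can stream through.
    w_fused = torch.cat([w_q, w_kv], dim=1).t().contiguous()
    # On Ascend NPUs, fused matmul kernels often additionally expect a
    # fractal weight format (e.g., cast via torch_npu.npu_format_cast);
    # whether that applies here depends on the kernel (assumption).
    return w_fused
```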

Additionally, the condition for allocating RoPE caches is relaxed to support MLA in modes other than just full decode.
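A minimal sketch of the relaxed gating; the attribute names below are hypothetical, not the actual vllm-ascend fields.

```python
def needs_rope_cache(attn_metadata) -> bool:
    # Before: RoPE caches were allocated only on the full-decode path, e.g.
    #   return attn_metadata.attn_state == AttnState.DECODE_ONLY
    # After: any MLA execution mode (prefill, mixed, decode) allocates them.
    return attn_metadata.uses_mla
```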

Does this PR introduce any user-facing change?

None

How was this patch tested?

None


Signed-off-by: Yizhou Liu <[email protected]>
@github-actions commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message from the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
