[kernel] add AscendC op: lightning_indexer and sparse_flash_attention #4625
Conversation
Code Review
This pull request introduces two new AscendC operators, lightning_indexer and sparse_flash_attention, with extensive changes including kernel implementations, host-side logic, and PyTorch bindings. While the overall structure is sound, I've identified several critical and high-severity issues. These include a potential linker error from an undefined method, data truncation risks due to type mismatches, unsafe const_cast usage leading to potential undefined behavior, and inconsistent shape inference logic that could cause runtime errors. Addressing these issues is crucial for the stability and correctness of the new operators.
```cpp
ge::graphStatus GetAttenMaskInfo();
ge::graphStatus GetActualSeqInfo();
```

This appears to be the "potential linker error from an undefined method" flagged in the summary: one of these methods is declared here but seems to have no corresponding definition, which will cause a linker failure once the symbol is referenced. Either provide the definition or remove the unused declaration.
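If that is the cause, the fix is a matching out-of-line definition. A minimal sketch, assuming the declaring class is the tiling class (the class name and body here are hypothetical, not code from the PR):

```cpp
// Hypothetical sketch: an out-of-line definition matching the declared
// signature, so the symbol exists at link time. The class name
// LightningIndexerTiling and the body are assumptions.
ge::graphStatus LightningIndexerTiling::GetActualSeqInfo()
{
    // ... read the actual-sequence-length tensors and fill tiling fields ...
    return ge::GRAPH_SUCCESS;
}
```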
```cpp
output_size = {query.size(DIM_0), query.size(DIM_1), key.size(DIM_2), sparse_count};
} else {
    int n_dim_index = 0;
    n_dim_index = (key_layout_str == "TND") ? DIM_1 : DIM_2;
```

The shape inference logic for the TND layout in this meta function is inconsistent with the shape inference logic in csrc/lightning_indexer/op_host/lightning_indexer_proto.cpp. Here, n_dim_index is determined by (key_layout_str == "TND"), while in proto.cpp it's based on (inputLayoutKeyPtrStr == "PA_BSND"). This discrepancy can lead to shape mismatch errors at runtime. The logic should be unified to match the operator's shape inference.

Suggested change:

```cpp
n_dim_index = (key_layout_str == "PA_BSND") ? DIM_2 : DIM_1;
```
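One way to guarantee the two sites stay in sync is to derive the index from a single shared helper; a minimal sketch with illustrative names (the helper does not exist in the patch):

```cpp
#include <string>

// Illustrative helper mirroring the review's suggested predicate: the key
// num-heads axis is dimension 2 for "PA_BSND" and dimension 1 otherwise
// (e.g. "TND"). Both the PyTorch meta function and the proto shape
// inference would call this, so the mapping lives in exactly one place.
inline int KeyNumHeadsDimIndex(const std::string &key_layout)
{
    return (key_layout == "PA_BSND") ? 2 : 1;
}
```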
```cpp
    .AutoContiguous();
this->Input("actual_seq_lengths_query")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
```

```cpp
    .AutoContiguous();
this->Input("actual_seq_lengths_key")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
```

Same redundancy as flagged on the sparse_flash_attention inputs below: these are single-type inputs, so repeating ge::DT_INT32 in DataType adds nothing. DataTypeList({ge::DT_INT32}) states the intent once.
```cpp
TILING_DATA_FIELD_DEF(uint32_t, n2Size)
TILING_DATA_FIELD_DEF(uint32_t, gSize)
TILING_DATA_FIELD_DEF(uint32_t, s1Size)
TILING_DATA_FIELD_DEF(uint32_t, s2Size)
```

There's a type mismatch for s2Size. It's defined as int64_t in LITilingInfo (line 119) and LIInfoParser (line 193), but as uint32_t here in LITilingData. This can truncate the value and produce incorrect behavior whenever s2Size exceeds the uint32_t limit. The type should be consistent across all definitions, preferably int64_t or uint64_t.

Suggested change:

```cpp
TILING_DATA_FIELD_DEF(int64_t, s2Size)
```
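The risk is easy to demonstrate in isolation; a self-contained sketch with a made-up sequence length:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Hypothetical key sequence length that exceeds UINT32_MAX.
    int64_t s2Size = 5000000000LL;

    // Storing it into a uint32_t tiling field keeps only the low 32 bits.
    uint32_t truncated = static_cast<uint32_t>(s2Size);

    // Prints: original: 5000000000, truncated: 705032704
    std::printf("original: %lld, truncated: %u\n",
                static_cast<long long>(s2Size), truncated);
    return 0;
}
```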
```cpp
if (coreIdx == coreNum - 1 && info.bN2End != constInfo.batchSize -1) {
    info.bN2End = constInfo.batchSize -1;
    info.gS1End = 0;
```

The comparison info.bN2End != constInfo.batchSize - 1 mixes integer types (uint32_t and uint64_t), so C++ promotion rules can produce surprising results. Additionally, if constInfo.batchSize is 0, constInfo.batchSize - 1 underflows. The surrounding logic appears to keep this path from being taken when batchSize is 0, but it's safer to use consistent types and handle the zero case explicitly.

Suggested change:

```cpp
if (coreIdx == coreNum - 1 && constInfo.batchSize > 0 && info.bN2End != constInfo.batchSize - 1) {
    info.bN2End = constInfo.batchSize - 1;
```
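Both pitfalls can be reproduced in a few lines; a standalone sketch, independent of the patch:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t batchSize = 0;  // the edge case the review warns about
    uint32_t bN2End = 0;

    // batchSize - 1 underflows to UINT64_MAX, and bN2End is promoted to
    // uint64_t before comparing, so the condition is unexpectedly true.
    if (bN2End != batchSize - 1) {
        std::printf("condition fires: 0 != %llu\n",
                    static_cast<unsigned long long>(batchSize - 1));
    }
    return 0;
}
```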
```cpp
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("block_table")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_query")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_kv")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
```

The DataType for sparse_indices, block_table, actual_seq_lengths_query, and actual_seq_lengths_kv repeats ge::DT_INT32 redundantly. For single-type inputs, DataTypeList({ge::DT_INT32}) should be used for clarity and consistency.

Suggested change:
```cpp
    .DataTypeList({ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("block_table")
    .ParamType(OPTIONAL)
    .DataTypeList({ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_query")
    .ParamType(OPTIONAL)
    .DataTypeList({ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_kv")
    .ParamType(OPTIONAL)
    .DataTypeList({ge::DT_INT32})
```

```cpp
char *query_layout_ptr = const_cast<char *>(query_layout_str.c_str());
char *key_layout_ptr = const_cast<char *>(key_layout_str.c_str());
```
Using const_cast to strip the const qualifier from the pointer returned by c_str() is unsafe: if the aclnnLightningIndexer function were to modify the string, the result would be undefined behavior. Pass a mutable copy of the string instead, for example via a std::vector<char>.

Suggested change:
```cpp
std::vector<char> query_layout_cstr(query_layout_str.begin(), query_layout_str.end());
query_layout_cstr.push_back('\0');
std::vector<char> key_layout_cstr(key_layout_str.begin(), key_layout_str.end());
key_layout_cstr.push_back('\0');
char *query_layout_ptr = query_layout_cstr.data();
char *key_layout_ptr = key_layout_cstr.data();
```
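If the project builds as C++17 or later, the copy can also be avoided: non-const std::string::data() returns a mutable, null-terminated char* (this assumes the callee does not write past the terminator):

```cpp
// C++17 and later: data() on a non-const std::string is mutable and
// null-terminated, so no const_cast or extra copy is needed.
char *query_layout_ptr = query_layout_str.data();
char *key_layout_ptr = key_layout_str.data();
```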
```cpp
char *layout_query_ptr = const_cast<char *>(layout_query_str.c_str());
char *layout_kv_ptr = const_cast<char *>(layout_kv_str.c_str());
```

As in the npu_lightning_indexer binding, the const_cast here is unsafe and can lead to undefined behavior. Use a mutable copy of the string instead.

Suggested change:
```cpp
std::vector<char> layout_query_cstr(layout_query_str.begin(), layout_query_str.end());
layout_query_cstr.push_back('\0');
std::vector<char> layout_kv_cstr(layout_kv_str.begin(), layout_kv_str.end());
layout_kv_cstr.push_back('\0');
char *layout_query_ptr = layout_query_cstr.data();
char *layout_kv_ptr = layout_kv_cstr.data();
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This PR greatly reduces the inference latency of the two core fusion operators in the DSA module, leading to significant benefits …
What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and sparse_flash_attention to boost the execution performance of the DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to the vllm-ascend framework.
Does this PR introduce any user-facing change?
No (only underlying operator optimizations, with no user-facing changes)
How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2