Skip to content

Conversation

@MingYang119
Copy link
Contributor

@MingYang119 MingYang119 commented Dec 2, 2025

What this PR does / why we need it?

Provide high-performance AscendC operators lightning_indexer and sparse_flash_attention to boost the execution performance of the DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to vllm-ascend framework.

Does this PR introduce any user-facing change?

No (only underlying operator optimizations, with no user-facing changes)

How was this patch tested?

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new AscendC operators, lightning_indexer and sparse_flash_attention, with extensive changes including kernel implementations, host-side logic, and PyTorch bindings. While the overall structure is sound, I've identified several critical and high-severity issues. These include a potential linker error from an undefined method, data truncation risks due to type mismatches, unsafe const_cast usage leading to potential undefined behavior, and inconsistent shape inference logic that could cause runtime errors. Addressing these issues is crucial for the stability and correctness of the new operators.

Comment on lines +176 to +171
ge::graphStatus GetAttenMaskInfo();
ge::graphStatus GetActualSeqInfo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The methods GetAttenMaskInfo() and GetActualSeqInfo() are declared in the LIInfoParser class but are not defined in lightning_indexer_tiling.cpp. This will cause a linker error during compilation. If these methods are not used, they should be removed to resolve this.

output_size = {query.size(DIM_0), query.size(DIM_1), key.size(DIM_2), sparse_count};
} else {
int n_dim_index = 0;
n_dim_index = (key_layout_str == "TND") ? DIM_1 : DIM_2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The shape inference logic for the TND layout in this meta function is inconsistent with the shape inference logic in csrc/lightning_indexer/op_host/lightning_indexer_proto.cpp. Here, n_dim_index is determined by (key_layout_str == "TND"), while in proto.cpp it's based on (inputLayoutKeyPtrStr == "PA_BSND"). This discrepancy can lead to shape mismatch errors at runtime. The logic should be unified to match the operator's shape inference.

        n_dim_index = (key_layout_str == "PA_BSND") ? 2 : 1;

.AutoContiguous();
this->Input("actual_seq_lengths_query")
.ParamType(OPTIONAL)
.DataType({ge::DT_INT32, ge::DT_INT32})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The DataType for actual_seq_lengths_query is specified as {ge::DT_INT32, ge::DT_INT32}, which is redundant. For consistency with other single-type optional inputs like block_table, it's better to use DataTypeList({ge::DT_INT32}).

            .DataTypeList({ge::DT_INT32})

.AutoContiguous();
this->Input("actual_seq_lengths_key")
.ParamType(OPTIONAL)
.DataType({ge::DT_INT32, ge::DT_INT32})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to actual_seq_lengths_query, the DataType for actual_seq_lengths_key is specified as {ge::DT_INT32, ge::DT_INT32}, which is redundant. Please use DataTypeList({ge::DT_INT32}) for consistency and clarity.

            .DataTypeList({ge::DT_INT32})

TILING_DATA_FIELD_DEF(uint32_t, n2Size)
TILING_DATA_FIELD_DEF(uint32_t, gSize)
TILING_DATA_FIELD_DEF(uint32_t, s1Size)
TILING_DATA_FIELD_DEF(uint32_t, s2Size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There's a type mismatch for s2Size. It's defined as int64_t in LITilingInfo (line 119) and LIInfoParser (line 193), but as uint32_t here in LITilingData. This can lead to data truncation and incorrect behavior if s2Size exceeds the uint32_t limit. The type should be consistent across all definitions, preferably int64_t or uint64_t to prevent potential overflow.

TILING_DATA_FIELD_DEF(int64_t, s2Size)

Comment on lines +324 to +317
if (coreIdx == coreNum - 1 && info.bN2End != constInfo.batchSize -1) {
info.bN2End = constInfo.batchSize -1;
info.gS1End = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The comparison info.bN2End != constInfo.batchSize - 1 involves different integer types (uint32_t and uint64_t), which can lead to unexpected behavior due to C++ type promotion rules. Additionally, if constInfo.batchSize is 0, constInfo.batchSize - 1 will underflow. Although the surrounding logic seems to prevent this path from being taken when batchSize is 0, it's safer to use consistent types and explicitly handle the zero case to prevent potential bugs.

                        if (coreIdx == coreNum - 1 && constInfo.batchSize > 0 && info.bN2End != constInfo.batchSize - 1) {
                            info.bN2End = constInfo.batchSize - 1;

Comment on lines +40 to +56
.DataType({ge::DT_INT32, ge::DT_INT32})
.Format({ge::FORMAT_ND, ge::FORMAT_ND})
.AutoContiguous();
this->Input("block_table")
.ParamType(OPTIONAL)
.DataType({ge::DT_INT32, ge::DT_INT32})
.Format({ge::FORMAT_ND, ge::FORMAT_ND})
.AutoContiguous();
this->Input("actual_seq_lengths_query")
.ParamType(OPTIONAL)
.DataType({ge::DT_INT32, ge::DT_INT32})
.Format({ge::FORMAT_ND, ge::FORMAT_ND})
.AutoContiguous();
this->Input("actual_seq_lengths_kv")
.ParamType(OPTIONAL)
.DataType({ge::DT_INT32, ge::DT_INT32})
.Format({ge::FORMAT_ND, ge::FORMAT_ND})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The DataType for sparse_indices, block_table, actual_seq_lengths_query, and actual_seq_lengths_kv is specified with a redundant ge::DT_INT32. For single-type inputs, DataTypeList({ge::DT_INT32}) should be used for clarity and consistency.

            .DataTypeList({ge::DT_INT32})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND})
            .AutoContiguous();
        this->Input("block_table")
            .ParamType(OPTIONAL)
            .DataTypeList({ge::DT_INT32})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND})
            .AutoContiguous();
        this->Input("actual_seq_lengths_query")
            .ParamType(OPTIONAL)
            .DataTypeList({ge::DT_INT32})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND})
            .AutoContiguous();
        this->Input("actual_seq_lengths_kv")
            .ParamType(OPTIONAL)
            .DataTypeList({ge::DT_INT32})

Comment on lines +657 to +659
char *query_layout_ptr = const_cast<char *>(query_layout_str.c_str());
char *key_layout_ptr = const_cast<char *>(key_layout_str.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using const_cast to remove the const qualifier from the pointer returned by c_str() is unsafe. If the aclnnLightningIndexer function were to modify the string, it would result in undefined behavior. To ensure safety, a mutable copy of the string should be created and passed to the function, for example by using a std::vector<char>.

    std::vector<char> query_layout_cstr(layout_query.begin(), layout_query.end());
    query_layout_cstr.push_back('\0');
    std::vector<char> key_layout_cstr(layout_key.begin(), layout_key.end());
    key_layout_cstr.push_back('\0');
    char *query_layout_ptr = query_layout_cstr.data();
    char *key_layout_ptr = key_layout_cstr.data();

Comment on lines +696 to +698
char *layout_query_ptr = const_cast<char *>(layout_query_str.c_str());
char *layout_kv_ptr = const_cast<char *>(layout_kv_str.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the npu_lightning_indexer binding, using const_cast here is unsafe and can lead to undefined behavior. A mutable copy of the string should be used instead.

    std::vector<char> layout_query_cstr(layout_query.begin(), layout_query.end());
    layout_query_cstr.push_back('\0');
    std::vector<char> layout_kv_cstr(layout_kv.begin(), layout_kv.end());
    layout_kv_cstr.push_back('\0');
    char *layout_query_ptr = layout_query_cstr.data();
    char *layout_kv_ptr = layout_kv_cstr.data();

@github-actions
Copy link

github-actions bot commented Dec 2, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@MingYang119 MingYang119 force-pushed the feat/new-add-li-sfa branch 6 times, most recently from 0722560 to eae4a14 Compare December 2, 2025 09:45
@rjg-lyh rjg-lyh added ready read for review ready-for-test start test by label for PR labels Dec 2, 2025
@MingYang119 MingYang119 force-pushed the feat/new-add-li-sfa branch 2 times, most recently from 2451987 to 45b03c5 Compare December 2, 2025 11:01
@rjg-lyh
Copy link
Collaborator

rjg-lyh commented Dec 2, 2025

This PR greatly alleviates the inference latency of the two core fusion operators in the DSA module, leading to significant benefits f
or prefill. Additionally, it improves the usability of the repository by eliminating the need to install external operator packages.

@wangxiyuan wangxiyuan merged commit 18b90b5 into vllm-project:main Dec 3, 2025
22 checks passed
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
…vllm-project#4625)

### What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and
sparse_flash_attention to boost the execution performance of the
DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to
vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes)

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…vllm-project#4625)

### What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and
sparse_flash_attention to boost the execution performance of the
DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to
vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes)

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…vllm-project#4625)

### What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and
sparse_flash_attention to boost the execution performance of the
DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to
vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes)

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
…vllm-project#4625)

### What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and
sparse_flash_attention to boost the execution performance of the
DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to
vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes)

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants