[kernel] add AscendC op: lightning_indexer and sparse_flash_attention #4625
Conversation
Code Review
This pull request introduces two new AscendC operators, lightning_indexer and sparse_flash_attention, with extensive changes including kernel implementations, host-side logic, and PyTorch bindings. While the overall structure is sound, I've identified several critical and high-severity issues. These include a potential linker error from an undefined method, data truncation risks due to type mismatches, unsafe const_cast usage leading to potential undefined behavior, and inconsistent shape inference logic that could cause runtime errors. Addressing these issues is crucial for the stability and correctness of the new operators.
```cpp
ge::graphStatus GetAttenMaskInfo();
ge::graphStatus GetActualSeqInfo();
```

This appears to be the "potential linker error from an undefined method" flagged in the summary: one of these methods is declared here but seems to have no corresponding definition, which will cause a linker failure once the symbol is referenced. Either provide the definition or remove the unused declaration.
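If that is the cause, the fix is a matching out-of-line definition. A minimal sketch, assuming the declaring class is the tiling class (the class name and body here are hypothetical, not code from the PR):

```cpp
// Hypothetical sketch: an out-of-line definition matching the declared
// signature, so the symbol exists at link time. The class name
// LightningIndexerTiling and the body are assumptions.
ge::graphStatus LightningIndexerTiling::GetActualSeqInfo()
{
    // ... read the actual-sequence-length tensors and fill tiling fields ...
    return ge::GRAPH_SUCCESS;
}
```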
```cpp
output_size = {query.size(DIM_0), query.size(DIM_1), key.size(DIM_2), sparse_count};
} else {
    int n_dim_index = 0;
    n_dim_index = (key_layout_str == "TND") ? DIM_1 : DIM_2;
```

The shape inference logic for the TND layout in this meta function is inconsistent with the shape inference logic in csrc/lightning_indexer/op_host/lightning_indexer_proto.cpp. Here, n_dim_index is determined by (key_layout_str == "TND"), while in proto.cpp it's based on (inputLayoutKeyPtrStr == "PA_BSND"). This discrepancy can lead to shape mismatch errors at runtime. The logic should be unified to match the operator's shape inference.

Suggested change:

```cpp
n_dim_index = (key_layout_str == "PA_BSND") ? DIM_2 : DIM_1;
```
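One way to guarantee the two sites stay in sync is to derive the index from a single shared helper; a minimal sketch with illustrative names (the helper does not exist in the patch):

```cpp
#include <string>

// Illustrative helper mirroring the review's suggested predicate: the key
// num-heads axis is dimension 2 for "PA_BSND" and dimension 1 otherwise
// (e.g. "TND"). Both the PyTorch meta function and the proto shape
// inference would call this, so the mapping lives in exactly one place.
inline int KeyNumHeadsDimIndex(const std::string &key_layout)
{
    return (key_layout == "PA_BSND") ? 2 : 1;
}
```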
```cpp
    .AutoContiguous();
this->Input("actual_seq_lengths_query")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
```

```cpp
    .AutoContiguous();
this->Input("actual_seq_lengths_key")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
```

Same redundancy as flagged on the sparse_flash_attention inputs below: these are single-type inputs, so repeating ge::DT_INT32 in DataType adds nothing. DataTypeList({ge::DT_INT32}) states the intent once.
```cpp
TILING_DATA_FIELD_DEF(uint32_t, n2Size)
TILING_DATA_FIELD_DEF(uint32_t, gSize)
TILING_DATA_FIELD_DEF(uint32_t, s1Size)
TILING_DATA_FIELD_DEF(uint32_t, s2Size)
```

There's a type mismatch for s2Size. It's defined as int64_t in LITilingInfo (line 119) and LIInfoParser (line 193), but as uint32_t here in LITilingData. This can truncate the value and produce incorrect behavior whenever s2Size exceeds the uint32_t limit. The type should be consistent across all definitions, preferably int64_t or uint64_t.

Suggested change:

```cpp
TILING_DATA_FIELD_DEF(int64_t, s2Size)
```
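The risk is easy to demonstrate in isolation; a self-contained sketch with a made-up sequence length:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Hypothetical key sequence length that exceeds UINT32_MAX.
    int64_t s2Size = 5000000000LL;

    // Storing it into a uint32_t tiling field keeps only the low 32 bits.
    uint32_t truncated = static_cast<uint32_t>(s2Size);

    // Prints: original: 5000000000, truncated: 705032704
    std::printf("original: %lld, truncated: %u\n",
                static_cast<long long>(s2Size), truncated);
    return 0;
}
```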
```cpp
if (coreIdx == coreNum - 1 && info.bN2End != constInfo.batchSize -1) {
    info.bN2End = constInfo.batchSize -1;
    info.gS1End = 0;
```

The comparison info.bN2End != constInfo.batchSize - 1 mixes integer types (uint32_t and uint64_t), so C++ promotion rules can produce surprising results. Additionally, if constInfo.batchSize is 0, constInfo.batchSize - 1 underflows. The surrounding logic appears to keep this path from being taken when batchSize is 0, but it's safer to use consistent types and handle the zero case explicitly.

Suggested change:

```cpp
if (coreIdx == coreNum - 1 && constInfo.batchSize > 0 && info.bN2End != constInfo.batchSize - 1) {
    info.bN2End = constInfo.batchSize - 1;
```
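Both pitfalls can be reproduced in a few lines; a standalone sketch, independent of the patch:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t batchSize = 0;  // the edge case the review warns about
    uint32_t bN2End = 0;

    // batchSize - 1 underflows to UINT64_MAX, and bN2End is promoted to
    // uint64_t before comparing, so the condition is unexpectedly true.
    if (bN2End != batchSize - 1) {
        std::printf("condition fires: 0 != %llu\n",
                    static_cast<unsigned long long>(batchSize - 1));
    }
    return 0;
}
```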
```cpp
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("block_table")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_query")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_kv")
    .ParamType(OPTIONAL)
    .DataType({ge::DT_INT32, ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
```

The DataType for sparse_indices, block_table, actual_seq_lengths_query, and actual_seq_lengths_kv repeats ge::DT_INT32 redundantly. For single-type inputs, DataTypeList({ge::DT_INT32}) should be used for clarity and consistency.

Suggested change:
```cpp
    .DataTypeList({ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("block_table")
    .ParamType(OPTIONAL)
    .DataTypeList({ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_query")
    .ParamType(OPTIONAL)
    .DataTypeList({ge::DT_INT32})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND})
    .AutoContiguous();
this->Input("actual_seq_lengths_kv")
    .ParamType(OPTIONAL)
    .DataTypeList({ge::DT_INT32})
```

```cpp
char *query_layout_ptr = const_cast<char *>(query_layout_str.c_str());
char *key_layout_ptr = const_cast<char *>(key_layout_str.c_str());
```
Using const_cast to strip the const qualifier from the pointer returned by c_str() is unsafe: if the aclnnLightningIndexer function were to modify the string, the result would be undefined behavior. Pass a mutable copy of the string instead, for example via a std::vector<char>.

Suggested change:
```cpp
std::vector<char> query_layout_cstr(query_layout_str.begin(), query_layout_str.end());
query_layout_cstr.push_back('\0');
std::vector<char> key_layout_cstr(key_layout_str.begin(), key_layout_str.end());
key_layout_cstr.push_back('\0');
char *query_layout_ptr = query_layout_cstr.data();
char *key_layout_ptr = key_layout_cstr.data();
```
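If the project builds as C++17 or later, the copy can also be avoided: non-const std::string::data() returns a mutable, null-terminated char* (this assumes the callee does not write past the terminator):

```cpp
// C++17 and later: data() on a non-const std::string is mutable and
// null-terminated, so no const_cast or extra copy is needed.
char *query_layout_ptr = query_layout_str.data();
char *key_layout_ptr = key_layout_str.data();
```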
```cpp
char *layout_query_ptr = const_cast<char *>(layout_query_str.c_str());
char *layout_kv_ptr = const_cast<char *>(layout_kv_str.c_str());
```

As in the npu_lightning_indexer binding, the const_cast here is unsafe and can lead to undefined behavior. Use a mutable copy of the string instead.

Suggested change:
```cpp
std::vector<char> layout_query_cstr(layout_query_str.begin(), layout_query_str.end());
layout_query_cstr.push_back('\0');
std::vector<char> layout_kv_cstr(layout_kv_str.begin(), layout_kv_str.end());
layout_kv_cstr.push_back('\0');
char *layout_query_ptr = layout_query_cstr.data();
char *layout_kv_ptr = layout_kv_cstr.data();
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This PR greatly reduces the inference latency of the two core fusion operators in the DSA module, leading to significant benefits …
What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and sparse_flash_attention to boost the execution performance of the DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to the vllm-ascend framework.
Does this PR introduce any user-facing change?
No (only underlying operator optimizations, with no user-facing changes)
How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2