[CPU] Implement X-Attention for intel CPU #32086
base: master
Conversation
zhangYiIntel left a comment
The current design only considers a single batch; with multiple batches or chunked inputs, e.g. continuous batching, this PR cannot handle them correctly. Maybe we need a TODO to track this.
// Instead of writing -inf directly into scores, build a softmax mask (0/-inf) and pass it to the kernel
DATA_TYPE* softmax_mask = nullptr;
std::vector<DATA_TYPE> softmax_mask_storage;
Please consider making softmax_mask_storage a class member of MHAHelper.
We have optimized the sparse softmax mask generation in the updated commit; no extra sparse softmax mask generation is needed anymore.
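For context, here is a minimal sketch of the additive 0/-inf mask idea referenced in the quoted comment, assuming a block-granular, row-major keep-mask; the function and parameter names are hypothetical, and, per the reply above, the updated commit no longer materializes such a buffer:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Build an additive softmax mask: 0 for blocks the kernel should attend to,
// -inf for pruned blocks. The mask is added to the attention logits before
// softmax, so pruned blocks receive zero weight.
std::vector<float> build_softmax_mask(const std::vector<bool>& block_keep,  // [q_blocks * k_blocks]
                                      size_t q_blocks,
                                      size_t k_blocks) {
    std::vector<float> mask(q_blocks * k_blocks, 0.0f);
    for (size_t i = 0; i < mask.size(); ++i) {
        if (!block_keep[i]) {
            mask[i] = -std::numeric_limits<float>::infinity();
        }
    }
    return mask;
}
```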
const PlainTensor& block_indices_begins,
const PlainTensor& alibi_slopes,
const PlainTensor& score_aggregation_window) {
const PlainTensor& score_aggregation_window,
Please consider removing the default value for sparse_attention_mask, to avoid misuse by outer functions.
Done
// xattention_block_size = 128;

// If to support second token sparse attention, need generate sparse mask after concat_pastkv
if (xattention_threshold && q.size(0) > 1) {
The prefill-phase check here is not robust; please follow the code in
if (past_lens.m_dims[0] >= nthr || _workitems.get_reorder_max_batch_size() > 0) {
Hi @zhangYiIntel: from my understanding, this check, if (past_lens.m_dims[0] >= nthr || _workitems.get_reorder_max_batch_size() > 0), covers two conditions: '_workitems.get_reorder_max_batch_size() > 0' for the first token, and 'past_lens.m_dims[0] >= nthr' for the second token with a long past_kv.
Since our current CPU X-Attention implementation only supports first-token cases, it seems we can use 'q.size(0) > 1' (equivalent to '_workitems.get_reorder_max_batch_size() > 0') for the X-Attention check.
Force-pushed from 7dc5202 to 51840cb
We have added this limitation/TODO to ticket CVS-171057.
zhangYiIntel left a comment
LGTM, please fix the minor changes.
Pull Request Overview
This PR implements X-Attention for Intel CPU to accelerate long prompt inference by generating sparse attention blocks in a pre-inference stage before PagedAttention execution.
Key Changes:
- Added X-Attention computation kernel that estimates sparse attention block masks based on query-key similarity scores (see the sketch after this list)
- Integrated sparse mask filtering into the existing PagedAttention execution pipeline
- Extended softmax kernel to support sparse masking with block-level granularity
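To make the estimation step concrete, below is a minimal, self-contained sketch of the thresholded block-selection idea, assuming per-(query block, key block) importance scores in a row-major buffer; select_blocks and the greedy cumulative-softmax policy are illustrative assumptions, not the PR's BRGEMM kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// For each query block, keep the smallest set of key blocks whose softmax
// mass reaches `threshold` (e.g. 0.9); all other blocks are pruned.
std::vector<bool> select_blocks(const std::vector<float>& scores,  // [q_blocks * k_blocks]
                                size_t q_blocks,
                                size_t k_blocks,
                                float threshold) {
    std::vector<bool> keep(q_blocks * k_blocks, false);
    std::vector<float> prob(k_blocks);
    std::vector<size_t> order(k_blocks);
    for (size_t q = 0; q < q_blocks; ++q) {
        const float* row = scores.data() + q * k_blocks;
        // Row-wise softmax over key blocks.
        float row_max = *std::max_element(row, row + k_blocks);
        float sum = 0.0f;
        for (size_t k = 0; k < k_blocks; ++k) {
            prob[k] = std::exp(row[k] - row_max);
            sum += prob[k];
        }
        for (auto& p : prob) {
            p /= sum;
        }
        // Greedily keep the highest-probability blocks until the cumulative
        // mass reaches the threshold.
        std::iota(order.begin(), order.end(), size_t{0});
        std::sort(order.begin(), order.end(), [&](size_t a, size_t b) { return prob[a] > prob[b]; });
        float acc = 0.0f;
        for (size_t k : order) {
            keep[q * k_blocks + k] = true;
            acc += prob[k];
            if (acc >= threshold) {
                break;
            }
        }
    }
    return keep;
}
```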
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| xattention_test.cpp | Unit tests for X-Attention functionality comparing against reference implementation |
| CMakeLists.txt | Added xattention_test.cpp exclusion for non-x86_64 builds and included reference headers |
| xattention.hpp | Core X-Attention implementation with BRGEMM-based attention estimation and block selection |
| softmax_kernel.hpp | Extended softmax kernel with sparse mask support via new template parameter |
| executor_pa.cpp | Integration of X-Attention into PagedAttention execution pipeline with sparse mask filtering |
    (S * stride * K_H)  // src stride
);
# else
OPENVINO_THROW("xattention: bf16 needs avx512+ hardware.");
Copilot AI, Oct 21, 2025
Error message uses inconsistent punctuation. Remove the period at the end for consistency with other error messages in the codebase, such as line 311.
Suggested change:
- OPENVINO_THROW("xattention: bf16 needs avx512+ hardware.");
+ OPENVINO_THROW("xattention: bf16 needs avx512+ hardware");
@mangguo321, could you please resolve merge conflicts?
…ftmax_kernel(float) to support sparse mask
… implementation of scale_add2_reduce_max not only as a reference but also as a tail processing step (primarily for ARM).
Force-pushed from 40e7ec4 to 2939bc6
inline void transpose_tailx16_kernel(TDST* dst,
                                     TSRC* src,
                                     size_t n_cnt,
                                     size_t k_cnt,
                                     size_t dst_stride,
                                     size_t src_stride) {
    for (size_t i = 0; i < n_cnt; i++) {
        for (size_t j = 0; j < k_cnt; j++) {
            dst[j * dst_stride + i] = static_cast<TDST>(src[j + i * src_stride]);
        }
    }
}

template <typename TDST,
          ov::element::Type_t SRC_PREC,
          std::enable_if_t<(none_of(SRC_PREC, ov::element::i8, ov::element::u8, ov::element::u4)), bool> = true>
void transpose_16NxK(TDST* dst,
                     void* src,
                     const size_t N,
                     const size_t K,
                     const size_t block_size,
                     const size_t dst_stride,
                     const size_t src_stride) {
We have pretty similar templates in src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp. Can't we extract them into a common header transpose.hpp and reuse them across implementations?
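For illustration, a sketch of what such a shared transpose.hpp could expose; the header name comes from the comment above, while the namespace and the generic scalar fallback shown here are assumptions (the SIMD 16xK kernels would keep their existing platform-specific specializations behind the same interface):

```cpp
#pragma once
#include <cstddef>

namespace ov::intel_cpu {

// Plain element-wise transpose fallback, mirroring the tail kernel quoted
// above; platform-specific kernels can overload or specialize it.
template <typename TDST, typename TSRC>
inline void transpose_tail(TDST* dst,
                           const TSRC* src,
                           size_t n_cnt,
                           size_t k_cnt,
                           size_t dst_stride,
                           size_t src_stride) {
    for (size_t i = 0; i < n_cnt; i++) {
        for (size_t j = 0; j < k_cnt; j++) {
            dst[j * dst_stride + i] = static_cast<TDST>(src[j + i * src_stride]);
        }
    }
}

}  // namespace ov::intel_cpu
```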
#endif

#if defined(OPENVINO_ARCH_X86_64) || defined(OPENVINO_ARCH_ARM64)
struct Xattn {
Just a general comment on the Xattn struct: it's not a template class, and its methods aren't templates either. Also, the implementations look too complex to take advantage of inlining.
Can we consider moving the method definitions to a .cpp file?
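A rough sketch of the suggested split, with purely illustrative member names (the real Xattn interface lives in xattention.hpp):

```cpp
// --- xattention.hpp: lightweight declarations ---
struct Xattn {
    void estimate_block_mask();  // BRGEMM-based block estimation
    void apply_mask();           // applies the selected sparse blocks
};

// --- xattention.cpp: out-of-line definitions, compiled once ---
// This works because Xattn and its methods are not templates, so nothing
// forces the definitions to stay in the header.
void Xattn::estimate_block_mask() { /* heavy estimation code */ }
void Xattn::apply_mask() { /* mask application code */ }
```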
sparse_scale = (_sparse_mask_block_size == 0 || _sparse_mask_block_size == _block_size)
                   ? 1
                   : (_sparse_mask_block_size / _block_size);  // >=1
map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
    if (sparse_scale == 1) {
        return std::pair<size_t, size_t>{q_blk_rt, k_blk_rt};
    }
    size_t q_mask = q_blk_rt / sparse_scale;
    size_t k_mask = k_blk_rt / sparse_scale;
    return std::pair<size_t, size_t>{q_mask, k_mask};
};
Suggested change (drop the sparse_scale == 1 special case inside the lambda):
sparse_scale = (_sparse_mask_block_size == 0 || _sparse_mask_block_size == _block_size)
                   ? 1
                   : (_sparse_mask_block_size / _block_size);  // >=1
map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
    size_t q_mask = q_blk_rt / sparse_scale;
    size_t k_mask = k_blk_rt / sparse_scale;
    return std::pair<size_t, size_t>{q_mask, k_mask};
};
Because sparse_scale == 1 corresponds to the default case anyway.
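As a tiny usage illustration of the mapping above, with assumed values only (_block_size = 32, _sparse_mask_block_size = 128, hence sparse_scale = 4):

```cpp
#include <cstddef>
#include <utility>

int main() {
    const size_t sparse_scale = 4;  // _sparse_mask_block_size / _block_size (assumed 128 / 32)
    auto map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
        return std::pair<size_t, size_t>{q_blk_rt / sparse_scale, k_blk_rt / sparse_scale};
    };
    // Runtime blocks 0..3 map to mask block 0, 4..7 to mask block 1, and so on.
    auto [q_mask, k_mask] = map_to_mask_idx(6, 13);  // -> {1, 3}
    (void)q_mask;
    (void)k_mask;
    return 0;
}
```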
// xattention_threshold.resize<float>({1});
// xattention_threshold.ptr<float>()[0] = 0.9f;
// xattention_stride = 16;
// xattention_block_size = 128;
Commented-out code; please remove it.
Do we have any e2e tests of this feature? Can we extend the existing PA functional test to cover this Xattention feature?
Details:
Tickets: