[CPU] Implement X-Attention for intel CPU #32086
base: master
Conversation
zhangYiIntel left a comment
The current design only considers a single batch; with multiple batches or chunked inputs, e.g. continuous batching, this PR cannot handle them correctly. Maybe we need a TODO to track this.
// Instead of writing -inf directly into scores, build a softmax mask (0/-inf) and pass it to the kernel
DATA_TYPE* softmax_mask = nullptr;
std::vector<DATA_TYPE> softmax_mask_storage;
Please consider making softmax_mask_storage a class member of MHAHelper.
We have optimized the sparse softmax mask generation in the updated commit; no extra sparse softmax mask generation is needed anymore.
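For context, here is a minimal sketch of the additive 0/-inf mask idea referenced in the quoted comment, assuming a block-granular, row-major keep-mask; the function and parameter names are hypothetical, and, per the reply above, the updated commit no longer materializes such a buffer:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Build an additive softmax mask: 0 for blocks the kernel should attend to,
// -inf for pruned blocks. The mask is added to the attention logits before
// softmax, so pruned blocks receive zero weight.
std::vector<float> build_softmax_mask(const std::vector<bool>& block_keep,  // [q_blocks * k_blocks]
                                      size_t q_blocks,
                                      size_t k_blocks) {
    std::vector<float> mask(q_blocks * k_blocks, 0.0f);
    for (size_t i = 0; i < mask.size(); ++i) {
        if (!block_keep[i]) {
            mask[i] = -std::numeric_limits<float>::infinity();
        }
    }
    return mask;
}
```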
const PlainTensor& block_indices_begins,
const PlainTensor& alibi_slopes,
const PlainTensor& score_aggregation_window) {
const PlainTensor& score_aggregation_window,
Please consider removing the default value for sparse_attention_mask, to avoid misuse by outer functions.
Done
// xattention_block_size = 128;

// If to support second token sparse attention, need generate sparse mask after concat_pastkv
if (xattention_threshold && q.size(0) > 1) {
The prefill-phase check here is not robust; please follow the code in
if (past_lens.m_dims[0] >= nthr || _workitems.get_reorder_max_batch_size() > 0) {
Hi @zhangYiIntel: from my understanding, this check, if (past_lens.m_dims[0] >= nthr || _workitems.get_reorder_max_batch_size() > 0), covers two conditions: '_workitems.get_reorder_max_batch_size() > 0' for the first token, and 'past_lens.m_dims[0] >= nthr' for the second token with a long past_kv.
Since our current CPU X-Attention implementation only supports first-token cases, it seems we can use 'q.size(0) > 1' (equivalent to '_workitems.get_reorder_max_batch_size() > 0') for the X-Attention check.
Force-pushed from 7dc5202 to 51840cb
We have added this limitation/TODO to ticket CVS-171057.
zhangYiIntel left a comment
LGTM, please fix the minor changes.
Pull Request Overview
This PR implements X-Attention for Intel CPU to accelerate long prompt inference by generating sparse attention blocks in a pre-inference stage before PagedAttention execution.
Key Changes:
- Added X-Attention computation kernel that estimates sparse attention block masks based on query-key similarity scores (see the sketch after this list)
- Integrated sparse mask filtering into the existing PagedAttention execution pipeline
- Extended softmax kernel to support sparse masking with block-level granularity
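To make the estimation step concrete, below is a minimal, self-contained sketch of the thresholded block-selection idea, assuming per-(query block, key block) importance scores in a row-major buffer; select_blocks and the greedy cumulative-softmax policy are illustrative assumptions, not the PR's BRGEMM kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// For each query block, keep the smallest set of key blocks whose softmax
// mass reaches `threshold` (e.g. 0.9); all other blocks are pruned.
std::vector<bool> select_blocks(const std::vector<float>& scores,  // [q_blocks * k_blocks]
                                size_t q_blocks,
                                size_t k_blocks,
                                float threshold) {
    std::vector<bool> keep(q_blocks * k_blocks, false);
    std::vector<float> prob(k_blocks);
    std::vector<size_t> order(k_blocks);
    for (size_t q = 0; q < q_blocks; ++q) {
        const float* row = scores.data() + q * k_blocks;
        // Row-wise softmax over key blocks.
        float row_max = *std::max_element(row, row + k_blocks);
        float sum = 0.0f;
        for (size_t k = 0; k < k_blocks; ++k) {
            prob[k] = std::exp(row[k] - row_max);
            sum += prob[k];
        }
        for (auto& p : prob) {
            p /= sum;
        }
        // Greedily keep the highest-probability blocks until the cumulative
        // mass reaches the threshold.
        std::iota(order.begin(), order.end(), size_t{0});
        std::sort(order.begin(), order.end(), [&](size_t a, size_t b) { return prob[a] > prob[b]; });
        float acc = 0.0f;
        for (size_t k : order) {
            keep[q * k_blocks + k] = true;
            acc += prob[k];
            if (acc >= threshold) {
                break;
            }
        }
    }
    return keep;
}
```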
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| xattention_test.cpp | Unit tests for X-Attention functionality comparing against reference implementation |
| CMakeLists.txt | Added xattention_test.cpp exclusion for non-x86_64 builds and included reference headers |
| xattention.hpp | Core X-Attention implementation with BRGEMM-based attention estimation and block selection |
| softmax_kernel.hpp | Extended softmax kernel with sparse mask support via new template parameter |
| executor_pa.cpp | Integration of X-Attention into PagedAttention execution pipeline with sparse mask filtering |
    (S * stride * K_H)  // src stride
);
# else
OPENVINO_THROW("xattention: bf16 needs avx512+ hardware.");
Copilot AI, Oct 21, 2025
Error message uses inconsistent punctuation. Remove the period at the end for consistency with other error messages in the codebase, such as line 311.
Suggested change:
- OPENVINO_THROW("xattention: bf16 needs avx512+ hardware.");
+ OPENVINO_THROW("xattention: bf16 needs avx512+ hardware");
@mangguo321, could you please resolve merge conflicts?
…ftmax_kernel(float) to support sparse mask
… implementation of scale_add2_reduce_max not only as a reference but also as a tail processing step (primarily for ARM).
Force-pushed from 40e7ec4 to 2939bc6
inline void transpose_tailx16_kernel(TDST* dst,
                                     TSRC* src,
                                     size_t n_cnt,
                                     size_t k_cnt,
                                     size_t dst_stride,
                                     size_t src_stride) {
    for (size_t i = 0; i < n_cnt; i++) {
        for (size_t j = 0; j < k_cnt; j++) {
            dst[j * dst_stride + i] = static_cast<TDST>(src[j + i * src_stride]);
        }
    }
}

template <typename TDST,
          ov::element::Type_t SRC_PREC,
          std::enable_if_t<(none_of(SRC_PREC, ov::element::i8, ov::element::u8, ov::element::u4)), bool> = true>
void transpose_16NxK(TDST* dst,
                     void* src,
                     const size_t N,
                     const size_t K,
                     const size_t block_size,
                     const size_t dst_stride,
                     const size_t src_stride) {
We have pretty similar templates in src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp. Can't we extract them into a common header transpose.hpp and reuse them across implementations?
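For illustration, a sketch of what such a shared transpose.hpp could expose; the header name comes from the comment above, while the namespace and the generic scalar fallback shown here are assumptions (the SIMD 16xK kernels would keep their existing platform-specific specializations behind the same interface):

```cpp
#pragma once
#include <cstddef>

namespace ov::intel_cpu {

// Plain element-wise transpose fallback, mirroring the tail kernel quoted
// above; platform-specific kernels can overload or specialize it.
template <typename TDST, typename TSRC>
inline void transpose_tail(TDST* dst,
                           const TSRC* src,
                           size_t n_cnt,
                           size_t k_cnt,
                           size_t dst_stride,
                           size_t src_stride) {
    for (size_t i = 0; i < n_cnt; i++) {
        for (size_t j = 0; j < k_cnt; j++) {
            dst[j * dst_stride + i] = static_cast<TDST>(src[j + i * src_stride]);
        }
    }
}

}  // namespace ov::intel_cpu
```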
#endif

#if defined(OPENVINO_ARCH_X86_64) || defined(OPENVINO_ARCH_ARM64)
struct Xattn {
Just a general comment on the Xattn struct: it's not a template class, and its methods aren't templates either. Also, the implementations look too complex to take advantage of inlining.
Can we consider moving the method definitions to a .cpp file?
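A rough sketch of the suggested split, with purely illustrative member names (the real Xattn interface lives in xattention.hpp):

```cpp
// --- xattention.hpp: lightweight declarations ---
struct Xattn {
    void estimate_block_mask();  // BRGEMM-based block estimation
    void apply_mask();           // applies the selected sparse blocks
};

// --- xattention.cpp: out-of-line definitions, compiled once ---
// This works because Xattn and its methods are not templates, so nothing
// forces the definitions to stay in the header.
void Xattn::estimate_block_mask() { /* heavy estimation code */ }
void Xattn::apply_mask() { /* mask application code */ }
```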
sparse_scale = (_sparse_mask_block_size == 0 || _sparse_mask_block_size == _block_size)
                   ? 1
                   : (_sparse_mask_block_size / _block_size);  // >=1
map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
    if (sparse_scale == 1) {
        return std::pair<size_t, size_t>{q_blk_rt, k_blk_rt};
    }
    size_t q_mask = q_blk_rt / sparse_scale;
    size_t k_mask = k_blk_rt / sparse_scale;
    return std::pair<size_t, size_t>{q_mask, k_mask};
};
Suggested change (drop the sparse_scale == 1 special case inside the lambda):
sparse_scale = (_sparse_mask_block_size == 0 || _sparse_mask_block_size == _block_size)
                   ? 1
                   : (_sparse_mask_block_size / _block_size);  // >=1
map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
    size_t q_mask = q_blk_rt / sparse_scale;
    size_t k_mask = k_blk_rt / sparse_scale;
    return std::pair<size_t, size_t>{q_mask, k_mask};
};
Because sparse_scale == 1 corresponds to the default case anyway.
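As a tiny usage illustration of the mapping above, with assumed values only (_block_size = 32, _sparse_mask_block_size = 128, hence sparse_scale = 4):

```cpp
#include <cstddef>
#include <utility>

int main() {
    const size_t sparse_scale = 4;  // _sparse_mask_block_size / _block_size (assumed 128 / 32)
    auto map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
        return std::pair<size_t, size_t>{q_blk_rt / sparse_scale, k_blk_rt / sparse_scale};
    };
    // Runtime blocks 0..3 map to mask block 0, 4..7 to mask block 1, and so on.
    auto [q_mask, k_mask] = map_to_mask_idx(6, 13);  // -> {1, 3}
    (void)q_mask;
    (void)k_mask;
    return 0;
}
```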
// xattention_threshold.resize<float>({1});
// xattention_threshold.ptr<float>()[0] = 0.9f;
// xattention_stride = 16;
// xattention_block_size = 128;
Commented-out code; please remove it.
Do we have any e2e tests of this feature? Can we extend the existing PA functional test to cover this Xattention feature?
Details:
Tickets: