Conversation

Contributor

@mangguo321 commented Sep 16, 2025

Details:

  • Implement X-Attention, which runs in a pre-inference stage before PagedAttention in the attention pipeline to generate sparse attention blocks and accelerate long-prompt inference (see the sketch below)
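
To make the mechanism concrete, here is a minimal C++ sketch of the final block-selection step, assuming per-block importance scores have already been softmax-normalized; select_blocks and its signature are illustrative, not the PR's actual API:

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Given normalized importance scores for the K-blocks of one query block,
// keep the smallest set of blocks whose cumulative probability mass reaches
// `threshold` (e.g. 0.9), returning a 0/1 keep-mask over the K-blocks.
std::vector<uint8_t> select_blocks(const std::vector<float>& block_scores, float threshold) {
    std::vector<size_t> order(block_scores.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return block_scores[a] > block_scores[b];
    });
    std::vector<uint8_t> mask(block_scores.size(), 0);
    float cumulative = 0.0f;
    for (size_t idx : order) {
        mask[idx] = 1;
        cumulative += block_scores[idx];
        if (cumulative >= threshold) {
            break;
        }
    }
    return mask;
}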

Tickets:

@mangguo321 requested review from a team as code owners September 16, 2025 03:33
github-actions bot added the category: CPU (OpenVINO CPU plugin) and category: build (OpenVINO cmake script / infra) labels Sep 16, 2025
@mangguo321 marked this pull request as draft September 16, 2025 03:34
@zhangYiIntel self-assigned this Sep 30, 2025
Contributor

@zhangYiIntel left a comment

The current design only considers a single batch; with multiple batches or chunked inputs (e.g. continuous batching), this PR cannot handle them correctly. Maybe we need a TODO to track this.


// Instead of writing -inf directly into scores, build a softmax mask (0/-inf) and pass it to the kernel
DATA_TYPE* softmax_mask = nullptr;
std::vector<DATA_TYPE> softmax_mask_storage;
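
A self-contained sketch of what the quoted approach amounts to (build_softmax_mask is a hypothetical name, and as noted later in this thread the extra mask generation was eventually optimized away):

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Expand a per-block keep-mask into an additive softmax mask: 0 for kept
// positions, -inf for masked ones, so the kernel adds it to the scores
// instead of overwriting them.
std::vector<float> build_softmax_mask(const std::vector<uint8_t>& block_mask,
                                      size_t block_size, size_t len) {
    std::vector<float> mask(len, 0.0f);
    for (size_t i = 0; i < len; i++) {
        if (!block_mask[i / block_size]) {
            mask[i] = -std::numeric_limits<float>::infinity();
        }
    }
    return mask;
}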
Contributor

Please consider making softmax_mask_storage a class member of MHAHelper.
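
A sketch of that refactor (MHAHelper's real members are omitted, and float stands in for the kernel's DATA_TYPE):

#include <cstddef>
#include <vector>

struct MHAHelper {
    // Scratch buffer kept as a member so it is allocated once and reused
    // across attention calls instead of being rebuilt per invocation.
    std::vector<float> _softmax_mask_storage;

    float* softmax_mask_scratch(size_t required) {
        if (_softmax_mask_storage.size() < required) {
            _softmax_mask_storage.resize(required);  // grows, never shrinks
        }
        return _softmax_mask_storage.data();
    }
};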

Contributor

We have optimized this sparse softmax mask generation in the updated commit; no extra sparse softmax mask generation is needed anymore.

const PlainTensor& block_indices_begins,
const PlainTensor& alibi_slopes,
const PlainTensor& score_aggregation_window) {
const PlainTensor& score_aggregation_window,
Contributor

Please consider removing the default value for sparse_attention_mask, to avoid misuse from outer functions.
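
The concern, illustrated with hypothetical signatures (the stand-in struct below replaces the real PlainTensor type):

struct PlainTensor {};  // stand-in for the real tensor type

// With a defaulted mask, a call site that forgets the argument silently
// falls back to dense attention:
//   void exec(const PlainTensor& q, const PlainTensor& sparse_attention_mask = {});
// Removing the default forces every caller to pass the mask explicitly:
void exec(const PlainTensor& q, const PlainTensor& sparse_attention_mask);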

Contributor

Done

// xattention_block_size = 128;

// To support second-token sparse attention, the sparse mask must be generated after concat_pastkv
if (xattention_threshold && q.size(0) > 1) {
Contributor

The check for the prefill phase here is not robust; please follow this existing check:

if (past_lens.m_dims[0] >= nthr || _workitems.get_reorder_max_batch_size() > 0) {

Contributor

Hi @zhangYiIntel: from my understanding, the check if (past_lens.m_dims[0] >= nthr || _workitems.get_reorder_max_batch_size() > 0) covers two conditions: _workitems.get_reorder_max_batch_size() > 0 for the first token, and past_lens.m_dims[0] >= nthr for the second token with a long past_kv.
Since our current CPU X-Attention implementation only supports first-token cases, it seems we can use q.size(0) > 1 (equivalent to _workitems.get_reorder_max_batch_size() > 0) for the X-Attention check.
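
Putting that discussion together as a compilable sketch (a standalone approximation; the real gating lives in executor_pa.cpp and reads _workitems and past_lens directly):

#include <cstddef>

// reorder_max_batch_size > 0 only on the first-token (prefill) pass, which is
// the only phase the current CPU X-Attention implementation supports.
inline bool is_prefill(size_t reorder_max_batch_size) {
    return reorder_max_batch_size > 0;
}

// Gate X-Attention on prefill plus a configured threshold rather than on the
// query length alone.
inline bool should_run_xattention(bool has_threshold, size_t reorder_max_batch_size) {
    return has_threshold && is_prefill(reorder_max_batch_size);
}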

@liubo-intel force-pushed the mang/xatt branch 2 times, most recently from 7dc5202 to 51840cb on September 30, 2025 12:30
@liubo-intel
Contributor

The current design only considers a single batch; with multiple batches or chunked inputs (e.g. continuous batching), this PR cannot handle them correctly. Maybe we need a TODO to track this.

Have added this limitation and the TODOs to ticket CVS-171057.

@mangguo321 marked this pull request as ready for review October 12, 2025 23:09
github-actions bot added and then removed the category: build (OpenVINO cmake script / infra) label Oct 14, 2025
Contributor

@zhangYiIntel left a comment

LGTM, please address the minor comments.

Copilot AI left a comment

Pull Request Overview

This PR implements X-Attention for Intel CPU to accelerate long prompt inference by generating sparse attention blocks in a pre-inference stage before PagedAttention execution.

Key Changes:

  • Added X-Attention computation kernel that estimates sparse attention block masks based on query-key similarity scores
  • Integrated sparse mask filtering into the existing PagedAttention execution pipeline
  • Extended softmax kernel to support sparse masking with block-level granularity (sketched below)
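
A standalone sketch of that third item, block-granular sparse softmax (illustrative only; the real kernel lives in softmax_kernel.hpp and gains the behavior through a new template parameter). It assumes at least one key block per row stays unmasked:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Masked-out key blocks are skipped in the max/sum passes and zeroed in the
// output, so downstream value accumulation can skip them entirely.
template <bool SPARSE>
void softmax_row(float* scores, size_t len, size_t block_size, const uint8_t* block_mask) {
    float max_v = -std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < len; i++) {
        if (SPARSE && !block_mask[i / block_size]) {
            continue;
        }
        max_v = std::max(max_v, scores[i]);
    }
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        if (SPARSE && !block_mask[i / block_size]) {
            scores[i] = 0.0f;  // masked block contributes nothing
            continue;
        }
        scores[i] = std::exp(scores[i] - max_v);
        sum += scores[i];
    }
    for (size_t i = 0; i < len; i++) {
        if (!SPARSE || block_mask[i / block_size]) {
            scores[i] /= sum;
        }
    }
}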

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • xattention_test.cpp: Unit tests for X-Attention functionality comparing against a reference implementation
  • CMakeLists.txt: Added xattention_test.cpp exclusion for non-x86_64 builds and included reference headers
  • xattention.hpp: Core X-Attention implementation with BRGEMM-based attention estimation and block selection
  • softmax_kernel.hpp: Extended softmax kernel with sparse mask support via a new template parameter
  • executor_pa.cpp: Integration of X-Attention into the PagedAttention execution pipeline with sparse mask filtering

(S * stride * K_H) // src stride
);
# else
OPENVINO_THROW("xattention: bf16 needs avx512+ hardware.");
Copilot AI commented Oct 21, 2025

Error message uses inconsistent punctuation. Remove the period at the end for consistency with other error messages in the codebase, such as line 311.

Suggested change
OPENVINO_THROW("xattention: bf16 needs avx512+ hardware.");
OPENVINO_THROW("xattention: bf16 needs avx512+ hardware");

@maxnick added this to the 2026.0 milestone Nov 19, 2025
@maxnick self-assigned this Nov 19, 2025
Contributor

@maxnick commented Nov 19, 2025

@mangguo321, could you please resolve merge conflicts?

mangguo321 and others added 25 commits November 20, 2025 16:27
… implementation of scale_add2_reduce_max not only as a reference but also as a tail processing step (primarily for ARM).
Comment on lines +31 to +53
inline void transpose_tailx16_kernel(TDST* dst,
                                     TSRC* src,
                                     size_t n_cnt,
                                     size_t k_cnt,
                                     size_t dst_stride,
                                     size_t src_stride) {
    for (size_t i = 0; i < n_cnt; i++) {
        for (size_t j = 0; j < k_cnt; j++) {
            dst[j * dst_stride + i] = static_cast<TDST>(src[j + i * src_stride]);
        }
    }
}

template <typename TDST,
          ov::element::Type_t SRC_PREC,
          std::enable_if_t<(none_of(SRC_PREC, ov::element::i8, ov::element::u8, ov::element::u4)), bool> = true>
void transpose_16NxK(TDST* dst,
                     void* src,
                     const size_t N,
                     const size_t K,
                     const size_t block_size,
                     const size_t dst_stride,
                     const size_t src_stride) {
Contributor

We have pretty similar templates in src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp. Can't we extract them into a common header transpose.hpp and reuse them across implementations?
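
One possible shape for that shared header; the file name comes from the suggestion, while the contents are a sketch rather than the merged code:

// transpose.hpp (hypothetical, alongside the scaled_attn kernels)
#pragma once
#include <cstddef>

// Scalar tail transpose shared by executor_pa.cpp and xattention.hpp;
// arch-specific fast paths can stay in their respective translation units.
template <typename TDST, typename TSRC>
inline void transpose_tail(TDST* dst, const TSRC* src,
                           size_t n_cnt, size_t k_cnt,
                           size_t dst_stride, size_t src_stride) {
    for (size_t i = 0; i < n_cnt; i++) {
        for (size_t j = 0; j < k_cnt; j++) {
            dst[j * dst_stride + i] = static_cast<TDST>(src[j + i * src_stride]);
        }
    }
}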

#endif

#if defined(OPENVINO_ARCH_X86_64) || defined(OPENVINO_ARCH_ARM64)
struct Xattn {
Contributor

Just a general comment on the Xattn struct: it's not a template class, and its methods aren't templates either. Also, the implementations look too complex to take advantage of inlining.
Can we consider moving the method definitions to a .cpp file?

Comment on lines +853 to +863
sparse_scale = (_sparse_mask_block_size == 0 || _sparse_mask_block_size == _block_size)
                   ? 1
                   : (_sparse_mask_block_size / _block_size);  // >= 1
map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
    if (sparse_scale == 1) {
        return std::pair<size_t, size_t>{q_blk_rt, k_blk_rt};
    }
    size_t q_mask = q_blk_rt / sparse_scale;
    size_t k_mask = k_blk_rt / sparse_scale;
    return std::pair<size_t, size_t>{q_mask, k_mask};
};
Contributor

Suggested change
sparse_scale = (_sparse_mask_block_size == 0 || _sparse_mask_block_size == _block_size)
                   ? 1
                   : (_sparse_mask_block_size / _block_size);  // >= 1
map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
    if (sparse_scale == 1) {
        return std::pair<size_t, size_t>{q_blk_rt, k_blk_rt};
    }
    size_t q_mask = q_blk_rt / sparse_scale;
    size_t k_mask = k_blk_rt / sparse_scale;
    return std::pair<size_t, size_t>{q_mask, k_mask};
};
if (_sparse_mask_block_size != 0 && _sparse_mask_block_size != _block_size) {
    sparse_scale = _sparse_mask_block_size / _block_size;
    map_to_mask_idx = [sparse_scale](size_t q_blk_rt, size_t k_blk_rt) {
        size_t q_mask = q_blk_rt / sparse_scale;
        size_t k_mask = k_blk_rt / sparse_scale;
        return std::pair<size_t, size_t>{q_mask, k_mask};
    };
}

Because sparse_scale == 1 corresponds to the default case anyway.
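
For example, with _sparse_mask_block_size = 128 and _block_size = 32, sparse_scale is 4, so the runtime block pair (q_blk_rt, k_blk_rt) = (5, 9) maps to mask cell (1, 2).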

Comment on lines +2148 to +2151
// xattention_threshold.resize<float>({1});
// xattention_threshold.ptr<float>()[0] = 0.9f;
// xattention_stride = 16;
// xattention_block_size = 128;
Contributor

Commented-out code.

Contributor

@maxnick commented Nov 21, 2025

Do we have any e2e tests for this feature? Can we extend the existing PA functional tests to cover X-Attention?
