Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #16817
Conversation
Hi @CISC and @NeoZhangJianyu, we'd appreciate it if you could review our PR implementing the new SparseK Attention operator. This contribution was developed jointly by both of us (@yael-works and @GittyBurstein). Thanks in advance for your time and feedback!
We are talking about this SparseK, right?
yes! @CISC
You need to rebase to fix the Server CI failures, and please also fix the whitespace issues:
Hi @CISC, I'd really appreciate it if you could review the code itself so we can move forward with the merge. Thanks!
Yes, as mentioned, it will be resolved if you rebase, it's ok. :)
So, my main challenge is where/what/when will SparseK be used? I can't recall seeing any actual implementation being used in the wild. This also means we don't really have any reference to test it against...
@CISC Once this PR is merged, the operator can be connected to higher-level use cases such as:
Thank you!!
I think @ggerganov will have to weigh in on this.
Sparse attention implementations such as DSA and SparseK should leverage the existing FA implementations and mask filtering logic. No need to introduce new operators and duplicate all the existing work that already went into optimizing FA.
Hi @ggerganov and @CISC,
Hi @ggerganov and @CISC, |
ggerganov left a comment:
My idea was more along the following lines:
- Sparse attention implementations should somehow compute a sparse KQ mask. Depending on the specifics (e.g. local windows, top-k product, deepseek lightning stuff, etc.) this can be done in different ways, but generally it should require some extra logic when constructing the compute graph.
- Then we pass the sparse KQ mask (i.e. a normal mask but with extra -INF values where we don't have to compute the attention) to ggml_flash_attn_ext and we delegate the filtering logic to the backend implementation. For example, the Metal backend will already skip a large amount of the filtered values depending on the KQ mask contents (#16372). Similar or better logic can be added to the other backend implementations.
I think, at most, the only change to the existing ggml_flash_attn_ext API would be to provide a "mask hint" that would inform the backend what kind of mask to expect (causal, sparse, etc.). And the rest of the changes should be at the compute graph level and at the backend implementation for filtering the -INF values. Let me know if this makes sense.
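For illustration, here is a minimal host-side sketch of this idea (not code from the PR; the win_local and stride_global parameters are illustrative): positions that are allowed to attend stay at 0.0f and everything else is set to -INFINITY, and the resulting buffer would then be uploaded into the mask tensor that is handed to ggml_flash_attn_ext.

```cpp
// Sketch: build an [n_q x n_kv] mask where allowed positions are 0.0f and
// filtered positions are -INFINITY. The resulting buffer would be copied into
// the mask tensor that is passed to ggml_flash_attn_ext.
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<float> build_sparse_kq_mask(int64_t n_q, int64_t n_kv,
                                        int64_t win_local, int64_t stride_global) {
    std::vector<float> mask(n_q * n_kv, -INFINITY);

    for (int64_t iq = 0; iq < n_q; ++iq) {
        for (int64_t ik = 0; ik < n_kv; ++ik) {
            const bool causal = ik <= iq;                             // standard causal constraint
            const bool local  = causal && (iq - ik) < win_local;      // recent tokens
            const bool global = causal && (ik % stride_global) == 0;  // strided "global" tokens
            if (local || global) {
                mask[iq*n_kv + ik] = 0.0f; // attention allowed
            }
        }
    }
    return mask;
}
```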
@ggerganov And if that's the case, where exactly should the mask implementation be added: inside the compute graph logic, or only for testing (e.g., in test-backend-ops)?
In llama.cpp, the mask is already being created and passed in llama.cpp/src/llama-kv-cache.cpp, lines 1223 to 1306 (afd3532).
I think that the sparse attention implementations should augment this static mask through some extra logic. This extra logic should be implemented, for example, in the compute graph construction. From there, the FA implementations will deal with the provided mask in their own way (i.e. by skipping computations when possible).
For testing, you can already take a look at how we create KQ masks with blocks of -INF values here: llama.cpp/tests/test-backend-ops.cpp, lines 134 to 176 (afd3532).
I imagine that we would need tests that create various sorts of sparse masks and simply run ggml_flash_attn_ext with them.
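As a rough sketch of what such a test helper might look like (this is not the test-backend-ops code itself; the helper name and block sizes are made up), one could generate masks with randomly placed blocks of -INF and feed them to the attention op under test:

```cpp
// Sketch: fill a test mask with random rectangular blocks of -INFINITY,
// leaving the remaining positions at 0.0f, loosely in the spirit of the
// block-masked KQ masks used by test-backend-ops.cpp.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

std::vector<float> make_block_sparse_mask(int64_t n_q, int64_t n_kv, int n_blocks, std::mt19937 & rng) {
    std::vector<float> mask(n_q * n_kv, 0.0f);
    std::uniform_int_distribution<int64_t> pick_q (0, n_q  - 1);
    std::uniform_int_distribution<int64_t> pick_kv(0, n_kv - 1);

    for (int b = 0; b < n_blocks; ++b) {
        const int64_t q0 = pick_q(rng), k0 = pick_kv(rng);
        const int64_t q1 = std::min<int64_t>(n_q,  q0 + 16); // block height (illustrative)
        const int64_t k1 = std::min<int64_t>(n_kv, k0 + 32); // block width  (illustrative)
        for (int64_t iq = q0; iq < q1; ++iq) {
            for (int64_t ik = k0; ik < k1; ++ik) {
                mask[iq*n_kv + ik] = -INFINITY; // filtered out
            }
        }
    }
    return mask;
}
```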
…oader support

Includes only implementation files:
- llama-graph: dynamic SparseK mask builder + integration point
- llama-model: GGUF key loading for SparseK parameters
- llama-model-loader: template instantiations for bool keys
- llama-hparams: new SparseK fields
- convert_hf_to_gguf.py: emit SparseK keys in GGUF
Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
```python
if (n_head := self.find_hparam(["num_attention_heads", "n_head", "n_heads"], optional=True)) is not None:
    self.gguf_writer.add_head_count(n_head)
    logger.info(f"gguf: head count = {n_head}")
```
Restore this please.
convert_hf_to_gguf.py (outdated)
```python
# === SparseK dynamic attention metadata ===
self.gguf_writer.add_key("llama.sparsek.enable", int(self.hparams.get("sparsek_enable", 0)))
self.gguf_writer.add_key("llama.sparsek.top_k", int(self.hparams.get("sparsek_topk", 0)))
self.gguf_writer.add_key("llama.sparsek.window", int(self.hparams.get("sparsek_window", 0)))
self.gguf_writer.add_key("llama.sparsek.stride", int(self.hparams.get("sparsek_stride", 0)))
# ============================================
```
This was not what @ggerganov meant; I don't believe these are actual config values from a real model.
Additionally, don't use add_key directly, and certainly not the llama namespace like that.
Thanks! I fixed it!
Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
Update Sparse-K GGUF metadata handling, adjust graph construction, and align KV-cache/model paths with the new operator. Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
…orks/llama.cpp into feature/sparsek-attn-sycl
Corrected max_nodes_ex calculation and updated test coverage. Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
Co-authored-by: Gitty Burstein <[email protected]> Co-authored-by: Yael Shuker <[email protected]>
tests/test-sparsek_kq_mask.cpp (outdated; new file, +282 lines)
As SparseK is a special case of attention, there is no need to create a new test .cpp file.
- Move it into test-backend-ops.cpp.
- Follow the existing framework of test-backend-ops.cpp, like test_cases.emplace_back(new test_flash_attn_ext(hsk, hsv, nh, {nr2, nr3}, kv, nb, mask, sinks, max_bias, logit_softcap, prec, type_KV)).
This helps to create more cases for the ops and to extend to GPU support in the future.
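For example (purely illustrative values; the constructor arguments simply mirror the signature quoted above rather than the exact current test-backend-ops API), a masked case could be registered like this:

```cpp
// Illustrative only: register an additional masked flash-attention case,
// following the test_flash_attn_ext pattern quoted above.
test_cases.emplace_back(new test_flash_attn_ext(
    /*hsk*/ 64, /*hsv*/ 64, /*nh*/ 8, /*nr*/ {1, 1}, /*kv*/ 512, /*nb*/ 8,
    /*mask*/ true, /*sinks*/ false, /*max_bias*/ 0.0f, /*logit_softcap*/ 0.0f,
    /*prec*/ GGML_PREC_F32, /*type_KV*/ GGML_TYPE_F16));
```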
we did it!
```cpp
buf_compute_meta.resize(ggml_tensor_overhead()*max_nodes + ggml_graph_overhead_custom(max_nodes, false));
// increase meta buffer slightly to accommodate extra nodes from SparseK
int64_t max_nodes_ex = max_nodes + 128; // safety headroom
```
There is no need to define max_nodes_ex here. If you feel max_nodes is not big enough, increase it directly. Having both max_nodes and max_nodes_ex will confuse the next developer.
The +128 is strange too. If max_nodes is not correct in some case, we should fix it directly; +128 will hide the issue, and if it's not enough for a new model, the resulting crash won't be easy to fix.
Unless you have a strong reason to believe max_nodes is wrong, I suggest not changing it.
After I removed the extra +128 headroom, the model started crashing for me.
I’m trying to understand where the actual memory pools / buffer sizes are defined and managed in the code.
Where is the correct place to adjust the memory allocations, instead of relying on a hard-coded offset?
Co-authored-by: Yael Shuker <[email protected]> Co-authored-by: Gitty Burstein <[email protected]>
Hi @ggerganov @NeoZhangJianyu @CISC

Summary of Updates

Performance (CODEGEMMA — Sparse-K vs Baseline)

Additional Notes
Benchmark Command Used
```sh
./build/bin/llama-cli \
    -m /home/gitty/models/codegemma-sparsek-Q4_K_M.gguf \
    -ngl 0 \
    -c 256 \
    -n 400 \
    -t 8
```
It's good to see the improvement from this PR and the shared results. Could you update this info in the PR description too?
Thank you for the feedback!
New Attention Mechanism: SparseK Attention (CPU Backend)
This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.
Overview
SparseK Attention is a selective and efficient attention mechanism inspired by Flash Attention, but it introduces additional sparsity through:
- top-k score selection (k_top)
- a local attention window (win_local)
- strided global tokens (stride_global)
Implementation Details
- New operator GGML_OP_SPARSEK_ATTN defined in ggml.h and ggml.c.
- API function ggml_sparsek_attn() that creates a computation node with parameters (k_top, win_local, stride_global).
- CPU implementation in ggml-cpu/ops.h, ggml-cpu/ops.cpp, and ggml-cpu.c.

The CPU version includes:
- Scaled attention scores QKᵀ / √d
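To illustrate how the new operator is meant to be wired into a compute graph, here is a small hypothetical usage sketch. The call shape of ggml_sparsek_attn is assumed from the parameter list above, not taken from the PR diff, and the tensor shapes and parameter values are illustrative; the surrounding graph setup uses standard ggml API calls.

```cpp
// Hypothetical usage sketch of the new operator (signature assumed):
// build a small graph that runs SparseK attention on the CPU backend.
#include "ggml.h"
#include "ggml-cpu.h"

int main() {
    struct ggml_init_params params = { /*mem_size*/ 64*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t head_dim = 64, n_tokens = 128, n_heads = 8;

    // Q/K/V tensors (their data would be filled with real activations before computing).
    struct ggml_tensor * q = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, head_dim, n_tokens, n_heads);
    struct ggml_tensor * k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, head_dim, n_tokens, n_heads);
    struct ggml_tensor * v = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, head_dim, n_tokens, n_heads);

    // Assumed call shape: the PR describes parameters (k_top, win_local, stride_global).
    struct ggml_tensor * out = ggml_sparsek_attn(ctx, q, k, v, /*k_top*/ 32, /*win_local*/ 64, /*stride_global*/ 128);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads*/ 4);

    ggml_free(ctx);
    return 0;
}
```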
Next Steps
Our next goal is to extend SparseK Attention to the SYCL (GPU) backend in order to:
We are submitting this initial CPU implementation first to ensure review, integration, and baseline correctness before introducing GPU acceleration.
Co-Authors
Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])