bench : cache the llama_context state at computed depth #16944
Conversation
I should clarify: for a MoE model this is not going to work correctly. Because the expert selection depends on the numerical input values, leaving the memory of the KV cache uninitialized is going to bias the results. My understanding is that functionality was recently added to the server which swaps the on-device KV cache with RAM. If that could be repurposed for an implementation like this, I think it would already be fast enough.
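(For reference, a minimal sketch of what saving a computed-depth state to host RAM could look like using llama.cpp's public sequence-state API; the depth_cache struct and the function name are placeholders for illustration, not code from this PR.)

```cpp
// Sketch only: snapshot the state of sequence 0 into host memory after the
// prompt has been processed, so later runs at the same depth can restore it
// instead of re-processing the prompt.
#include <cstdint>
#include <vector>

#include "llama.h"

struct depth_cache {                 // placeholder name
    int                  depth = 0;  // number of tokens covered by the cached state
    std::vector<uint8_t> data;       // serialized per-sequence state
};

static bool save_depth_state(llama_context * ctx, int depth, depth_cache & cache) {
    const size_t size = llama_state_seq_get_size(ctx, 0);
    if (size == 0) {
        return false;
    }
    cache.data.resize(size);
    if (llama_state_seq_get_data(ctx, cache.data.data(), cache.data.size(), 0) == 0) {
        return false;
    }
    cache.depth = depth;
    return true;
}
```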
Force-pushed e2f222c to 9e4cbd5
Force-pushed 9e4cbd5 to 08a3c4a
@JohannesGaessler Good idea - pushed a version that I think does what you described.
JohannesGaessler left a comment:
Thank you, this version seems to be working correctly. One caveat is that the order in which you run the batch sizes matters. Preferably you would run the large batch sizes first, since they process the initial, uncached KV context much faster. However, the syntax -ub "1-512*2" will run the batch sizes in a suboptimal order (for my purposes this doesn't matter because I'm going to script it anyway). One solution would be to always do the depth run with a constant batch size, but when we previously tried that it was causing issues (and I'm not sure it would be worth the opportunity cost to fix).
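(Purely as an editorial illustration, not code from this PR: the "large batches first" ordering described above amounts to sorting the expanded -ub list in descending order before generating the test runs.)

```cpp
// Illustration only: run the largest ubatch sizes first so the initial,
// uncached prompt at each depth is processed at the fastest rate.
#include <algorithm>
#include <functional>
#include <vector>

int main() {
    // expansion of -ub "1-512*2"
    std::vector<int> n_ubatch = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512};
    std::sort(n_ubatch.begin(), n_ubatch.end(), std::greater<int>());
    // n_ubatch is now {512, 256, ..., 2, 1}
    return 0;
}
```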
Can you point me to this previous attempt?
Force-pushed b5ce8df to f709a32
Hm yeah, not sure what's the best way to do that.
```cpp
if (params.progress) {
    fprintf(stderr, "llama-bench: benchmark %d/%zu: depth run %d/%d\n", params_idx, params_count,
            i + 1, params.reps);
}

bool is_cached = t.n_depth == cstate.depth;
```
Wouldn't this also need to check that the model is the same, that the same KV type is used, or other parameters that could make the cache incompatible with the current test?
llama_state_seq_set_data should (in theory) return an error (i.e. 0) when the state is incompatible with the current llama_context. I did try a few cases (different models, different KV cache types) and it seems to work as expected.
But it is a bit risky: if somehow its internal checks fail to detect an incompatibility, that could lead to invalid benches. So not sure - we could simplify the logic to just reuse the state for the repetitions of the same test?
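(Again only a sketch under the assumptions above - a state blob saved earlier, the documented behavior that llama_state_seq_set_data returns 0 when the data cannot be applied to the current context, and the current llama_memory_clear / llama_get_memory API - not the exact code in this PR.)

```cpp
// Sketch only: try to restore a previously saved sequence state; if
// llama_state_seq_set_data returns 0 the blob is incompatible with the
// current context (different model, KV cache type, ...), so clear the
// memory and let the caller re-process the prompt from scratch.
#include <cstdint>
#include <vector>

#include "llama.h"

static bool try_restore_depth_state(llama_context * ctx, const std::vector<uint8_t> & blob) {
    if (blob.empty()) {
        return false;
    }
    if (llama_state_seq_set_data(ctx, blob.data(), blob.size(), 0) == 0) {
        llama_memory_clear(llama_get_memory(ctx), true); // start from a clean state
        return false;
    }
    return true;
}
```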
Force-pushed a7bec56 to 9c6bc80
Did some extra tests with SSMs and it works as expected. Let's keep an eye out just in case.
See #16944 (comment)
Sample commands:
```sh
make -j && ./bin/llama-bench -m ../models/gpt-oss-20b/ggml-model-mxfp4.gguf -t 1 -fa 1 -b 16384 -ub 2048 -d 0,1024,2048,4096,8192,16384,32768 -n 32 -p 2048

make -j && ./bin/llama-bench -m ../models/qwen2.5-3b-coder/ggml-model-q4_k.gguf -fa 1 -d 1024 -p 512 -ctk f16,q8_0
```