Conversation

@ggerganov (Member) commented Nov 2, 2025

See #16944 (comment)

Sample commands:

make -j && ./bin/llama-bench -m ../models/gpt-oss-20b/ggml-model-mxfp4.gguf -t 1 -fa 1 -b 16384 -ub 2048 -d 0,1024,2048,4096,8192,16384,32768 -n 32 -p 2048
make -j && ./bin/llama-bench -m ../models/qwen2.5-3b-coder/ggml-model-q4_k.gguf -fa 1 -d 1024 -p 512 -ctk f16,q8_0

@JohannesGaessler (Collaborator) commented

I should clarify: for a MoE model this is not going to work correctly. Because the expert selection depends on the numerical input values, leaving the memory of the KV cache uninitialized is going to bias the results.

My understanding is that functionality was recently added to the server that swaps the on-device KV cache out to RAM. If that could be repurposed for an implementation like this, I think it would already be fast enough:

  1. Do depth run with default batch size of 512.
  2. Save the state of the KV cache.
  3. For each benchmark run, load the KV cache state first instead of re-calculating it.
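A minimal sketch of that save/restore flow, assuming the llama_state_seq_* API already exposed by llama.cpp (exact signatures may vary between versions); the helper names, the single sequence id 0, and the buffer handling are illustrative, not the actual llama-bench implementation:

```cpp
#include <cstdint>
#include <vector>

#include "llama.h"

// After the depth run (prompt processed once with the default batch size),
// snapshot the state of sequence 0 into a host buffer.
static std::vector<uint8_t> save_depth_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_state_seq_get_size(ctx, 0));
    llama_state_seq_get_data(ctx, buf.data(), buf.size(), 0);
    return buf;
}

// Before each benchmark run, restore the snapshot instead of re-decoding the
// prompt. Returns false if the saved state could not be applied.
static bool load_depth_state(llama_context * ctx, const std::vector<uint8_t> & buf) {
    return llama_state_seq_set_data(ctx, buf.data(), buf.size(), 0) != 0;
}
```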

@ggerganov force-pushed the gg/context-skip-compute branch from e2f222c to 9e4cbd5 on November 3, 2025, 11:25
@ggerganov changed the title from "llama : add option to skip the compute of a batch" to "bench : cache the llama_context state at computed depth" on November 3, 2025
@ggerganov force-pushed the gg/context-skip-compute branch from 9e4cbd5 to 08a3c4a on November 3, 2025, 12:21
@ggerganov (Member, Author) commented

@JohannesGaessler Good idea - pushed a version that I think does what you described.

@JohannesGaessler (Collaborator) left a comment

Thank you, this version seems to be working correctly. One caveat is that the order in which you run the batch sizes matters. Preferably you would run the large batch sizes first, since they process the initial, uncached KV context much faster. However, the syntax -ub "1-512*2" runs the batch sizes in a suboptimal order (for my purposes this doesn't matter, because I'm going to script it anyway). One solution would be to always do the depth run with a constant batch size, but when we tried that previously it caused issues (and I'm not sure it would be worth the opportunity cost to fix).

@ggerganov (Member, Author) commented

> One solution would be to always do the depth run with a constant batch size, but when we tried that previously it caused issues (and I'm not sure it would be worth the opportunity cost to fix).

Can you point me to this previous attempt?

@JohannesGaessler (Collaborator) commented

#13096

@ggerganov force-pushed the gg/context-skip-compute branch from b5ce8df to f709a32 on November 3, 2025, 14:07
@ggerganov (Member, Author) commented

> One solution would be to always do the depth run with a constant batch size, but when we tried that previously it caused issues (and I'm not sure it would be worth the opportunity cost to fix).

Hm yeah, not sure what the best way to do that would be.

@ggerganov marked this pull request as ready for review on November 4, 2025, 19:45
@ggerganov requested a review from slaren as a code owner on November 4, 2025, 19:45
if (params.progress) {
    fprintf(stderr, "llama-bench: benchmark %d/%zu: depth run %d/%d\n", params_idx, params_count,
            i + 1, params.reps);
}
bool is_cached = t.n_depth == cstate.depth;
Member left a comment

Wouldn't this also need to check whether the model is the same, whether it uses the same KV type, or other parameters that may make the cache incompatible with the current test?

@ggerganov (Member, Author) replied

llama_state_seq_set_data should (in theory) return an error (i.e. 0) when the state is incompatible with the current llama_context. I did try a few cases (different models, different KV cache types) and it seems to work as expected.

But it is a bit risky: if somehow its internal checks fail to detect an incompatibility, that could lead to invalid benches. So not sure - we could simplify the logic to just reuse the state for the repetitions of the same test?
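For illustration, the caller can treat a zero return as "state not reusable" and fall back to recomputing the depth. This is only a sketch, not the actual llama-bench code; cstate.data is a hypothetical field holding the bytes captured after the initial depth run:

```cpp
// Sketch: try to reuse the cached state for sequence 0. A return value of 0
// means the saved bytes are incompatible with the current llama_context
// (different model, KV cache type, ...), so fall back to recomputing the
// depth tokens instead of restoring them.
const size_t n_read = llama_state_seq_set_data(ctx, cstate.data.data(), cstate.data.size(), 0);
if (n_read == 0) {
    // ... clear the cache state and re-decode the depth tokens as before ...
}
```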

@ggerganov force-pushed the gg/context-skip-compute branch from a7bec56 to 9c6bc80 on November 7, 2025, 18:17
@ggerganov (Member, Author) commented

Did some extra tests with SSMs and it works as expected. Let's keep an eye out just in case - using --progress is useful for debugging this feature.

@ggerganov merged commit 7956bb4 into master on November 7, 2025 (61 of 66 checks passed)
@ggerganov deleted the gg/context-skip-compute branch on November 7, 2025, 19:23