CUDA: only use moe_expert_reduce when n_tokens=1 #17032
base: master
Conversation
@slaren is there a way to detect that a buffer might be overridden?
No. I am not sure what you are trying to do, but what you are asking is something that the backend should not be concerned with.
I'm trying to turn off fusion if there is a buffer override.
That would be a workaround, not an actual solution. We need to find the source of the problem and fix that. I mentioned before that I suspect that
That's not the problem here, at least. I'm thinking it might be something to do with different sizes of the tensors between the mmq buffer and the CPU weights: mmq does some padding to avoid boundary checks internally.
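For context, a minimal sketch of that padding idea, with made-up names (`ROW_PAD`, `padded_row_elems`) rather than the real ggml-cuda identifiers: quantized rows in the device buffer are rounded up so the MMQ kernels can skip per-row boundary checks, which is why the device-side tensor can be larger than the exact-sized CPU weights.

```cpp
#include <cstddef>
#include <cstdio>

// Assumed padding granularity, in elements; the real constant may differ.
constexpr std::size_t ROW_PAD = 512;

// Round a row length up to the next multiple of ROW_PAD.
static std::size_t padded_row_elems(std::size_t ne0) {
    return ((ne0 + ROW_PAD - 1) / ROW_PAD) * ROW_PAD;
}

int main() {
    const std::size_t ne0 = 11008; // example FFN weight row length
    // The mmq buffer is sized for the padded row; the CPU weights are exact-sized.
    std::printf("logical: %zu elems, padded: %zu elems\n", ne0, padded_row_elems(ne0));
    return 0;
}
```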
Interestingly, this bug only manifests when there is
Can you please specify the repro more closely? Does it happen in the prefill phase or the token-generation phase? The default behavior for multi-GPU and split-GPU is that we split the cgraph into multiple subgraphs. This triggers the consecutive-update check, which effectively disables CUDA graphs from the 2nd/3rd call to the main model onwards (see the sketch below):
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 3550 to 3555 in 22c8c3c
Do you observe a repro when launching with
This is worrisome.
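A minimal sketch of the consecutive-update heuristic mentioned above, assuming hypothetical field names and threshold (the real logic lives in ggml-cuda.cu around the lines referenced):

```cpp
// Illustrative only: if the captured graph topology keeps changing between
// launches, re-capturing a CUDA graph every time costs more than it saves,
// so graph use is disabled after a few consecutive updates.
struct cuda_graph_state {
    int  consecutive_updates     = 0;
    bool disabled_due_to_updates = false;
};

constexpr int MAX_CONSECUTIVE_UPDATES = 4; // assumed threshold

void on_graph_launch(cuda_graph_state & g, bool topology_changed) {
    if (topology_changed) {
        if (++g.consecutive_updates >= MAX_CONSECUTIVE_UPDATES) {
            g.disabled_due_to_updates = true; // fall back to regular kernel launches
        }
    } else {
        g.consecutive_updates = 0; // a stable topology resets the counter
    }
}
```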
Yes, I can repro with

Also, it goes away with
I would recommend verifying via nsys/printf, but in that case it is not a CUDA graph issue.
Have you inspected the graph after it is split, but before it is fused? Maybe we split around/in the node pattern we match for in the fusion? Is fusion correctly disabled then?
The problem starts with llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 4137 to 4143 in 2759ccd
Although I don't have enough expertise in ggml graphs, they looked fine to me. Attaching both a good graph run (2x 4090) and a bad graph run (1x 4090 + 1x 5090) with
Yup, just spotted
Selectively offloading layer by layer from the back, I see the problem first occurs when offloading a layer between the 4090 and the 5090. EDIT: perhaps unsurprisingly, offloading just that layer also causes the same problem.
Okay, the issue is that llama-graph.cpp seems to recycle the weights tensor:

```diff
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index f9751b318..606d6ae1a 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -1119,6 +1119,8 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
         ggml_build_forward_expand(gf, cur_experts[i]);
     }
 
+    weights = ggml_dup(ctx0, weights);
+    ggml_build_forward_expand(gf, weights);
 
     // aggregate experts
     // note: here we explicitly use hparams.n_expert_used instead of n_expert_used
```
What you may be observing is that
This does not work. My understanding is that once the lifetime of a tensor ends, the allocator is free to use that memory in any way it chooses; in this case it can recycle the buffer used for the weights for dst, since the lifetime of weights ends there.
This is correct, although you can also look at what the auto in-place mechanism does as freeing the memory of a tensor on the last operation in which it is used (and reusing it for the tensor of that operation itself).
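A toy illustration of the hazard being described here, not ggml's actual allocator: once a tensor is considered dead after its last consumer, the planner may hand the same offset to that consumer's output, so a fused kernel that still reads the "dead" input while writing its output can end up reading its own partially written data.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical, simplified view of a planned allocation: name, byte offset, size.
struct planned_tensor {
    const char * name;
    std::size_t  offset;
    std::size_t  size;
};

int main() {
    // 'weights' is last used by the expert-aggregation node...
    planned_tensor weights = { "weights", 0, 256 };
    // ...so the planner is allowed to place that node's output at the same offset.
    planned_tensor dst     = { "dst",     0, 256 };

    const bool alias = dst.offset < weights.offset + weights.size &&
                       weights.offset < dst.offset + dst.size;
    std::printf("dst aliases weights: %s\n", alias ? "yes" : "no");
    return 0;
}
```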
I thought CUDA graphs are disabled in the repro

They are automatically disabled for batch_size > 1
I (at least theoretically) ran into this aliasing issue in a recent MR and added a check in the vulkan backend at https://github.com/ggml-org/llama.cpp/pull/16977/files#diff-35a5049d5eebe22eda1e0d661bd87639b31aafeba62deeaaaca9c13ec3e71d11R12903. If a fused operation is element-wise (or at least, the same thread or possibly workgroup overwrites the locations it reads), it ought to be safe to reuse the tensor memory if it overlaps exactly. So there are two cases where we ought to disable fusion: (1) memory is reused in a way where the inputs/output partially overlap (so one thread could clobber another thread's inputs), and (2) when the overlap is exact but input values are reused by different threads/workgroups.
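A sketch of those two cases as a standalone check, using hypothetical names (`range`, `fusion_allowed`) rather than the actual Vulkan backend code linked above:

```cpp
#include <cstddef>

// Byte range [begin, end) of a tensor within its backend buffer.
struct range {
    std::size_t begin;
    std::size_t end;
};

static bool overlaps(range a, range b) {
    return a.begin < b.end && b.begin < a.end;
}

// Fusion is safe if input and output are disjoint; if they alias exactly it is
// only safe for element-wise ops (each thread overwrites just what it read);
// any partial overlap means one thread can clobber another thread's inputs.
static bool fusion_allowed(range in, range out, bool elementwise) {
    if (!overlaps(in, out)) {
        return true;
    }
    if (in.begin == out.begin && in.end == out.end) {
        return elementwise;
    }
    return false;
}
```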
Amazingly, #17089 fixes this.
I spoke too soon.
When doing `-ot ".ffn_(down)_exps.=CPU"`, this kernel produces garbage output for tokens > 1. It may be related to CUDA graph capture when using `-ot`; I will try to investigate more. For now, this fixes it.
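Roughly, the guard this PR adds amounts to restricting the fused path to single-token graphs; the function name below is illustrative, not the exact one in the diff:

```cpp
#include <cstdint>

// Sketch only: take the fused MoE expert-reduce path solely for
// n_tokens == 1 (token generation); larger batches fall back to the
// unfused mul + add path, avoiding the garbage output seen with -ot overrides.
static bool use_moe_expert_reduce(std::int64_t n_tokens, bool fusion_enabled) {
    return fusion_enabled && n_tokens == 1;
}
```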