CUDA: only use moe_expert_reduce when n_tokens=1 #17032
base: master
Conversation
@slaren is there a way to detect that a buffer might be overridden?
No. I am not sure what you are trying to do, but what you are asking is something that the backend should not be concerned with.
I'm trying to turn off fusion if there is a buffer override.
That would be a workaround, not an actual solution. We need to find the source of the problem and fix that. I mentioned before that I suspect that
That's not the problem here, at least. I'm thinking it might be something to do with different sizes of the tensors between the mmq buffer and the CPU weights: mmq does some padding to avoid boundary checks internally.
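For context, a minimal sketch of that padding idea, with made-up names (`ROW_PAD`, `padded_row_elems`) rather than the real ggml-cuda identifiers: quantized rows in the device buffer are rounded up so the MMQ kernels can skip per-row boundary checks, which is why the device-side tensor can be larger than the exact-sized CPU weights.

```cpp
#include <cstddef>
#include <cstdio>

// Assumed padding granularity, in elements; the real constant may differ.
constexpr std::size_t ROW_PAD = 512;

// Round a row length up to the next multiple of ROW_PAD.
static std::size_t padded_row_elems(std::size_t ne0) {
    return ((ne0 + ROW_PAD - 1) / ROW_PAD) * ROW_PAD;
}

int main() {
    const std::size_t ne0 = 11008; // example FFN weight row length
    // The mmq buffer is sized for the padded row; the CPU weights are exact-sized.
    std::printf("logical: %zu elems, padded: %zu elems\n", ne0, padded_row_elems(ne0));
    return 0;
}
```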
Interestingly, this bug only manifests when there is
Can you please specify the repro more closely? Does it happen in the prefill phase or the token-generation phase? The default behavior for multi-GPU and split-GPU is that we split the cgraph into multiple subgraphs. This triggers the consecutive-update check, which effectively disables CUDA graphs from the 2nd/3rd call to the main model onwards (see the sketch below):
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 3550 to 3555 in 22c8c3c
Do you observe a repro when launching with
This is worrisome.
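A minimal sketch of the consecutive-update heuristic mentioned above, assuming hypothetical field names and threshold (the real logic lives in ggml-cuda.cu around the lines referenced):

```cpp
// Illustrative only: if the captured graph topology keeps changing between
// launches, re-capturing a CUDA graph every time costs more than it saves,
// so graph use is disabled after a few consecutive updates.
struct cuda_graph_state {
    int  consecutive_updates     = 0;
    bool disabled_due_to_updates = false;
};

constexpr int MAX_CONSECUTIVE_UPDATES = 4; // assumed threshold

void on_graph_launch(cuda_graph_state & g, bool topology_changed) {
    if (topology_changed) {
        if (++g.consecutive_updates >= MAX_CONSECUTIVE_UPDATES) {
            g.disabled_due_to_updates = true; // fall back to regular kernel launches
        }
    } else {
        g.consecutive_updates = 0; // a stable topology resets the counter
    }
}
```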
Yes, I can repro with

Also, it goes away with
I would recommend verifying via nsys/printf, but in that case it is not a CUDA graph issue.
Have you inspected the graph after it is split, but before it is fused? Maybe we split around/in the node pattern we match for in the fusion? Is fusion correctly disabled then?
The problem starts with llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 4137 to 4143 in 2759ccd
Although I don't have enough expertise in ggml graphs, they looked fine to me. Attaching both a good graph run (2x 4090) and a bad graph run (1x 4090 + 1x 5090) with
Yup, just spotted
Selectively offloading layer by layer from the back, I see the problem first occurs when offloading a layer between the 4090 and the 5090. EDIT: perhaps unsurprisingly, offloading just that layer also causes the same problem.
Okay, the issue is that llama-graph.cpp seems to recycle the weights tensor:

```diff
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index f9751b318..606d6ae1a 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -1119,6 +1119,8 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
         ggml_build_forward_expand(gf, cur_experts[i]);
     }
 
+    weights = ggml_dup(ctx0, weights);
+    ggml_build_forward_expand(gf, weights);
 
     // aggregate experts
     // note: here we explicitly use hparams.n_expert_used instead of n_expert_used
```
What you may be observing is that
This does not work. My understanding is that once the lifetime of a tensor ends, the allocator is free to use that memory in any way it chooses; in this case it can recycle the buffer used for the weights for dst, since the lifetime of weights ends there.
This is correct, although you can also look at what the auto in-place mechanism does as freeing the memory of a tensor on the last operation in which it is used (and reusing it for the tensor of that operation itself).
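A toy illustration of the hazard being described here, not ggml's actual allocator: once a tensor is considered dead after its last consumer, the planner may hand the same offset to that consumer's output, so a fused kernel that still reads the "dead" input while writing its output can end up reading its own partially written data.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical, simplified view of a planned allocation: name, byte offset, size.
struct planned_tensor {
    const char * name;
    std::size_t  offset;
    std::size_t  size;
};

int main() {
    // 'weights' is last used by the expert-aggregation node...
    planned_tensor weights = { "weights", 0, 256 };
    // ...so the planner is allowed to place that node's output at the same offset.
    planned_tensor dst     = { "dst",     0, 256 };

    const bool alias = dst.offset < weights.offset + weights.size &&
                       weights.offset < dst.offset + dst.size;
    std::printf("dst aliases weights: %s\n", alias ? "yes" : "no");
    return 0;
}
```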
I thought CUDA graphs are disabled in the repro

They are automatically disabled for batch_size > 1
I (at least theoretically) ran into this aliasing issue in a recent MR and added a check in the vulkan backend at https://github.com/ggml-org/llama.cpp/pull/16977/files#diff-35a5049d5eebe22eda1e0d661bd87639b31aafeba62deeaaaca9c13ec3e71d11R12903. If a fused operation is element-wise (or at least, the same thread or possibly workgroup overwrites the locations it reads), it ought to be safe to reuse the tensor memory if it overlaps exactly. So there are two cases where we ought to disable fusion: (1) memory is reused in a way where the inputs/output partially overlap (so one thread could clobber another thread's inputs), and (2) when the overlap is exact but input values are reused by different threads/workgroups.
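A sketch of those two cases as a standalone check, using hypothetical names (`range`, `fusion_allowed`) rather than the actual Vulkan backend code linked above:

```cpp
#include <cstddef>

// Byte range [begin, end) of a tensor within its backend buffer.
struct range {
    std::size_t begin;
    std::size_t end;
};

static bool overlaps(range a, range b) {
    return a.begin < b.end && b.begin < a.end;
}

// Fusion is safe if input and output are disjoint; if they alias exactly it is
// only safe for element-wise ops (each thread overwrites just what it read);
// any partial overlap means one thread can clobber another thread's inputs.
static bool fusion_allowed(range in, range out, bool elementwise) {
    if (!overlaps(in, out)) {
        return true;
    }
    if (in.begin == out.begin && in.end == out.end) {
        return elementwise;
    }
    return false;
}
```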
Amazingly, #17089 fixes this.
I spoke too soon.
When doing `-ot ".ffn_(down)_exps.=CPU"`, this kernel produces garbage output for tokens > 1. It may be related to CUDA graph capture when using `-ot`; I will try to investigate more. For now, this fixes it.
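Roughly, the guard this PR adds amounts to restricting the fused path to single-token graphs; the function name below is illustrative, not the exact one in the diff:

```cpp
#include <cstdint>

// Sketch only: take the fused MoE expert-reduce path solely for
// n_tokens == 1 (token generation); larger batches fall back to the
// unfused mul + add path, avoiding the garbage output seen with -ot overrides.
static bool use_moe_expert_reduce(std::int64_t n_tokens, bool fusion_enabled) {
    return fusion_enabled && n_tokens == 1;
}
```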