Split `mul_mm` GLSL shader into simt / coopmat1 parts #16811

lumina37 · 2025-10-28T04:57:05Z

It seems that `mul_mm` has three major implementations: simt, coopmat1, and coopmat2. However, simt and coopmat1 are combined in a single .comp file. I wonder if we can split them into two files to make future updates easier. I can handle this if you approve the idea, @jeffbolznv.
I'm currently working on a simt sgemm with more aggressive register usage, and its performance seems to be somewhat higher than the warp-tile method. You can check the draft implementation here: https://github.com/lumina37/vulkan-compute-demo/blob/master/shader/glsl/sgemm/dbg/rcc/v1.comp. But warp tiling is also used in CUTLASS, so I'm not sure whether the improvement is reasonable.

On my RTX 3070, the sgemm performance is as follows (all launch parameters have been tuned):

E:\code\vulkan\vulkan-compute-demo\cmake-build-debug\samples\vkc-bin-sgemm-dbg-rcc-v1.exe
Candidate physical device: NVIDIA GeForce RTX 3070. Vk API version: 1.4.312. Score: 147456
Candidate queue family: 0. Score: -4
Candidate queue family: 2. Score: -3
============================
Size: 1024
Dispatch timecost: 0.1994752 ms
Performace: 10.7657 (10.6836~10.8490) tflops
============================
Size: 2048
Dispatch timecost: 1.4907392 ms
Performace: 11.5244 (11.4972~11.5517) tflops
============================
Size: 3072
Dispatch timecost: 5.0623426 ms
Performace: 11.4536 (11.0264~11.9152) tflops
============================
Size: 4096
Dispatch timecost: 11.232684 ms
Performace: 12.2356 (11.6306~12.9071) tflops
============================
...

Compare with the ggml variant:

E:\code\vulkan\vulkan-compute-demo\cmake-build-debug\samples\vkc-bin-sgemm-dbg-rcc-ggml.exe
Candidate physical device: NVIDIA GeForce RTX 3070. Vk API version: 1.4.312. Score: 147456
Candidate queue family: 0. Score: -4
Candidate queue family: 2. Score: -3
============================
Size: 1024
Dispatch timecost: 0.3217408 ms
Performace: 6.6746 (6.6254~6.7245) tflops
============================
Size: 2048
Dispatch timecost: 1.8550783 ms
Performace: 9.2610 (9.2085~9.3141) tflops
============================
Size: 3072
Dispatch timecost: 6.0185084 ms
Performace: 9.6340 (9.6217~9.6462) tflops
============================
Size: 4096
Dispatch timecost: 14.092883 ms
Performace: 9.7524 (9.7164~9.7886) tflops
============================
...
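
As a cross-check on the two logs above, the reported throughput can be reproduced from the dispatch times. This is a minimal sketch, assuming each run is a square N x N x N SGEMM counted as 2·N³ floating-point operations; the function name `tflops` is made up for illustration and is not part of the benchmark tool.

```python
# Sanity check of the logged throughput, assuming a square N x N x N SGEMM
# that costs 2 * N^3 floating-point operations (one multiply and one add per term).

def tflops(size: int, time_ms: float) -> float:
    """Throughput in TFLOPS for a square SGEMM of the given size."""
    flops = 2.0 * size ** 3
    return flops / (time_ms * 1e-3) / 1e12

# (size, dispatch time in ms) copied from the two logs above
rcc_v1 = [(1024, 0.1994752), (2048, 1.4907392), (3072, 5.0623426), (4096, 11.232684)]
ggml = [(1024, 0.3217408), (2048, 1.8550783), (3072, 6.0185084), (4096, 14.092883)]

for (size, t_new), (_, t_old) in zip(rcc_v1, ggml):
    print(f"{size}: {tflops(size, t_new):.2f} vs {tflops(size, t_old):.2f} TFLOPS, "
          f"speedup {t_old / t_new:.2f}x")
```

Under that assumption the computed values agree with the logged figures, and the time ratio works out to roughly 1.6x at size 1024 and about 1.19x to 1.25x at the larger sizes.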

jeffbolznv · 2025-10-28T05:19:01Z

Collaborator

Now that the load-to-smem functions have been factored out, I think it would be reasonable to split mul_mm into separate files for scalar vs coopmat1. But I defer to @0cc4m.

What is the goal and target hardware for your optimizations? There's not much point in optimizing the non-coopmat GEMM for RTX 3070.

1 reply

lumina37 Oct 28, 2025
Author

The target hardware is a modern GPU that has a large enough register file (maybe 96KB or more) but does not support the coopmat extension. The goal is to optimize GEMM by using more registers. I know that increasing the number of registers per thread may lead to register spilling, and I still need to run more experiments on various devices.
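
The trade-off between register usage and spilling can be estimated with some quick arithmetic. The sketch below is only a rough model under stated assumptions: each thread keeps a `tm` x `tn` tile of accumulators plus one slice of A and one slice of B in 32-bit registers, the 96KB figure is treated as a 96 KiB register file per compute unit, and 256 resident threads share it evenly. The tile sizes and thread count are hypothetical, and real shaders also need registers for addresses and loop counters.

```python
# Rough register budget for a register-tiled SIMT SGEMM thread.
# All numbers here are illustrative assumptions, not measurements.

def tile_cost(tm: int, tn: int) -> dict:
    """Per-thread cost of a tm x tn output tile for one step along K."""
    accumulators = tm * tn          # running sums kept in registers
    fragments = tm + tn             # one slice of A (tm values), one slice of B (tn values)
    fmas = tm * tn                  # fused multiply-adds per K step
    return {
        "registers": accumulators + fragments,
        "fma_per_load": fmas / fragments,
    }

def regs_per_thread(reg_file_bytes: int, resident_threads: int) -> int:
    """32-bit registers available per thread if the file is shared evenly."""
    return reg_file_bytes // 4 // resident_threads

budget = regs_per_thread(96 * 1024, 256)   # assumed: 96 KiB file, 256 resident threads
for tm, tn in [(4, 4), (8, 8), (8, 16)]:
    cost = tile_cost(tm, tn)
    fits = "fits" if cost["registers"] <= budget else "likely spills"
    print(f"{tm}x{tn}: {cost['registers']} regs, "
          f"{cost['fma_per_load']:.2f} FMA per loaded value, {fits} (budget {budget})")
```

A larger tile raises the number of multiply-adds per value loaded, which is where the extra performance would come from, but the accumulator count grows with the tile area, so the budget is exhausted quickly.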
