Now that the load-to-smem functions have been factored out, I think it would be reasonable to split mul_mm into separate files for scalar vs coopmat1. But I defer to @0cc4m. What is the goal and target hardware for your optimizations? There's not much point in optimizing the non-coopmat GEMM for RTX 3070.
It seems that mul_mm has three major implementations: simt / coopmat1 / coopmat2. But simt and cm1 are combined in a single .comp file. I wonder if we can split them into two files to make future updates easier. I can handle this if you approve the idea, @jeffbolznv.

I'm actually working on a simt sgemm right now. With more aggressive register usage, the performance seems to be a bit higher than the warp-tile method. You can check the draft impl here: https://github.com/lumina37/vulkan-compute-demo/blob/master/shader/glsl/sgemm/dbg/rcc/v1.comp. However, warp tiling is also used in CUTLASS, so I'm not sure whether the improvement is reasonable.
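For anyone not familiar with what "more aggressive register usage" usually means in a simt GEMM, here is a minimal CPU-side sketch of register tiling in general. It is not the linked v1.comp shader and not the mul_mm code; the function names and the `tm`/`tn` tile sizes are hypothetical, picked only for illustration. Each iteration of the two outer loops stands in for one GPU invocation that keeps a `tm` x `tn` block of accumulators in registers and updates it with one outer product per k-step.

```python
def sgemm_naive(a, b, m, n, k):
    """Reference GEMM: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def sgemm_register_tiled(a, b, m, n, k, tm=4, tn=4):
    """Register-tiled GEMM emulated on the CPU (tm, tn are assumed tile sizes)."""
    c = [[0.0] * n for _ in range(m)]
    # Each (row0, col0) pair plays the role of one GPU invocation.
    for row0 in range(0, m, tm):
        for col0 in range(0, n, tn):
            rows = min(tm, m - row0)  # clamp the tile at the matrix edge
            cols = min(tn, n - col0)
            # Accumulators that would live in registers on a GPU.
            acc = [[0.0] * cols for _ in range(rows)]
            for p in range(k):
                # Load a column fragment of A and a row fragment of B once...
                a_frag = [a[row0 + i][p] for i in range(rows)]
                b_frag = [b[p][col0 + j] for j in range(cols)]
                # ...then reuse them for rows * cols multiply-adds.
                for i in range(rows):
                    for j in range(cols):
                        acc[i][j] += a_frag[i] * b_frag[j]
            # Write the finished tile back to C.
            for i in range(rows):
                for j in range(cols):
                    c[row0 + i][col0 + j] = acc[i][j]
    return c
```

In this sketch each k-step loads `tm + tn` values and performs `tm * tn` multiply-adds, so a larger tile gives more arithmetic per load at the cost of more live registers per invocation.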
On my 3070 the sgemm performance is (all the launch params have been fine-tuned):
Compare with