
Conversation

@reeselevine
Collaborator

Adds the following:

  • Two matrix multiplication implementations: one using register tiling, the other using "subgroup matrices" (WebGPU's feature for accessing tensor cores/optimized subgroup (warp) routines on devices that have them). Subgroup matrices are currently experimental, and on devices where they're not supported the code falls back to the register tiling approach (a rough sketch of the feature check follows this list).
  • A somewhat faster matrix-vector multiplication (it still needs some work, but I think it's a decent start).
  • Support for f32/f16/q4_0 in this code, set up in a way that I think will make integrating other quantization types easier.
  • An update to the Dawn version the WebGPU backend is built against.
  • A move to a new format for pipeline initialization, with the eventual goal of making initialization lazy/smarter so we don't carry around a ton of compiled shaders that are never used in the browser.

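For context, here is a minimal sketch of how the subgroup-matrix/register-tiling selection could look at initialization time using Dawn's C++ API. The feature enum spelling and the shader names are assumptions for illustration, not the exact code in this PR.

```cpp
#include <webgpu/webgpu_cpp.h>

// Sketch only: pick the matmul shader variant based on adapter support.
// The exact experimental feature name is an assumption; check Dawn's
// headers for the current (Chromium-prefixed) spelling.
static bool ggml_webgpu_supports_subgroup_matrix(const wgpu::Adapter & adapter) {
    return adapter.HasFeature(wgpu::FeatureName::ChromiumExperimentalSubgroupMatrix);
}

static const char * ggml_webgpu_pick_mul_mat_shader(const wgpu::Adapter & adapter) {
    if (ggml_webgpu_supports_subgroup_matrix(adapter)) {
        return "mul_mat_subgroup_matrix.wgsl"; // tensor-core-style fast path
    }
    return "mul_mat_reg_tile.wgsl";            // portable register-tiling fallback
}
```
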
Some preliminary performance numbers on my M3:
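
These appear to be llama-bench runs, whose defaults already cover the pp512/tg128 tests shown below; a representative invocation (model path assumed for illustration) would be:

```sh
# pp512 = prompt processing of 512 tokens, tg128 = generation of 128 tokens
# (both are llama-bench defaults); -ngl 99 offloads all layers to the GPU.
./llama-bench -m Llama-3.2-1B-Instruct-F16.gguf -ngl 99
```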

Llama-3.2-1B-Instruct-F16

WebGPU:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | WebGPU     |  99 |           pp512 |       1014.17 ± 9.38 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | WebGPU     |  99 |           tg128 |         28.71 ± 0.19 |

Metal:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | Metal      |  99 |           pp512 |       1368.47 ± 0.95 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | Metal      |  99 |           tg128 |         35.99 ± 0.78 |

Llama-3.2-1B-Instruct-Q4_0

WebGPU:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | WebGPU     |  99 |           pp512 |        960.52 ± 6.05 |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | WebGPU     |  99 |           tg128 |         41.76 ± 0.62 |

Metal:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | Metal      |  99 |           pp512 |       1346.68 ± 1.21 |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | Metal      |  99 |           tg128 |        103.92 ± 0.37 |

Add fast matrix and matrix/vector multiplication.
@reeselevine reeselevine requested a review from CISC as a code owner November 5, 2025 17:02
@github-actions github-actions bot added the python, devops, and ggml labels Nov 5, 2025
@CISC CISC left a comment
Collaborator

Who can/should review the webgpu part?

@reeselevine
Collaborator Author

> Who can/should review the webgpu part?

@CISC perhaps the other person in this repository with the most context on web stuff in general is @ngxson, but I'm not sure how busy they are. Otherwise, it would be great to get one or two more serious collaborators on the WebGPU backend, but I'm not sure who that would be at the moment. I think we're getting closer to a demo of llama.cpp running in the browser with WebGPU integration, so if that is publicized a bit when it happens, maybe it'll lead to more interest in helping out.

@CISC
Collaborator

CISC commented Nov 6, 2025

> Who can/should review the webgpu part?
>
> @CISC perhaps the other person in this repository with the most context on web stuff in general is @ngxson, but I'm not sure how busy they are. Otherwise, it would be great to get one or two more serious collaborators on the WebGPU backend, but I'm not sure who that would be at the moment.

Yes, it's always a little tricky when only one person has knowledge of a piece of the codebase.

> I think we're getting closer to a demo of llama.cpp running in the browser with WebGPU integration, so if that is publicized a bit when it happens, maybe it'll lead to more interest in helping out.

For sure, drumming up some publicity here and on the usual channels like LocalLlama, etc. when the time comes is a given. @ggerganov, make a mental note. :)

@ngxson ngxson left a comment
Collaborator

I think having a second contributor dedicated to WebGPU would be nice. Personally, I know more about web development in general; I'm not particularly good at WebGPU specifics.

Re. the introduction of ggml_webgpu_process_shader_repls in this PR: it's probably not necessary, as shaders/kernels are compiled statically into different versions on other backends too. These shaders are often small, so I think we should keep compiling them statically for simplicity.

@reeselevine
Collaborator Author

> Re. the introduction of ggml_webgpu_process_shader_repls in this PR: it's probably not necessary, as shaders/kernels are compiled statically into different versions on other backends too. These shaders are often small, so I think we should keep compiling them statically for simplicity.

The reason I added this is that some values, e.g. WEBGPU_MUL_MAT_SUBGROUP_MATRIX_M, cannot be WGSL override constants due to constraints in current WGSL compilers (which I think could be improved). So, to avoid defining these values in two places and having to manually keep them in sync, this function guarantees they match.
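
To make that concrete, here is a hedged sketch of what such a replacement pass can look like; the placeholder token syntax and the map-based interface are invented for this example and are not necessarily what the PR implements.

```cpp
#include <map>
#include <string>

// Sketch: substitute host-side constants into WGSL source before compilation,
// so values that cannot be WGSL override constants are still defined in
// exactly one place (the C++ side).
static std::string ggml_webgpu_process_shader_repls(
        std::string shader, const std::map<std::string, std::string> & repls) {
    for (const auto & [key, value] : repls) {
        const std::string token = "{{" + key + "}}"; // made-up placeholder syntax
        size_t pos = shader.find(token);
        while (pos != std::string::npos) {
            shader.replace(pos, token.size(), value);
            pos = shader.find(token, pos + value.size());
        }
    }
    return shader;
}

// Usage: one constant drives both the C++ dispatch logic and the shader text.
// std::map<std::string, std::string> repls = {
//     { "WEBGPU_MUL_MAT_SUBGROUP_MATRIX_M", std::to_string(8) },
// };
// std::string wgsl = ggml_webgpu_process_shader_repls(raw_wgsl, repls);
```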

I think there's a bigger question around how and when to generate shaders, which I'm open to ideas/feedback on. It probably makes sense to move away from the Python script at some point, which I made just for early iteration speed, and use a C++ solution like what Vulkan does.

@reeselevine
Collaborator Author

I also just want to drop a quick acknowledgement here to @SharmaRithik, @xuyanwen2012, @Ant-28, and @tyler-utah, who helped write and design the infrastructure that made these shaders possible.

@reeselevine reeselevine merged commit 647b960 into ggml-org:master Nov 8, 2025
65 of 70 checks passed