ggml webgpu: faster matrix multiplication/matrix-vector multiplication #17031
Conversation
Add fast matrix and matrix/vector multiplication.
CISC left a comment:
Who can/should review the webgpu part?
@CISC Perhaps the person in this repository with the most context on web stuff in general is @ngxson, but I'm not sure how busy they are. Otherwise, it would be great to get one or two more serious collaborators on the WebGPU backend, though I'm not sure who that would be at the moment. I think we're getting closer to a demo of llama.cpp running in the browser with WebGPU, so if that gets publicized a bit when it happens, maybe it'll lead to more interest in helping out.
Yes, it's always a little tricky when only one person has knowledge of a piece of the codebase.
For sure, drumming up some publicity here and on the usual channels like LocalLlama etc. when the time comes is a given. @ggerganov make a mental note. :)
I think having a second contributor dedicated to WebGPU would be nice. Personally, I know more about web development in general and am not particularly good at WebGPU-specific stuff.
Re. the introduction of ggml_webgpu_process_shader_repls in this PR: it's probably not necessary, since shaders/kernels are compiled statically into different versions on other backends too. These shaders are often small, so I think we should keep compiling them statically for simplicity.
The reason I added this is that some shader values aren't known until runtime. More broadly, I think there's a bigger question around how/when to generate shaders, which I'm open to ideas/feedback on. It probably makes sense to move away from the Python script at some point, which I wrote just for early iteration speed, and use a C++ solution like what Vulkan does.
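To make the discussion concrete, here is a minimal sketch of what such a replacement pass can look like. The helper name, placeholder syntax, and signature below are illustrative assumptions; the actual ggml_webgpu_process_shader_repls in this PR may differ.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a shader-replacement pass: substitute
// placeholders in a WGSL template with values only known at runtime
// (e.g. sizes derived from device limits). The real
// ggml_webgpu_process_shader_repls may differ in signature and behavior.
static std::string process_shader_repls(
        std::string shader,
        const std::vector<std::pair<std::string, std::string>> & repls) {
    for (const auto & [key, value] : repls) {
        std::size_t pos = 0;
        while ((pos = shader.find(key, pos)) != std::string::npos) {
            shader.replace(pos, key.size(), value);
            pos += value.size();
        }
    }
    return shader;
}

// Usage sketch: fill in a workgroup size chosen at initialization time.
// std::string src = process_shader_repls(wgsl_template,
//     {{"{{WG_SIZE}}", std::to_string(wg_size)}});
```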
I also just want to drop a quick acknowledgement here to @SharmaRithik, @xuyanwen2012, @Ant-28, and @tyler-utah, who helped write and design the infrastructure that made these shaders possible.
Adds fast matrix multiplication and matrix-vector multiplication kernels to the WebGPU backend.
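As background on why tiling makes matrix multiplication faster (better data reuse per memory access), the sketch below shows generic tile-blocked matmul in CPU-side C++. It is only an illustration of the technique, not the PR's actual WGSL shaders, which run per-workgroup on the GPU and differ substantially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Generic illustration of the tiling technique behind fast GPU matmul
// kernels, written as CPU-side C++ rather than WGSL: the output is
// computed in TILE x TILE blocks, mirroring how a GPU workgroup stages
// tiles of A and B before its inner-product loop. C must be
// zero-initialized by the caller.
constexpr std::size_t TILE = 16;

void matmul_tiled(const std::vector<float> & A,   // M x K, row-major
                  const std::vector<float> & B,   // K x N, row-major
                  std::vector<float> & C,         // M x N, row-major
                  std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
            for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
                // One tile: accumulate the partial product of
                // A[i0:i0+TILE, k0:k0+TILE] and B[k0:k0+TILE, j0:j0+TILE].
                const std::size_t i1 = std::min(i0 + TILE, M);
                const std::size_t j1 = std::min(j0 + TILE, N);
                const std::size_t k1 = std::min(k0 + TILE, K);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```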
Some preliminary performance numbers on my M3 (the benchmark tables themselves are not reproduced here):

Llama-3.2-1B-Instruct-F16: WebGPU vs. Metal
Llama-3.2-1B-Instruct-Q4_0: WebGPU vs. Metal