Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support #16900

0cc4m · 2025-10-31T17:55:36Z

Add k-quant mul_mat_vec support, and enable MUL_MAT_ID integer dot vector path.
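
For readers new to the integer dot path: the sketch below emulates on the CPU what a packed 4x int8 dot product with integer accumulation computes (the branch name mentions dp4a). It is not the shader code from this PR. The helper names `pack4_i8`, `dot4_i8` and `quantized_dot` are hypothetical, and using one float scale per block is a simplifying assumption; real k-quant blocks carry sub-block scales and, for some types, minimums.

```python
def pack4_i8(vals):
    """Pack four signed 8-bit ints into one 32-bit word (lane 0 in the low byte)."""
    assert len(vals) == 4
    word = 0
    for i, v in enumerate(vals):
        assert -128 <= v <= 127
        word |= (v & 0xFF) << (8 * i)
    return word


def dot4_i8(a_packed, b_packed, acc=0):
    """Emulate a packed 4x int8 dot product added to an integer accumulator."""
    for i in range(4):
        a = (a_packed >> (8 * i)) & 0xFF
        b = (b_packed >> (8 * i)) & 0xFF
        # Sign-extend each byte back to a signed 8-bit value.
        a = a - 256 if a >= 128 else a
        b = b - 256 if b >= 128 else b
        acc += a * b
    return acc


def quantized_dot(wq, w_scale, xq, x_scale):
    """Dot product of two int8 blocks: integer accumulate, float scaling once at the end.

    Assumption: a single scale per block on each side (illustration only).
    """
    assert len(wq) == len(xq) and len(wq) % 4 == 0
    acc = 0
    for i in range(0, len(wq), 4):
        acc = dot4_i8(pack4_i8(wq[i:i + 4]), pack4_i8(xq[i:i + 4]), acc)
    return w_scale * x_scale * acc
```

The point of the layout is that the inner loop stays in integer arithmetic and the floating-point scales are applied once per block, which is where the vector path can save work compared to dequantizing every weight first.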

Tuning this is quite difficult. I've included an attempt, but I'm not done. I'll add performance numbers later.

Q3_K and Q6_K currently don't work well at all; I'm still trying to figure out why.

0cc4m · 2025-11-01T11:47:44Z

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 63.49 ± 0.20 | 71.40 ± 0.24 | 83.84 ± 0.26 | +17.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.74 ± 0.12 | 67.75 ± 0.09 | 78.96 ± 0.20 | +16.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 48.80 ± 0.08 | 60.59 ± 0.14 | 59.91 ± 0.24 | -1.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.47 ± 0.44 | 58.06 ± 0.11 | 57.43 ± 0.04 | -1.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.92 ± 0.15 | 72.60 ± 0.17 | 76.77 ± 0.24 | +5.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 67.66 ± 0.18 | 69.41 ± 0.12 | 72.90 ± 0.19 | +5.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.10 ± 0.16 | 19.11 ± 0.09 | 24.50 ± 0.16 | +28.2% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.00 ± 0.05 | 18.24 ± 0.21 | 23.61 ± 0.22 | +29.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.04 ± 0.02 | 90.66 ± 0.17 | 87.32 ± 0.46 | -3.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.24 ± 0.10 | 86.01 ± 5.01 | 86.50 ± 0.53 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 67.68 ± 0.06 | 82.89 ± 0.22 | 85.36 ± 0.61 | +3.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 70.80 ± 0.03 | 75.71 ± 0.17 | 77.52 ± 0.12 | +2.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 107.99 ± 0.65 | 127.26 ± 0.27 | 128.89 ± 0.75 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 114.36 ± 0.11 | 125.49 ± 0.07 | 126.27 ± 0.37 | +0.6% |

AMD Radeon RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 93.30 ± 0.25 | 115.95 ± 3.40 | 122.98 ± 0.14 | +6.1% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 95.99 ± 0.11 | 109.65 ± 1.76 | 113.62 ± 0.02 | +3.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 75.50 ± 0.01 | 93.13 ± 0.05 | 90.81 ± 0.01 | -2.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 77.68 ± 0.00 | 88.41 ± 0.04 | 86.52 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 101.67 ± 0.04 | 148.71 ± 0.08 | 151.96 ± 0.03 | +2.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 106.92 ± 0.01 | 136.12 ± 0.39 | 137.91 ± 0.04 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 120.05 ± 0.05 | 145.28 ± 0.05 | 145.86 ± 0.02 | +0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 124.10 ± 0.00 | 142.70 ± 0.06 | 143.23 ± 0.04 | +0.4% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 29.90 ± 0.32 | 44.53 ± 0.74 | +48.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 19.55 ± 0.01 | 26.37 ± 0.00 | +34.9% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.91 ± 0.01 | 15.92 ± 0.02 | +0.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.52 ± 0.03 | 12.56 ± 0.01 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.36 ± 0.04 | 47.72 ± 0.05 | +24.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 29.89 ± 0.01 | 34.91 ± 0.02 | +16.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.00 ± 0.01 | 14.29 ± 1.43 | +19.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.46 ± 0.02 | 11.90 ± 0.34 | +13.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.88 ± 2.27 | 49.79 ± 5.03 | +6.2% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 47.69 ± 0.42 | 51.01 ± 0.11 | +7.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 43.62 ± 0.04 | 41.81 ± 0.21 | -4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 28.22 ± 0.05 | 28.22 ± 0.01 | +0.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.94 ± 0.03 | 39.25 ± 0.02 | +64.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.87 ± 0.05 | 36.10 ± 0.01 | +57.8% |

RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 138.00 ± 0.66 | 114.32 ± 0.45 | 112.74 ± 0.36 | -1.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 136.82 ± 0.35 | 116.74 ± 0.35 | 114.95 ± 0.29 | -1.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 105.80 ± 0.29 | 98.13 ± 0.18 | 95.82 ± 0.58 | -2.4% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 105.10 ± 0.27 | 100.27 ± 0.37 | 96.59 ± 0.37 | -3.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 145.41 ± 0.43 | 123.22 ± 0.41 | 121.58 ± 2.54 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 144.52 ± 0.09 | 125.32 ± 0.18 | 126.04 ± 0.19 | +0.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.59 ± 0.03 | 38.82 ± 0.63 | 41.02 ± 0.18 | +5.7% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.44 ± 0.06 | 39.31 ± 0.14 | 41.31 ± 0.09 | +5.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.75 ± 0.46 | 143.90 ± 0.91 | 145.12 ± 1.67 | +0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 141.72 ± 0.44 | 144.40 ± 0.24 | 145.24 ± 0.20 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 165.61 ± 1.53 | 151.74 ± 7.18 | 153.97 ± 0.99 | +1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 162.49 ± 0.32 | 159.56 ± 1.25 | 159.13 ± 0.85 | -0.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 205.45 ± 1.12 | 153.52 ± 12.40 | 160.16 ± 17.99 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 210.33 ± 0.86 | 159.12 ± 0.81 | 172.44 ± 0.27 | +8.4% |
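
A note on reading the diff column: the values look consistent with the relative change of the "after" mean over the "before" mean. That reading is inferred from the numbers, not stated in the comment. A minimal sketch, with `percent_diff` as a hypothetical helper name:

```python
def percent_diff(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Radeon Pro VII, llama 8B Q2_K, fa=0: 71.40 -> 83.84 t/s
radeon_q2k = round(percent_diff(71.40, 83.84), 1)

# RTX 3090, llama 8B Q2_K, fa=0: 114.32 -> 112.74 t/s
rtx3090_q2k = round(percent_diff(114.32, 112.74), 1)
```

Both rounded values match the table entries for those rows (+17.4% and -1.4%). The standard deviations are ignored here, so small diffs should be read against the ± figures in the same row.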

jeffbolznv · 2025-11-01

I only did a quick read through. I'll do some perf testing soon.

ggml/src/ggml-vulkan/ggml-vulkan.cpp

0cc4m · 2025-11-02T09:23:24Z

As usual, I appear to have caused an llvmpipe issue. I'll look into it.

jeffbolznv · 2025-11-02T19:25:27Z

Some initial perf results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       239.48 ± 11.34 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        201.44 ± 7.81 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        129.84 ± 4.07 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       872.67 ± 15.33 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       845.99 ± 13.20 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       391.09 ± 24.08 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       265.33 ± 14.59 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       251.59 ± 17.44 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       305.19 ± 28.81 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       301.64 ± 24.09 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       356.71 ± 17.34 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        273.06 ± 2.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       317.10 ± 15.70 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         91.93 ± 0.22 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.29 ± 0.22 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.03 ± 1.52 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         70.20 ± 0.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         48.53 ± 0.66 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       431.26 ± 28.74 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       397.86 ± 23.85 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.72 ± 3.56 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       153.41 ± 10.78 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        103.66 ± 3.49 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       173.04 ± 12.22 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         37.22 ± 0.54 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        159.48 ± 1.35 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.88 ± 0.43 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.48 ± 0.54 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       238.12 ± 12.03 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        202.69 ± 5.07 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.12 ± 4.19 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       855.76 ± 15.46 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      641.24 ± 260.16 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       396.68 ± 14.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        264.39 ± 8.21 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       250.60 ± 18.72 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       317.92 ± 10.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       325.54 ± 12.60 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       358.63 ± 16.21 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        277.27 ± 4.62 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.73 ± 7.12 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.43 ± 2.13 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.05 ± 0.23 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.30 ± 0.94 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         71.16 ± 0.26 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         49.35 ± 0.18 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        461.59 ± 1.94 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        420.99 ± 1.95 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.92 ± 2.62 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        152.94 ± 8.52 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        106.06 ± 3.89 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       178.63 ± 16.11 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         41.86 ± 1.68 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        160.77 ± 1.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.78 ± 1.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.95 ± 0.12 |

I reran some of the models with the biggest deltas. Most seem to be noise, except the improvement for gpt-oss MXFP4 is real:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       314.61 ± 23.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        323.84 ± 1.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        322.33 ± 2.26 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        319.46 ± 2.80 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        318.55 ± 3.96 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        332.90 ± 5.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.56 ± 0.96 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.42 ± 7.14 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.52 ± 6.45 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.98 ± 1.17 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       327.08 ± 19.41 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.18 ± 5.79 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        339.58 ± 3.17 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        338.76 ± 2.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        337.12 ± 5.83 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        132.41 ± 3.78 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.42 ± 0.73 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.74 ± 0.18 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.36 ± 0.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.26 ± 0.30 |

after:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       331.53 ± 16.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        335.87 ± 1.67 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.85 ± 4.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.90 ± 2.64 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        333.53 ± 3.58 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.99 ± 2.56 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.84 ± 1.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.21 ± 5.07 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.78 ± 6.82 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.95 ± 1.13 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       321.82 ± 31.23 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        329.96 ± 4.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        335.48 ± 2.55 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.77 ± 6.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.00 ± 5.05 |

build: b153aac38 (6921)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.75 ± 3.42 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.28 ± 0.68 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.52 ± 0.39 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.62 ± 0.41 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.60 ± 0.40 |
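
One rough way to separate noise from a real change in these reruns is to compare the shift in the mean against the combined spread of the two runs. The sketch below is only a heuristic and an assumption on my part, not the method used for the judgement above; `delta_over_noise` is a hypothetical helper, and treating the ± values as independent standard deviations is a simplification.

```python
import math


def delta_over_noise(before_mean, before_sd, after_mean, after_sd):
    """Mean shift divided by the combined standard deviation of both runs."""
    combined_sd = math.sqrt(before_sd ** 2 + after_sd ** 2)
    return (after_mean - before_mean) / combined_sd


# Second rerun rows on the RTX 5090, taken from the tables above.
gpt_oss = delta_over_noise(323.84, 1.17, 335.87, 1.67)      # gpt-oss 20B MXFP4
llama_q8 = delta_over_noise(333.56, 0.96, 333.84, 1.31)     # llama 3B Q8_0

# First full run for gpt-oss, where the spread was much larger.
gpt_oss_first = delta_over_noise(301.64, 24.09, 325.54, 12.60)
```

With these inputs the gpt-oss rerun shift is several times the combined spread, the llama 3B Q8_0 shift is a small fraction of it, and the first gpt-oss run falls inside its own noise, which is why the reruns were needed to tell them apart.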

0cc4m · 2025-11-07T19:53:04Z

> Most seem to be noise, except the improvement for gpt-oss MXFP4 is real

The funny thing about that is that I didn't even enable the MMVQ path for Nvidia Turing+ on MXFP4. Not sure what is going on there.

I still have some tuning to do here; my Strix Halo device isn't liking this PR yet.

DajanaV mentioned this pull request Oct 31, 2025

UPSTREAM PR #16900: Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support auroralabs-loci/llama.cpp#25

Closed

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 31, 2025

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from d5192bf to d2f8f00 Compare November 1, 2025 11:31

0cc4m marked this pull request as ready for review November 1, 2025 11:47

0cc4m requested a review from jeffbolznv November 1, 2025 11:48

jeffbolznv reviewed Nov 1, 2025


0cc4m added 11 commits November 7, 2025 19:56

63145b2 vulkan: split mul_mmq_funcs for mul_mat_vecq use
023248d add mxfp4 mmvq
0c470f6 add q2_k mmvq
e1def53 add q3_k mmvq
ce90458 add q4_k and q5_k mmvq
761bcd2 add q6_k mmvq
315908c handle 4x4 quants per mmvq thread
437cfbc enable MUL_MAT_ID mmvq support
1185533 enable subgroup optimizations for mul_mat_vec_id shaders
434c6d3 device tuning
1b78909 request prealloc_y sync after quantization

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from b153aac to 1b78909 Compare November 7, 2025 19:51
