Releases · ggml-org/llama.cpp
b6987
CUDA: skip fusion for repeating adds in bias (#17080)
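For context, folding a bias add into a matmul epilogue is only safe when the bias is applied exactly once per output element; if the add's operand is smaller than the output and therefore repeats (broadcasts) across higher dimensions, the fused path would apply the wrong values. The sketch below illustrates that shape check in plain C++; the `Tensor` struct and `can_fuse_bias_add` are hypothetical names, not llama.cpp's actual CUDA code.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical standalone sketch; Tensor and can_fuse_bias_add are
// illustrative names, not llama.cpp's actual types.
struct Tensor {
    int64_t ne[4]; // elements per dimension, ggml-style
};

// Fusing the add into the matmul epilogue is only safe when the bias
// matches the output shape in every dimension; if it is smaller in any
// dimension, the add repeats (broadcasts), and the fused epilogue would
// apply the wrong bias values, so fusion must be skipped.
static bool can_fuse_bias_add(const Tensor & out, const Tensor & bias) {
    for (int d = 0; d < 4; ++d) {
        if (bias.ne[d] != out.ne[d]) {
            return false; // repeating add: use the separate add kernel
        }
    }
    return true;
}

int main() {
    Tensor out  = {{4096, 32, 8, 1}};
    Tensor bias = {{4096,  1, 1, 1}}; // repeats across dims 1 and 2
    printf("fuse: %d\n", can_fuse_bias_add(out, bias)); // prints 0
}
```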
b6986
vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636)
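BK here is the K-dimension tile size of the shader's blocked matrix multiply: larger K-tiles amortize memory loads at the cost of more shared memory per workgroup. Below is a minimal CPU analogue of K-blocking, purely illustrative rather than the shader code itself.

```cpp
#include <vector>
#include <cstdio>

constexpr int BK = 32; // K-tile: how many k-elements each pass consumes

// C[m][n] += A[m][k] * B[k][n], processing K in blocks of BK so each
// tile of A and B stays hot in cache (shared memory, in the shader).
void mul_mm_blocked(const std::vector<float> & A, const std::vector<float> & B,
                    std::vector<float> & C, int M, int N, int K) {
    for (int k0 = 0; k0 < K; k0 += BK) {
        const int kend = k0 + BK < K ? k0 + BK : K;
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                float sum = 0.0f;
                for (int k = k0; k < kend; ++k) {
                    sum += A[m*K + k] * B[k*N + n];
                }
                C[m*N + n] += sum;
            }
        }
    }
}

int main() {
    const int M = 4, N = 4, K = 64;
    std::vector<float> A(M*K, 1.0f), B(K*N, 1.0f), C(M*N, 0.0f);
    mul_mm_blocked(A, B, C, M, N, K);
    printf("C[0] = %.1f\n", C[0]); // expect 64.0
}
```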
b6985
ggml: disable vxe for cross-compilation by default (#16966) Otherwise, compilation fails because -mvx and -mzvector are enabled without the corresponding -march options being set.
b6984
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.
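Fusions like this generally work by pattern-matching a linear chain of ops in the compute graph, where each op's output feeds only the next op, so the whole chain can run as a single pass. Here is a hedged sketch of such a matcher; the `Node`/`Op` types and `matches_chain` are illustrative, not ggml's actual graph API.

```cpp
#include <vector>
#include <cstdio>

// Illustrative op tags and node layout; not ggml's real definitions.
enum Op { RMS_NORM, MUL, ROPE, VIEW, SET_ROWS, OTHER };

struct Node {
    Op op;
    const Node * src0; // first input
};

// Returns true if nodes[i..] matches the fusable chain, with each op
// consuming the previous one as its first input (a linear chain, so the
// intermediate results are not needed elsewhere).
bool matches_chain(const std::vector<Node> & nodes, size_t i,
                   const std::vector<Op> & pattern) {
    if (i + pattern.size() > nodes.size()) return false;
    for (size_t j = 0; j < pattern.size(); ++j) {
        if (nodes[i + j].op != pattern[j]) return false;
        if (j > 0 && nodes[i + j].src0 != &nodes[i + j - 1]) return false;
    }
    return true;
}

int main() {
    std::vector<Node> g(5);
    const Op ops[5] = {RMS_NORM, MUL, ROPE, VIEW, SET_ROWS};
    for (int i = 0; i < 5; ++i) {
        g[i].op   = ops[i];
        g[i].src0 = i > 0 ? &g[i - 1] : nullptr;
    }
    printf("fusable: %d\n", matches_chain(g, 0, {RMS_NORM, MUL, ROPE, VIEW, SET_ROWS}));
}
```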
b6983
vulkan: Fix test-thread-safety crashes (#17024) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted into at the same time, so accesses need to hold the lock. To be safe, the lock is held for all of ggml_vk_load_shaders.
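The underlying rule is general: std::map is not safe for a concurrent find-and-insert, since both operations touch the tree's internals, so every access path must take the same mutex. A simplified stand-in (the names, such as `get_or_create_pipeline`, are hypothetical):

```cpp
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <cstdio>

// Simplified stand-in for a pipeline cache like the one guarded in
// ggml_vk_load_shaders; not the actual llama.cpp code.
static std::map<std::string, int> pipeline_cache;
static std::mutex cache_mutex;

// Both lookup and insertion mutate the map's internal tree, so a reader
// racing an inserter is undefined behavior; a single lock covering the
// whole find-or-create makes the operation atomic.
int get_or_create_pipeline(const std::string & key) {
    std::lock_guard<std::mutex> lock(cache_mutex);
    auto it = pipeline_cache.find(key);
    if (it != pipeline_cache.end()) {
        return it->second;
    }
    const int handle = (int) pipeline_cache.size(); // stand-in for shader compilation
    pipeline_cache.emplace(key, handle);
    return handle;
}

int main() {
    std::thread t1([]{ get_or_create_pipeline("fa_f32_f16"); });
    std::thread t2([]{ get_or_create_pipeline("fa_f32_f16"); });
    t1.join(); t2.join();
    printf("cached pipelines: %zu\n", pipeline_cache.size()); // prints 1
}
```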
b6982
CUDA: fix MMQ stream-k fixup ne1 indices (#17089)
b6981
ggml webgpu: faster matrix multiplication/matrix-vector multiplication …
b6980
CUDA: properly handle nb00=nb02 case for cpy (#17081)
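ggml tensors carry per-dimension byte strides (nb), and a copy kernel walks those strides rather than assuming contiguous memory; layouts where two strides coincide, as in the nb00 == nb02 case here, can make different index tuples map to overlapping addresses and so need dedicated handling. Below is a minimal stride-driven copy on the CPU, with an illustrative `View` struct rather than ggml's real types.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Minimal stride-driven copy in the spirit of ggml's cpy: ne[] counts
// elements per dimension, nb[] gives byte strides per dimension. When
// strides coincide (e.g. nb[0] == nb[2]), distinct index tuples can
// address the same bytes, which is why such layouts need special care
// in the real kernel.
struct View {
    char *  data;
    int64_t ne[3];
    int64_t nb[3];
};

void cpy_f32(const View & src, View & dst) {
    for (int64_t i2 = 0; i2 < src.ne[2]; ++i2)
    for (int64_t i1 = 0; i1 < src.ne[1]; ++i1)
    for (int64_t i0 = 0; i0 < src.ne[0]; ++i0) {
        const char * s = src.data + i0*src.nb[0] + i1*src.nb[1] + i2*src.nb[2];
        char *       d = dst.data + i0*dst.nb[0] + i1*dst.nb[1] + i2*dst.nb[2];
        memcpy(d, s, sizeof(float));
    }
}

int main() {
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7}, b[8] = {0};
    View src = {(char *) a, {2, 2, 2}, {4, 8, 16}}; // contiguous f32 strides
    View dst = {(char *) b, {2, 2, 2}, {4, 8, 16}};
    cpy_f32(src, dst);
    printf("b[7] = %.0f\n", b[7]); // expect 7
}
```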
b6979
vulkan : refactor buffer handling in vk_op_f32 (#16840)
* vulkan : refactor/simplify buffer handling in vk_op_* functions
* Combine UMA handling into ggml_vk_tensor_subbuffer
b6978
CUDA: fix should_use_mmvf for ne11 == 1 (#17085)