Releases: ggml-org/llama.cpp
b7003
b7002
vulkan: fix memory allocations (#17122)
b6999
server : handle failures to restore host cache (#17078)
* server : handle failures to restore host cache
* server : add tests for the prompt cache
b6996
vulkan: iGPU memory reporting fix (#17110)
* vulkan: use all device-local heaps for memory availability reporting
* use all available heaps for iGPU memory reporting
* Allow multiple memory types per buffer request for devices with split heaps

Co-authored-by: Giuseppe Scrivano
b6995
vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code
* fix mul_mat_id quantization call
* Fix compiler warnings
b6994
vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul
  This comes up in qwen3 moe.
* split mul_mat_id fusion tests into a separate class
b6993
metal : retain src and dst buffers during async ops (#17101)
b6992
arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models
* new manifest naming format
* improve naming
* Update common/arg.cpp

Co-authored-by: Georgi Gerganov
b6990
vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H
  Also add some additional unroll hints, which seem to help.
* lock around map lookup
b6989
server: fix correct time_ms calculation in prompt_progress (#17093)
* fix: correct time_ms calculation in send_partial_response
  The time_ms field was calculated incorrectly: the division was applied before the subtraction, so only the start timestamp was scaled.
  Before: (ggml_time_us() - slot.t_start_process_prompt / 1000)
  After: (ggml_time_us() - slot.t_start_process_prompt) / 1000
* docs : document time_ms field in prompt_progress
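
The precedence issue in the time_ms fix can be reproduced in isolation. The sketch below is not the server code: `elapsed_ms_buggy` and `elapsed_ms_fixed` are hypothetical helpers, and the timestamps are passed in as plain integers instead of being read from `ggml_time_us()` or the slot. It only illustrates why the placement of the parentheses matters.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the "Before" expression.
// Division binds tighter than subtraction, so only the start
// timestamp is divided by 1000; the result is neither microseconds
// nor milliseconds.
int64_t elapsed_ms_buggy(int64_t now_us, int64_t t_start_us) {
    return now_us - t_start_us / 1000;
}

// Hypothetical helper mirroring the "After" expression.
// The difference is taken in microseconds first and then
// converted to milliseconds.
int64_t elapsed_ms_fixed(int64_t now_us, int64_t t_start_us) {
    return (now_us - t_start_us) / 1000;
}
```

With a start timestamp of 2,000,000 us and a current timestamp of 5,000,000 us, the fixed form yields 3000 ms, while the buggy form yields 4,998,000, a value that grows with the absolute clock reading instead of the elapsed time.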