Releases · ggml-org/llama.cpp

09 Nov 18:10

b8595b1

b7003 Latest

mtmd : fix embedding size for image input (#17123)

Assets 16

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-11-09T18:10:05Z

llama-b7003-bin-macos-arm64.zip
sha256:6aeaef67a7369b6196786b47d985c47d54b8021cbbd48cd0998ac120aea54508
11.1 MB 2025-11-09T18:10:19Z

llama-b7003-bin-macos-x64.zip
sha256:7078aff42a09c6175d93a9ac6e4e6aa53d46c121548fbbc538c14673eaf7505c
28.3 MB 2025-11-09T18:10:20Z

llama-b7003-bin-ubuntu-s390x.zip
sha256:6947506051d1dcb6497f5fc8d0207df69ec206aa3ab9ce965133c9849cfae14b
12.7 MB 2025-11-09T18:10:22Z

llama-b7003-bin-ubuntu-vulkan-x64.zip
sha256:582bae4bbeec663c58ada128533d9970badbb90b967ec8407330b7a1b2651319
26.9 MB 2025-11-09T18:10:23Z

llama-b7003-bin-ubuntu-x64.zip
sha256:6541b4f14242c7ba3a70fceaedcd66d0010ebb7faec841769d77c729f70a8ba8
13.1 MB 2025-11-09T18:10:25Z

llama-b7003-bin-win-cpu-arm64.zip
sha256:75181fb6614f7e2e3b8913a284afd48c91cdc3df9390783f18abbc923fa9956e
11.2 MB 2025-11-09T18:10:26Z

llama-b7003-bin-win-cpu-x64.zip
sha256:fde7b9a1811d7f9a3eed872b858c3f783de11e1ac53e48eabb4556ca66a91ef3
14.3 MB 2025-11-09T18:10:27Z

llama-b7003-bin-win-cuda-12.4-x64.zip
sha256:c78b64299240b1d89b377a66b876569befc627a7e7ce8e6de65f74abb34a12a5
174 MB 2025-11-09T18:10:28Z

llama-b7003-bin-win-hip-radeon-x64.zip
sha256:7d2271eef952cb404936c0be93c5b7f169117e81a2928bbbf17449d45ad91ca3
324 MB 2025-11-09T18:10:35Z

Source code (zip)
2025-11-09T16:31:02Z

Source code (tar.gz)
2025-11-09T16:31:02Z
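
Each binary asset above is listed with a `sha256:` digest, so a download can be checked against it before unpacking. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `matches_published` are illustrative and not part of llama.cpp, and the local file path in the usage note is an assumption.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published):
    """Compare a file against a 'sha256:<hex>' string as shown in the asset list."""
    algo, _, expected = published.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo!r}")
    return sha256_of_file(path) == expected.strip().lower()
```

For example, assuming `llama-b7003-bin-ubuntu-x64.zip` was saved to the current directory, `matches_published("llama-b7003-bin-ubuntu-x64.zip", "sha256:6541b4f14242c7ba3a70fceaedcd66d0010ebb7faec841769d77c729f70a8ba8")` should return `True` for an intact download.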

09 Nov 16:52

github-actions

b7002

392e09a

vulkan: fix memory allocations (#17122)

Assets 16

09 Nov 12:54

github-actions

b6999

cb1adf8

server : handle failures to restore host cache (#17078)

* server : handle failures to restore host cache

* server : add tests for the prompt cache

Assets 16

09 Nov 09:57

github-actions

b6996

7f3e9d3

vulkan: iGPU memory reporting fix (#17110)

* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps
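
One possible reading of the bullets above is that the reported memory becomes a sum over several heaps rather than a single heap. The sketch below illustrates that reading with plain data structures only; it does not call the Vulkan API and is not the code from the pull request. The `Heap` class and `reported_memory` function are hypothetical names, and only the flag value `VK_MEMORY_HEAP_DEVICE_LOCAL_BIT` is taken from the Vulkan specification.

```python
from dataclasses import dataclass

# Flag value as defined by the Vulkan specification.
VK_MEMORY_HEAP_DEVICE_LOCAL_BIT = 0x00000001


@dataclass
class Heap:
    """Hypothetical stand-in for a memory heap: size in bytes plus heap flags."""
    size: int
    flags: int


def reported_memory(heaps, is_igpu):
    """Sum heap sizes: every heap on an iGPU, only device-local heaps otherwise.

    This mirrors the wording of the release notes, not the actual implementation.
    """
    if is_igpu:
        return sum(h.size for h in heaps)
    return sum(h.size for h in heaps if h.flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
```

With one 4 GiB device-local heap and one 8 GiB host heap, this sketch reports 4 GiB for a discrete GPU and 12 GiB for an integrated one.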

Assets 16

09 Nov 09:56

github-actions

b6995

8a3519b

vulkan: fix mmq out of bounds reads (#17108)

* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings

Assets 16

09 Nov 09:41

github-actions

b6994

80a6cf6

vulkan: fuse mul_mat_id + mul (#17095)

* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class

Assets 16

09 Nov 06:56

github-actions

b6993

0750a59

metal : retain src and dst buffers during async ops (#17101)

Assets 16

08 Nov 22:14

github-actions

b6992

aa3b7a9

arg: add --cache-list argument to list cached models (#17073)

* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov

Assets 16

08 Nov 20:44

github-actions

b6990

53d7d21

vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)

* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup

Assets 16

08 Nov 14:06

github-actions

b6989

eeee367

server: fix correct time_ms calculation in prompt_progress (#17093)

* fix: correct time_ms calculation in send_partial_response

The time_ms field was incorrectly calculated. The division was happening
before the subtraction leading to incorrect values.

Before: (ggml_time_us() - slot.t_start_process_prompt / 1000)
After: (ggml_time_us() - slot.t_start_process_prompt) / 1000

* docs : document time_ms field in prompt_progress
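
The precedence issue behind the `time_ms` fix can be reproduced in a few lines. This is a minimal sketch in Python rather than the server's C++; the function names are illustrative, and it assumes both timestamps are non-negative integer microsecond counts, so that `//` behaves like C++ integer division.

```python
def time_ms_buggy(now_us, start_us):
    # Division binds tighter than subtraction, so only start_us is scaled
    # to milliseconds before being subtracted from a microsecond value.
    return now_us - start_us // 1000


def time_ms_fixed(now_us, start_us):
    # Subtract first to get elapsed microseconds, then convert to milliseconds.
    return (now_us - start_us) // 1000


start_us = 5_000_000  # prompt processing started at t = 5 s
now_us = 5_250_000    # 250 ms later

print(time_ms_buggy(now_us, start_us))  # 5245000
print(time_ms_fixed(now_us, start_us))  # 250
```

The buggy form mixes units (microseconds minus milliseconds), which is why the reported value grows with absolute uptime instead of reflecting elapsed time.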

Assets 16
