Description
Name and Version
$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 6951 (9aa6337)
built with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 5 9600X, 128 GB RAM, 2 x RTX 5060 Ti 16 GB
Models
unsloth/gpt-oss-120b-GGUF:F16
Problem description & steps to reproduce
In recent versions of llama.cpp (unfortunately, I can't tell exactly which version introduced it), unsloth/gpt-oss-120b-GGUF:F16 produces output unrelated to the user's prompt: sometimes it replies to something it wasn't asked, sometimes the output is completely incoherent. It may not happen with short prompts, but with longer prompts (800+ tokens) it happens every time. Models of the Qwen3 family (including VL) process the same prompts without issues, just as gpt-oss-120b did about a month ago.
Running it like this:
./llama-server -dev CUDA0,CUDA1 -ngl 99 -ts 9,10 -c 90000 --no-webui -hf unsloth/gpt-oss-120b-GGUF:F16 --prio 3 -ot ".ffn_(up|down)_exps.=CPU" -fa on --swa-full --jinja
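For reference, the kind of request that triggers it can be sent directly to the server's OpenAI-compatible endpoint. This is only a sketch: the port is the llama-server default (8080) and the placeholder stands in for the real 800+ token prompt.

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "<long prompt, 800+ tokens>"}
        ],
        "temperature": 1.0,
        "top_p": 1.0
      }'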
Adding the parameters recommended by Unsloth.ai (--temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 0.0) doesn't change anything.
First Bad Commit
n/a
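I haven't been able to bisect yet. If it helps narrow things down, this is roughly what I would run (the build flags and the "good" commit are assumptions; the good commit would be one from about a month ago when the model still worked):

git bisect start
git bisect bad HEAD
git bisect good <commit from ~1 month ago that worked>
# at each step, rebuild with CUDA and retest:
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
# re-run the llama-server command above with a long prompt, then mark:
git bisect good   # or: git bisect bad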
Relevant log output
n/a