Buffer Overflow in llama.cpp via Malicious GGUF Model – Exploitable via Vocabulary Loading (`llama_vocab::impl::token_to_piece`)

Summary

An attacker-supplied GGUF model vocabulary can trigger a buffer overflow in llama.cpp's vocabulary-handling code. The helper _try_copy inside llama_vocab::impl::token_to_piece() (llama.cpp/src/vocab.cpp) casts a very large size_t token length to int32_t, so the length check (if (length < (int32_t)size)) is bypassed. memcpy is then called with the oversized size, letting a malicious model write past the end of the destination buffer. This can lead to memory corruption and potentially code execution.

Details

The vulnerability lies in the function:

llama.cpp/src/vocab.cpp
  llama_vocab::impl::token_to_piece(llama_token token,
                                    char * buf,
                                    int32_t length,
                                    int32_t lstrip,
                                    bool special) const

Specifically, the inline helper _try_copy compares the signed length against a size_t that has been cast to int32_t, without handling values that exceed INT32_MAX. When the value is that large, the cast to int32_t wraps (typically into a negative value), so the length check is bypassed and the memcpy runs unchecked.

// File: llama.cpp/src/vocab.cpp (around line 2570)

auto _try_copy = [=](const char * token, size_t size) -> int32_t {
    // 1) Skip up to `lstrip` leading spaces in the token string.
    for (int32_t i = 0; i < lstrip && size && *token == ' '; ++i) {
        token++;
        size--;
    }

    // 2) Bound check (VULNERABLE):
    //    - `length` is the maximum number of bytes the caller promised `buf` can hold (signed int32_t).
    //    - `size` is the unsigned token length (size_t). If size > INT32_MAX, casting to int32_t overflows
    //      and produces a negative value.
    if (length < (int32_t) size) {
        // Intention: return a negative error code when the token is too large to fit.
        // But when size > INT32_MAX:
        //    (int32_t)size becomes a negative integer (e.g. size_t=2,147,483,648 → (int32_t)=−2,147,483,648).
        //    Then (length < negative) is always false, so this branch is skipped.
        return -(int32_t) size;
    }

    // 3) Unchecked memcpy (VULNERABLE):
    //    At this point, even if `size` is far larger than `length`, the code will reach this memcpy,
    //    because the prior check falsely evaluated to false when (int32_t)size wrapped negative.
    //    This copies `size` bytes into `buf`, overrunning the buffer whenever size > length.
    memcpy(buf, token, size);

    // 4) Return the number of bytes copied (signed).
    //    Note: this cast also overflows if size > INT32_MAX, but the overflow has already happened.
    return (int32_t) size;
};

Why This Check Fails for Extremely Large Tokens:

Unsigned size vs. Signed length:
- size is size_t (e.g., 64-bit on most platforms).
- length is int32_t (maximum positive value = 2,147,483,647).
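Cast Overflow:
- If token_text.size() lies between INT32_MAX + 1 and UINT32_MAX, then (int32_t) size wraps into a negative value (two's complement). Larger values keep only their low 32 bits, so the result may be negative or a small positive number; in both cases it no longer reflects the real size. For example: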
```
size_t size = 2147483648;  // INT32_MAX + 1
(int32_t) size;            // evaluates to -2147483648
```
- For a non-negative length, the comparison if (length < (int32_t) size) is then effectively if (non_negative < large_negative), which is always false.
Unchecked memcpy:
- Because the bound check is bypassed, the code executes memcpy(buf, token, size).
- Even though buf only has room for length bytes, memcpy copies all size bytes (at least 2 GiB), overrunning the buffer.
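
The wrap-around described above can be reproduced in isolation. The following is a minimal sketch, not code from llama.cpp: the names check_rejects and demo_check_bypass are hypothetical, and uint64_t stands in for a 64-bit size_t. It assumes a two's-complement target on which the narrowing cast keeps the low 32 bits (guaranteed from C++20, and the usual behavior of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the vulnerable comparison in _try_copy.
// Returns true when the "token too large" branch would be taken.
static bool check_rejects(int32_t length, uint64_t size) {
    return length < (int32_t) size;  // narrowing cast keeps only the low 32 bits
}

// Exercises the check with in-range and out-of-range sizes.
static void demo_check_bypass() {
    assert(check_rejects(16, 17));              // 17 bytes do not fit in 16: rejected
    assert(!check_rejects(16, 16));             // exact fit: accepted
    assert(!check_rejects(16, 2147483648ULL));  // INT32_MAX + 1 wraps negative: accepted
    assert(!check_rejects(16, 4294967301ULL));  // 2^32 + 5 truncates to 5: accepted
}
```

The last two cases are the ones that matter: the check accepts sizes far larger than the 16-byte buffer, which is exactly the condition under which the real code goes on to call memcpy.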

Callers and Code Paths

Any “token → string” conversion can overflow if token_text.size() > INT32_MAX. Notable call sites include:

Model loading (each GGUF token string passes through token_to_piece())
Detokenization (llama_vocab::impl::detokenize(...))
Grammar routines (llama_grammar_apply_impl, llama_grammar_accept_impl)
Sampling & infill (llama_sampler_infill_apply, etc.)
Public API (llama_token_to_piece(...))

As soon as llama.cpp converts the oversized token to a string, the memcpy in _try_copy() writes past the end of the destination buffer, which typically crashes the process.
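
To illustrate why the overflow tends to land in small caller-owned buffers, here is a sketch of a grow-and-retry pattern that relies on the negative return value. This is an assumption about typical caller code, not a quote from llama.cpp: piece_into is a hypothetical stand-in for the token_to_piece contract (copy if it fits, otherwise return the negated required size), and piece_to_string is a hypothetical caller.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for the token_to_piece contract described above.
// Assumes piece.size() <= INT32_MAX, which is the precondition the real code fails to enforce.
static int32_t piece_into(const std::string & piece, char * buf, int32_t length) {
    if (length < 0 || piece.size() > static_cast<size_t>(length)) {
        return -static_cast<int32_t>(piece.size());  // negative: buffer too small
    }
    std::memcpy(buf, piece.data(), piece.size());
    return static_cast<int32_t>(piece.size());       // number of bytes written
}

// Hypothetical caller: start with a small buffer, grow once if asked to.
static std::string piece_to_string(const std::string & piece) {
    std::string out(8, '\0');
    int32_t n = piece_into(piece, &out[0], static_cast<int32_t>(out.size()));
    if (n < 0) {
        out.resize(static_cast<size_t>(-n));          // grow to the reported size
        n = piece_into(piece, &out[0], static_cast<int32_t>(out.size()));
    }
    assert(n >= 0);
    out.resize(static_cast<size_t>(n));
    return out;
}
```

A caller written this way trusts the callee never to write more than length bytes. When the bound check is bypassed, the first call already copies the full oversized token into the small initial buffer.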

Impact

Vulnerability Type
- Buffer overflow caused by an unsigned-to-signed narrowing conversion (size_t cast to int32_t) in _try_copy().
Attack Vector
- A malicious GGUF model file containing a vocabulary entry whose token_text.size() exceeds INT32_MAX.
- As soon as llama.cpp attempts any “token → string” conversion (e.g., during model load, detokenization, grammar checks, or sampling), the oversized size_t bypasses the length check and triggers an unchecked memcpy.
Affected Component
- llama_vocab::impl::token_to_piece(), which is invoked by:
  - Model loading
  - Detokenization (llama_vocab::impl::detokenize(...))
  - Grammar routines (llama_grammar_apply_impl, llama_grammar_accept_impl)
  - Sampling/infill code (llama_sampler_infill_apply, etc.)
  - The public API (llama_token_to_piece())
Severity
- Critical: a single malicious token in a GGUF file can corrupt memory and may allow an attacker to hijack control flow.
Consequences
- Arbitrary Memory Corruption
  - Overwrites heap or stack data, leading to application instability or crashes.
- Remote Code Execution (RCE)
  - By corrupting adjacent heap metadata, return addresses, or vtable pointers, an attacker can redirect execution flow.
- Denial of Service (DoS)
  - Immediate crash under sanitizers (ASAN) or undefined behavior in production binaries.
- Information Disclosure
  - Overwritten memory might reveal sensitive data or internal pointers.
Who Is Impacted
- Any application or service that uses llama.cpp to load GGUF models from untrusted sources.
- Inference servers, chatbots, or pipelines that dynamically ingest external model files are all at risk.

Mitigation & Recommendations

- Required Patch
  - Modify _try_copy so that length and size are compared in an unsigned context, for example:
```
if ((size_t)length < size) {
    return -(int32_t)size;
}
```
  - For a non-negative length, this change ensures size values above INT32_MAX cannot bypass the bound check.
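
The one-line change above still has two edge cases worth noting: a negative length cast to size_t becomes a very large value and would pass the check, and -(int32_t)size remains meaningless when size exceeds INT32_MAX. The following is a sketch of a stricter variant under stated assumptions, not the upstream fix: try_copy_checked is a hypothetical free-function version of _try_copy, and throwing std::length_error for unrepresentable sizes is one possible policy among several (aborting or returning a dedicated error code would also work).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <stdexcept>

// Hypothetical hardened variant of _try_copy (illustrative only).
static int32_t try_copy_checked(char * buf, int32_t length, const char * token, size_t size) {
    // A size that does not fit in int32_t can be neither copied nor reported safely.
    if (size > static_cast<size_t>(std::numeric_limits<int32_t>::max())) {
        throw std::length_error("token size exceeds int32_t range");
    }
    // From here on, size fits in int32_t, so the casts below are lossless.
    if (length < 0 || static_cast<size_t>(length) < size) {
        return -static_cast<int32_t>(size);  // negative: required size
    }
    std::memcpy(buf, token, size);
    return static_cast<int32_t>(size);       // number of bytes copied
}

// Exercises the normal, too-small, and negative-length paths.
static void demo_try_copy_checked() {
    char buf[4];
    assert(try_copy_checked(buf, 4, "abc", 3) == 3);
    assert(try_copy_checked(buf, 2, "abc", 3) == -3);
    assert(try_copy_checked(buf, -1, "abc", 3) == -3);
}
```

Rejecting oversized tokens before any comparison keeps every later cast in range, so the return value always carries a usable byte count.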

Weaknesses

Improper Restriction of Operations within the Bounds of a Memory Buffer

Signed to Unsigned Conversion Error
