
[FT] Prompt Level Caching #1053

@f14-bertolotti

Description


Issue Encountered

Currently, the caching mechanism stores responses for the entire evaluation split but does not take sampling parameters (e.g., temperature) into account. These parameters significantly influence model behavior, so cached responses may no longer match the configuration they were generated under once the sampling settings change.

In particular, after modifying LiteLLM sampling parameters, cache hits can still occur and return results that are inconsistent or incoherent relative to the new settings.
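To make the failure mode concrete, here is a minimal, purely illustrative sketch (the key derivation below is hypothetical, not the project's actual implementation): if the cache key is derived from the prompt alone, two runs that differ only in temperature collide on the same entry.

import hashlib

def prompt_only_key(prompt: str) -> str:
    # Hypothetical key that ignores sampling parameters entirely.
    return hashlib.sha256(prompt.encode()).hexdigest()

# Same prompt, different temperature: identical keys, so the second run
# silently reuses a response that was generated under temperature=0.0.
key_t0 = prompt_only_key("What is the capital of France?")  # temperature=0.0
key_t1 = prompt_only_key("What is the capital of France?")  # temperature=1.0
assert key_t0 == key_t1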

Proposed Solution / Feature

Instead of the current caching approach, a request-based caching mechanism could be implemented using a library like diskcache. This method would cache based on the full request payload, ensuring that changes to sampling parameters (or any other request field) generate distinct cache keys.

Below is an illustrative example using an OpenAI-style request:

import hashlib, json, requests, diskcache

# Prepare request parameters
request_params = {
    "url": f"{self.config.base_url}/chat/completions",
    "headers": {
        "Authorization": f"Bearer {self.config.api_key}",
        "Content-Type": "application/json",
    },
    "json": {
        "model": self.config.model_name,
        "messages": [{"role": "user", "content": doc.query}],
        "n": doc.num_samples,
        "max_tokens": self.config.max_tokens,
        "temperature": self.config.temperature,
        "top_p": self.config.top_p,
        "min_p": self.config.min_p,
        "seed": self.config.seed,
        **self.config.extra_body,
    },
    "timeout": self.config.timeout,
}

# Cache lookup and update: the key covers the full request payload,
# so any change to the model, messages, or sampling parameters
# produces a distinct cache entry.
with diskcache.Cache(self.config.cache_dir or "/tmp/vllm_cache") as cache:
    key = hashlib.sha256(json.dumps(request_params, sort_keys=True).encode()).hexdigest()
    if key not in cache:
        cache[key] = {
            "response": requests.post(**request_params).json(),
            "request": request_params,
        }
    response = cache[key]["response"]

This approach allows multiple concurrent processes or threads to safely access the same cache, enabling consistent reuse across evaluations on a shared filesystem.
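As a rough sketch of that concurrency claim (the cache directory, worker function, and placeholder response below are invented for the example), each worker process opens the same cache directory and uses diskcache's add() for a get-or-set; diskcache's on-disk store handles locking, so concurrent writers do not corrupt the cache and every reader sees a single stored value per key.

import hashlib, json, diskcache
from multiprocessing import Pool

CACHE_DIR = "/tmp/vllm_cache"  # hypothetical shared location on a common filesystem

def cached_call(payload: dict) -> dict:
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    with diskcache.Cache(CACHE_DIR) as cache:
        hit = cache.get(key)
        if hit is not None:
            return hit
        response = {"echo": payload}   # stand-in for the real HTTP request
        cache.add(key, response)       # add() is a no-op if another worker wrote the key first
        return cache.get(key)

if __name__ == "__main__":
    payloads = [{"prompt": f"question {i % 4}", "temperature": 0.7} for i in range(16)]
    with Pool(processes=4) as pool:
        results = pool.map(cached_call, payloads)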

Adopting diskcache would also reduce maintenance complexity by replacing the custom caching logic with a robust, well-tested framework.
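For instance, much of the manual hash-and-lookup logic could be replaced by diskcache's built-in memoize decorator; the helper below is an invented stand-in for the real request code, and passing the payload as a canonically serialized JSON string keeps the key stable across runs.

import diskcache, requests

cache = diskcache.Cache("/tmp/vllm_cache")  # hypothetical cache location

@cache.memoize()  # keys on the function arguments and persists results to disk
def chat_completion(base_url: str, api_key: str, payload_json: str, timeout: int) -> dict:
    # payload_json is produced with json.dumps(..., sort_keys=True) so it hashes deterministically
    return requests.post(
        f"{base_url}/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        data=payload_json,
        timeout=timeout,
    ).json()

A single decorated function would then subsume the explicit key construction and cache lookup shown above.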

If this proposal is accepted, I’d be happy to open a PR implementing it.
