Description
Issue Encountered
Currently, the caching mechanism stores the entire evaluation split but does not consider sampling parameters (e.g., temperature). These parameters significantly influence model behavior, meaning that cached responses may not align with new configurations if the sampling settings change.
In particular, when the LiteLLM sampling parameters are modified, stale cache hits can occur, producing results that are inconsistent or incoherent with respect to the new settings.
Proposed Solution / Feature
Instead of the current caching approach, a request-based caching mechanism could be implemented using a library like diskcache. This method would cache based on the full request payload, ensuring that changes to sampling parameters (or any other request field) generate distinct cache keys.
Below is an illustrative example using an OpenAI-style request:
import hashlib, json, requests, diskcache

# Prepare request parameters
request_params = {
    "url": f"{self.config.base_url}/chat/completions",
    "headers": {
        "Authorization": f"Bearer {self.config.api_key}",
        "Content-Type": "application/json",
    },
    "json": {
        "model": self.config.model_name,
        "messages": [{"role": "user", "content": doc.query}],
        "n": doc.num_samples,
        "max_tokens": self.config.max_tokens,
        "temperature": self.config.temperature,
        "top_p": self.config.top_p,
        "min_p": self.config.min_p,
        "seed": self.config.seed,
        **self.config.extra_body,
    },
    "timeout": self.config.timeout,
}
# Cache lookup and update
with diskcache.Cache(self.config.cache_dir or "/tmp/vllm_cache") as cache:
    key = hashlib.sha256(json.dumps(request_params, sort_keys=True).encode()).hexdigest()
    if key not in cache:
        cache[key] = {
            "response": requests.post(**request_params).json(),
            "request": request_params,
        }
    response = cache[key]["response"]

This approach allows multiple concurrent processes or threads to safely access the same cache, enabling consistent reuse across evaluations on a shared filesystem.
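As a minimal, self-contained sketch of that concurrency property (the cache directory, the payloads, and the expensive_call stand-in below are hypothetical, not part of the current code base), several workers can share one cache directory, and repeated payloads reuse the stored result:

import hashlib
import json
import time
from concurrent.futures import ThreadPoolExecutor

import diskcache

CACHE_DIR = "/tmp/vllm_cache_demo"  # hypothetical shared cache directory

def expensive_call(payload: dict) -> dict:
    # Stand-in for requests.post(**request_params).json()
    time.sleep(1)
    return {"echo": payload["json"]["temperature"]}

def cached_completion(payload: dict) -> dict:
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    # Each worker opens its own handle on the shared directory; diskcache
    # coordinates reads and writes across threads and processes.
    with diskcache.Cache(CACHE_DIR) as cache:
        hit = cache.get(key)
        if hit is None:
            hit = {"response": expensive_call(payload), "request": payload}
            cache.set(key, hit)
        return hit["response"]

# Two distinct temperatures -> two distinct cache keys; once a payload's
# result has been stored, later lookups for that payload are cache hits.
payloads = [{"json": {"temperature": t}} for t in (0.0, 0.0, 0.7, 0.7)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cached_completion, payloads))
print(results)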
Adopting diskcache would also reduce maintenance complexity by replacing the custom caching logic with a robust, well-tested framework.
If this proposal is accepted, I’d be happy to open a PR implementing it.