
Commit 9868268

yiz-liu authored and Angazenn committed
[v0.11.0][Fix] Cap max tokens to prevent potential OOM (vllm-project#3720) (vllm-project#3744)
### What this PR does / why we need it?

Caps the calculated maximum number of tokens at 512. This prevents allocating an excessively large buffer when a cudagraph capture size is not specified, mitigating the risk of out-of-memory errors.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

None.

Signed-off-by: Yizhou Liu <[email protected]>
1 parent 9191728 commit 9868268

File tree

1 file changed: +3, -1 lines changed


vllm_ascend/worker/model_runner_v1.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -543,7 +543,9 @@ def _init_mc2_tokens_capacity(self):
         if self.compilation_config.cudagraph_capture_sizes:
             max_num_tokens = self.compilation_config.cudagraph_capture_sizes[0]
         else:
-            max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len
+            # NOTE: To save memory, we cap the max number of tokens to 512.
+            max_num_tokens = min(
+                self.max_num_reqs * self.uniform_decode_query_len, 512)
         tp_size = self.parallel_config.tensor_parallel_size
         # Use integer arithmetic for ceiling division.
         num_tokens_per_tp_rank = (max_num_tokens + tp_size - 1) // tp_size
```
