Commit 1f25d60
authored
[Fix] Cap max tokens to prevent potential OOM (#3720)
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.
This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
vllm-project/vllm@17c540a
Signed-off-by: Yizhou Liu <[email protected]>1 parent 63c363d commit 1f25d60
1 file changed
+3
-1
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
586 | 586 | | |
587 | 587 | | |
588 | 588 | | |
589 | | - | |
| 589 | + | |
| 590 | + | |
| 591 | + | |
590 | 592 | | |
591 | 593 | | |
592 | 594 | | |
| |||
0 commit comments