
Commit ec98320

correct bug to fix the value of max_num_tokens (#3933)

### What this PR does / why we need it?

Correct a bug in how max_num_tokens is computed: it was set to the tensor-parallel size instead of max_num_reqs * uniform_decode_query_len.

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: zouyida2052 <[email protected]>
1 parent 0b9b6d7 commit ec98320

1 file changed: +1, -1 lines changed

vllm_ascend/torchair/torchair_model_runner.py

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ def _init_mc2_tokens_capacity(self):
         # NOTE: To be clear, we need to make sure that during graph capture, the number of
         # tokens is less than or equal to mc2_tokens_capacity. According to _set_cudagraph_sizes,
         # the max number of tokens in graph is min(max_num_seqs * uniform_decode_query_len, 512).
-        max_num_tokens = self.parallel_config.tensor_parallel_size
+        max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len
         tp_size = self.parallel_config.tensor_parallel_size
         # Use integer arithmetic for ceiling division.
         max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
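
For context, here is a minimal standalone sketch of the corrected sizing logic. The helper `calculate_new_torchair_graph_batch_size` is not part of this diff, so the rounding step below (padding max_num_tokens up to a multiple of tp_size with integer ceiling division, following the in-line comment) is an assumption for illustration, not the actual vllm_ascend implementation.

```python
# Hypothetical sketch of the corrected capacity computation; the function name
# and the rounding behaviour of calculate_new_torchair_graph_batch_size are assumed.
def mc2_tokens_capacity_sketch(max_num_reqs: int,
                               uniform_decode_query_len: int,
                               tensor_parallel_size: int) -> int:
    # Fixed line: derive the token budget from the decode batch shape,
    # not from the tensor-parallel size.
    max_num_tokens = max_num_reqs * uniform_decode_query_len
    tp_size = tensor_parallel_size
    # Integer ceiling division, then scale back up so the capacity is a
    # multiple of tp_size (assumed behaviour of the real helper).
    return ((max_num_tokens + tp_size - 1) // tp_size) * tp_size


# Example: 24 decode requests, 1 token each, TP=16 -> capacity padded to 32.
assert mc2_tokens_capacity_sketch(24, 1, 16) == 32
```

With the old code, the capacity was tied to the tensor-parallel size alone, so it could undershoot the real token count captured in the graph; the fix ties it to the decode batch shape described in the NOTE comment.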
