
GH200 R1 kv cache calculation error #2

@WuNein

Description

```bash
export PYTORCH_CUDA_ALLOC_CONF='use_uvm:True,uvm_oversubscription_ratio:5.0,uvm_access_pattern:gpu_first'
vllm serve DeepSeek-R1-AWQ/ --quantization moe_wna16 --trust-remote-code --max-model-len 90 --gpu-memory-utilization 0.98 --enforce-eager --enable-chunked-prefill=False --swap-space 1
```
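For context, these allocator settings route allocations through CUDA unified (managed) memory. Assuming `uvm_oversubscription_ratio` scales the physical HBM (my reading of the flag, not verified documentation), a quick sanity check on why 339.70 GiB of weights can load at all on a 94.50 GiB GPU:

```python
# Back-of-the-envelope check, not vLLM or PyTorch code.
# Assumption: uvm_oversubscription_ratio multiplies the physical HBM size.
hbm_gib = 94.50                     # total_gpu_memory reported in the log below
oversubscription_ratio = 5.0        # from PYTORCH_CUDA_ALLOC_CONF
weights_gib = 339.70                # DeepSeek-R1 AWQ weights per the log below

addressable_gib = hbm_gib * oversubscription_ratio   # ~472.5 GiB of UVM space
print(f"addressable ~ {addressable_gib:.1f} GiB, weights = {weights_gib} GiB")
assert weights_gib < addressable_gib  # the weights fit only via UVM oversubscription
```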

I was able to load the R1 AWQ model on GH200, but the server fails to start:
```
INFO 01-25 07:47:34 worker.py:266] model weights take 339.70GiB; non_torch_memory takes -247.98GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is -0.34GiB.
```

Actual GPU memory usage on the device is around 90%.

The negative memory figures are nonsensical.

vLLM needs a code change here: the KV cache size computation is not correct for GH200 (see the sketch of the failing arithmetic below).
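For reference, here is a minimal sketch of the accounting behind the worker.py:266 line (variable names are mine for illustration, not vLLM internals; the values are copied from the log):

```python
# Reconstructing the worker.py:266 numbers. Names are illustrative, not
# vLLM internals; values are taken verbatim from the log below.
total_gpu_memory = 94.50        # GiB of physical HBM on the GH200
gpu_memory_utilization = 0.98   # CLI flag
model_weights = 339.70          # GiB allocated by torch (UVM-managed, mostly not resident)
activation_peak = 1.23          # GiB measured during profiling
non_torch_memory = -247.98      # GiB as reported -- already meaningless

budget = total_gpu_memory * gpu_memory_utilization            # 92.61 GiB
kv_cache = budget - model_weights - non_torch_memory - activation_peak
print(f"budget = {budget:.2f} GiB, kv_cache = {kv_cache:.2f} GiB")
# -> budget = 92.61 GiB, kv_cache = -0.34 GiB, so 0 CUDA blocks are allocated
```

Presumably non_torch_memory is inferred by subtracting torch-allocated bytes from the physically used device memory; with UVM oversubscription torch reports 339.70 GiB allocated while only about 91.7 GiB is actually resident in HBM, which is why that term lands around -248 GiB and the remaining KV cache budget comes out negative.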




```
INFO 01-25 07:47:31 model_runner.py:1114] Loading model weights took 339.6998 GB
WARNING 01-25 07:47:31 fused_moe.py:634] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/model_executor/layers/fused_moe/configs/E=256,N=1024,device_name=NVIDIA_GH200_480GB,dtype=int4_w8a16.json
INFO 01-25 07:47:34 worker.py:266] Memory profiling takes 3.09 seconds
INFO 01-25 07:47:34 worker.py:266] the current vLLM instance can use total_gpu_memory (94.50GiB) x gpu_memory_utilization (0.98) = 92.61GiB
INFO 01-25 07:47:34 worker.py:266] model weights take 339.70GiB; non_torch_memory takes -247.98GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is -0.34GiB.
INFO 01-25 07:47:34 executor_base.py:107] # CUDA blocks: 0, # CPU blocks: 44
INFO 01-25 07:47:34 executor_base.py:112] Maximum concurrency for 90 tokens per request: 0.00x
ERROR 01-25 07:47:34 engine.py:387] No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 01-25 07:47:34 engine.py:387] Traceback (most recent call last):
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/multiprocessing/engine.py", line 378, in run_mp_engine
ERROR 01-25 07:47:34 engine.py:387]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 01-25 07:47:34 engine.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/multiprocessing/engine.py", line 121, in from_engine_args
ERROR 01-25 07:47:34 engine.py:387]     return cls(ipc_path=ipc_path,
ERROR 01-25 07:47:34 engine.py:387]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/multiprocessing/engine.py", line 73, in __init__
ERROR 01-25 07:47:34 engine.py:387]     self.engine = LLMEngine(*args, **kwargs)
ERROR 01-25 07:47:34 engine.py:387]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/llm_engine.py", line 274, in __init__
ERROR 01-25 07:47:34 engine.py:387]     self._initialize_kv_caches()
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
ERROR 01-25 07:47:34 engine.py:387]     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/executor/executor_base.py", line 118, in initialize_cache
ERROR 01-25 07:47:34 engine.py:387]     self.collective_rpc("initialize_cache",
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 01-25 07:47:34 engine.py:387]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 01-25 07:47:34 engine.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/utils.py", line 2208, in run_method
ERROR 01-25 07:47:34 engine.py:387]     return func(*args, **kwargs)
ERROR 01-25 07:47:34 engine.py:387]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/worker/worker.py", line 293, in initialize_cache
ERROR 01-25 07:47:34 engine.py:387]     raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/worker/worker.py", line 529, in raise_if_cache_size_invalid
ERROR 01-25 07:47:34 engine.py:387]     raise ValueError("No available memory for the cache blocks. "
ERROR 01-25 07:47:34 engine.py:387] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
```
