
GH200 R1 kv cache calculation error #2

@WuNein

Description

```bash
export PYTORCH_CUDA_ALLOC_CONF='use_uvm:True,uvm_oversubscription_ratio:5.0,uvm_access_pattern:gpu_first'
vllm serve DeepSeek-R1-AWQ/ --quantization moe_wna16 --trust-remote-code --max-model-len 90 --gpu-memory-utilization 0.98 --enforce-eager --enable-chunked-prefill=False --swap-space 1
```
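For context, these allocator settings route allocations through CUDA unified (managed) memory. Assuming `uvm_oversubscription_ratio` scales the physical HBM (my reading of the flag, not verified documentation), a quick sanity check on why 339.70 GiB of weights can load at all on a 94.50 GiB GPU:

```python
# Back-of-the-envelope check, not vLLM or PyTorch code.
# Assumption: uvm_oversubscription_ratio multiplies the physical HBM size.
hbm_gib = 94.50                     # total_gpu_memory reported in the log below
oversubscription_ratio = 5.0        # from PYTORCH_CUDA_ALLOC_CONF
weights_gib = 339.70                # DeepSeek-R1 AWQ weights per the log below

addressable_gib = hbm_gib * oversubscription_ratio   # ~472.5 GiB of UVM space
print(f"addressable ~ {addressable_gib:.1f} GiB, weights = {weights_gib} GiB")
assert weights_gib < addressable_gib  # the weights fit only via UVM oversubscription
```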

I was able to load the R1 AWQ model on GH200, but the server fails to start:
```
INFO 01-25 07:47:34 worker.py:266] model weights take 339.70GiB; non_torch_memory takes -247.98GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is -0.34GiB.
```

Actual GPU memory usage on the device is around 90%.

The negative memory figures are nonsensical.

vLLM needs a code change here: the KV cache size computation is not correct for GH200 (see the sketch of the failing arithmetic below).
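For reference, here is a minimal sketch of the accounting behind the worker.py:266 line (variable names are mine for illustration, not vLLM internals; the values are copied from the log):

```python
# Reconstructing the worker.py:266 numbers. Names are illustrative, not
# vLLM internals; values are taken verbatim from the log below.
total_gpu_memory = 94.50        # GiB of physical HBM on the GH200
gpu_memory_utilization = 0.98   # CLI flag
model_weights = 339.70          # GiB allocated by torch (UVM-managed, mostly not resident)
activation_peak = 1.23          # GiB measured during profiling
non_torch_memory = -247.98      # GiB as reported -- already meaningless

budget = total_gpu_memory * gpu_memory_utilization            # 92.61 GiB
kv_cache = budget - model_weights - non_torch_memory - activation_peak
print(f"budget = {budget:.2f} GiB, kv_cache = {kv_cache:.2f} GiB")
# -> budget = 92.61 GiB, kv_cache = -0.34 GiB, so 0 CUDA blocks are allocated
```

Presumably non_torch_memory is inferred by subtracting torch-allocated bytes from the physically used device memory; with UVM oversubscription torch reports 339.70 GiB allocated while only about 91.7 GiB is actually resident in HBM, which is why that term lands around -248 GiB and the remaining KV cache budget comes out negative.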




```
INFO 01-25 07:47:31 model_runner.py:1114] Loading model weights took 339.6998 GB
WARNING 01-25 07:47:31 fused_moe.py:634] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/model_executor/layers/fused_moe/configs/E=256,N=1024,device_name=NVIDIA_GH200_480GB,dtype=int4_w8a16.json
INFO 01-25 07:47:34 worker.py:266] Memory profiling takes 3.09 seconds
INFO 01-25 07:47:34 worker.py:266] the current vLLM instance can use total_gpu_memory (94.50GiB) x gpu_memory_utilization (0.98) = 92.61GiB
INFO 01-25 07:47:34 worker.py:266] model weights take 339.70GiB; non_torch_memory takes -247.98GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is -0.34GiB.
INFO 01-25 07:47:34 executor_base.py:107] # CUDA blocks: 0, # CPU blocks: 44
INFO 01-25 07:47:34 executor_base.py:112] Maximum concurrency for 90 tokens per request: 0.00x
ERROR 01-25 07:47:34 engine.py:387] No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 01-25 07:47:34 engine.py:387] Traceback (most recent call last):
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/multiprocessing/engine.py", line 378, in run_mp_engine
ERROR 01-25 07:47:34 engine.py:387]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 01-25 07:47:34 engine.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/multiprocessing/engine.py", line 121, in from_engine_args
ERROR 01-25 07:47:34 engine.py:387]     return cls(ipc_path=ipc_path,
ERROR 01-25 07:47:34 engine.py:387]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/multiprocessing/engine.py", line 73, in __init__
ERROR 01-25 07:47:34 engine.py:387]     self.engine = LLMEngine(*args, **kwargs)
ERROR 01-25 07:47:34 engine.py:387]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/llm_engine.py", line 274, in __init__
ERROR 01-25 07:47:34 engine.py:387]     self._initialize_kv_caches()
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
ERROR 01-25 07:47:34 engine.py:387]     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/executor/executor_base.py", line 118, in initialize_cache
ERROR 01-25 07:47:34 engine.py:387]     self.collective_rpc("initialize_cache",
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 01-25 07:47:34 engine.py:387]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 01-25 07:47:34 engine.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/utils.py", line 2208, in run_method
ERROR 01-25 07:47:34 engine.py:387]     return func(*args, **kwargs)
ERROR 01-25 07:47:34 engine.py:387]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/worker/worker.py", line 293, in initialize_cache
ERROR 01-25 07:47:34 engine.py:387]     raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 01-25 07:47:34 engine.py:387]   File "/usr/local/lib/python3.11/dist-packages/vllm-0.1.dev4283+gdf887e2.cu124-py3.11-linux-aarch64.egg/vllm/worker/worker.py", line 529, in raise_if_cache_size_invalid
ERROR 01-25 07:47:34 engine.py:387]     raise ValueError("No available memory for the cache blocks. "
ERROR 01-25 07:47:34 engine.py:387] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
```
