Description
Version
0.0.5
Operating System
Linux
Python Version
3.12
What happened?
I am calling `create` on file chunks to generate QA pairs. It works for the first file; then GPU usage drops to 0% and vLLM appears to crash silently: it makes no further progress and eventually times out.
The loop I use to call the CLI command:

```python
for filename in tqdm(glob.glob("data/output/*/chunked/")):
    !synthetic-data-kit -c synthetic_data_kit_config.yaml create {filename} --num-pairs 20 --type "qa"
    time.sleep(5)
    torch.cuda.empty_cache()
```

This is strange: it seems to be a memory issue, yet the cards I use should have more than enough VRAM.
As a parameter I set the maximum GPU memory utilization to 0.8.
The model: unsloth/Llama-3.2-3B-Instruct -> ~6GB VRAM Usage
The system: 2x L40S (48GB VRAM)
I'm not sure how to fix this issue, as there is no error logged; it just freezes. Do you have any idea how I could fix this?
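For debugging, a variant of the loop I could try is to run the same command through plain `subprocess` instead of the IPython `!` magic, with a timeout so the hang surfaces explicitly instead of stalling silently. This is just a diagnostic sketch: the CLI invocation is copied from the loop above, and the 30-minute timeout is an arbitrary guess.

```python
import glob
import subprocess

from tqdm import tqdm

# Same glob and CLI invocation as in the loop above; the timeout is only there
# to turn a silent hang into a visible failure for a specific file.
for filename in tqdm(glob.glob("data/output/*/chunked/")):
    cmd = [
        "synthetic-data-kit",
        "-c", "synthetic_data_kit_config.yaml",
        "create", filename,
        "--num-pairs", "20",
        "--type", "qa",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=1800)
        # Print the tail of the CLI output so the last successful step is visible.
        print(result.stdout[-2000:])
        print(result.stderr[-2000:])
    except subprocess.TimeoutExpired:
        print(f"Timed out on {filename}")
        break
```

At minimum this should show the stdout/stderr tail for the file on which processing stops.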
Relevant log output
vLLM STDOUT: INFO 09-11 14:37:24 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
vLLM STDOUT: WARNING 09-11 14:37:24 [arg_utils.py:1731] --kv-cache-dtype is not supported by the V1 Engine. Falling back to V0.
vLLM STDOUT: INFO 09-11 14:37:24 [config.py:1316] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
vLLM STDOUT: INFO 09-11 14:37:24 [api_server.py:246] Started engine process with PID 2660991
vLLM STDOUT: INFO 09-11 14:37:27 [__init__.py:239] Automatically detected platform cuda.
vLLM STDOUT: INFO 09-11 14:37:29 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='unsloth/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.2-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":272}, use_cached_outputs=True,
vLLM STDOUT: INFO 09-11 14:37:31 [cuda.py:273] Cannot use FlashAttention backend for FP8 KV cache.
vLLM STDOUT: WARNING 09-11 14:37:31 [cuda.py:275] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable VLLM_ATTENTION_BACKEND=FLASHINFER
vLLM STDOUT: INFO 09-11 14:37:31 [cuda.py:289] Using XFormers backend.
vLLM STDOUT: INFO 09-11 14:37:31 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
vLLM STDOUT: INFO 09-11 14:37:31 [model_runner.py:1110] Starting to load model unsloth/Llama-3.2-3B-Instruct...
vLLM STDOUT: INFO 09-11 14:37:32 [weight_utils.py:265] Using model weights format ['*.safetensors']
vLLM STDOUT: INFO 09-11 14:37:33 [loader.py:458] Loading weights took 0.88 seconds
vLLM STDOUT: INFO 09-11 14:37:33 [model_runner.py:1146] Model loading took 6.0160 GiB and 1.512391 seconds
vLLM STDOUT: INFO 09-11 14:37:38 [backends.py:416] Using cache directory: /home/weissl/.cache/vllm/torch_compile_cache/37e66abc53/rank_0_0 for vLLM's torch.compile
vLLM STDOUT: INFO 09-11 14:37:38 [backends.py:426] Dynamo bytecode transform time: 4.22 s
vLLM STDOUT: INFO 09-11 14:37:40 [backends.py:115] Directly load the compiled graph for shape None from the cache
vLLM STDOUT: INFO 09-11 14:37:41 [monitor.py:33] torch.compile takes 4.22 s in total
vLLM STDOUT: INFO 09-11 14:37:41 [worker.py:267] Memory profiling takes 7.65 seconds
vLLM STDOUT: INFO 09-11 14:37:41 [worker.py:267] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.79) = 35.18GiB
vLLM STDOUT: INFO 09-11 14:37:41 [worker.py:267] model weights take 6.02GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.25GiB; the rest of the memory reserved for KV Cache is 27.83GiB.
vLLM STDOUT: INFO 09-11 14:37:41 [executor_base.py:112] # cuda blocks: 32570, # CPU blocks: 7021
vLLM STDOUT: INFO 09-11 14:37:41 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 254.45x
vLLM STDOUT: INFO 09-11 14:37:45 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
vLLM STDOUT: INFO 09-11 14:38:07 [model_runner.py:1598] Graph capturing finished in 21 secs, took 0.26 GiB
vLLM STDOUT: INFO 09-11 14:38:07 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 33.42 seconds
vLLM STDOUT: WARNING 09-11 14:38:08 [config.py:1177] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
vLLM STDOUT: INFO 09-11 14:38:08 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
vLLM STDOUT: INFO 09-11 14:38:08 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
vLLM STDOUT: INFO 09-11 14:38:08 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
--- vLLM Server Ready (Detected: 'Starting vLLM API server on') ---

Steps to reproduce
- Generate data from PDFs (works as intended):

```python
for pdf in pdfs:
    !synthetic-data-kit -c synthetic_data_kit_config.yaml ingest "{pdf}" -o data/output/industrial_edge-
```

- Chunk the data using `unsloth.SyntheticDataKit.chunk_data` (works as intended).
- Run the loop to create QA pairs (this is where vLLM crashes; see the memory-check sketch after this list):

```python
for filename in tqdm(glob.glob("data/output/*/chunked/")):
    !synthetic-data-kit -c synthetic_data_kit_config.yaml create {filename} --num-pairs 20 --type "qa"
    time.sleep(5)
    torch.cuda.empty_cache()
```
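To confirm or rule out the memory hypothesis, I can also log the actual GPU memory between `create` calls with a small `nvidia-smi` helper. As far as I understand, `torch.cuda.empty_cache()` in the notebook only affects the notebook's own process, not the separately spawned vLLM server, so this query is the more reliable check.

```python
import subprocess

def print_gpu_memory() -> None:
    # Ask nvidia-smi for per-GPU memory usage; unlike torch.cuda APIs in the
    # notebook, this reflects all processes, including the vLLM server.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
```

Calling `print_gpu_memory()` before and after each `create` should show whether memory keeps growing, or whether it stays flat while the server simply stops responding.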