Labels: bug (Something isn't working)
Description
Your current environment
Avg generation throughput: 5.3 tokens/s, Running: 1 reqs — far too slow for this hardware.
🐛 Describe the bug
Use quay.io/ascend/vllm-ascend:v0.11.0rc2 as the base image:
FROM quay.io/ascend/vllm-ascend:v0.11.0rc2
RUN wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run
RUN chmod a+x Ascend-BiSheng-toolkit_aarch64.run
RUN ./Ascend-BiSheng-toolkit_aarch64.run --install
# `source` in its own RUN layer does not persist to later layers or the container;
# write it into the shell profile instead
RUN echo "source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh" >> /root/.bashrc
RUN wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
RUN pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
Build the image:
docker build -t quay.io/ascend/vllm-ascend:v0.11.0rc2.next -f Dockerfile_vllm_11_0_next .
Deploy Qwen3-Next-80B:
docker run -d --name qwen3-Next-80B \
--privileged \
--shm-size=200g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /usr/local/dcmi:/usr/local/dcmi:ro \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
-v /etc/ascend_install.info:/etc/ascend_install.info:ro \
-v /data/models/Qwen/Qwen3-Next-80B-A3B-Instruct:/data1/Qwen3-Next-80B-A3B-Instruct:ro \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:32 \
-e VLLM_USE_MODELSCOPE=true \
-p 8000:8000 \
quay.io/ascend/vllm-ascend:v0.11.0rc2.next \
vllm serve /data1/Qwen3-Next-80B-A3B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.7 \
--enforce-eager \
--served-model-name Qwen3-Next-80B-A3B-Instruct
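For context on why only ~4 GiB per NPU ends up available for the KV cache later in the log, here is a back-of-envelope budget under the flags above. The 60.96 GiB total and 37.45 GiB weight figures are taken from the worker log below; the simple split is an assumption, not vLLM's exact accounting:

```python
# Rough memory budget per NPU for the serve command above (a sketch, not
# vLLM's actual bookkeeping). Figures come from the log further down.
GIB = 1024 ** 3

total_hbm_gib = 65_452_113_920 / GIB   # "total memory" reported per worker
weights_gib = 37.4525                   # "Loading model weights took 37.4525 GB"
gpu_memory_utilization = 0.7            # value passed to vllm serve

# vLLM only claims util * total HBM; weights come out of that budget first.
budget_gib = gpu_memory_utilization * total_hbm_gib
kv_cache_gib = budget_gib - weights_gib

print(f"budget ~{budget_gib:.1f} GiB, KV cache ~{kv_cache_gib:.1f} GiB")
```

The ~5 GiB estimate lines up (minus activation overhead) with the ~3.8 GiB "Available memory" each worker reports below, which explains the small 82,560-token KV cache.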
Request payload:
{
"model": "Qwen3-Next-80B-A3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant!"},
{"role": "user", "content": "what can you do?"}
],
"stream": false
}
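The payload above can be built and serialized like this (a minimal sketch; note that the "model" field must exactly match one of the names passed via --served-model-name):

```python
import json

# Chat-completions request body; "model" must match a --served-model-name value.
payload = {
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant!"},
        {"role": "user", "content": "what can you do?"},
    ],
    "stream": False,
}
body = json.dumps(payload)  # POST this to /v1/chat/completions
```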
INFO 11-22 08:07:16 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:07:16 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:07:16 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:07:16 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:07:22 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(APIServer pid=1) INFO 11-22 08:07:24 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 11-22 08:07:24 [utils.py:233] non-default args: {'model_tag': '/data1/Qwen3-Next-80B-A3B-Instruct', 'api_key': ['9c8f3c6abb8eff91356e0056d3e43df0'], 'model': '/data1/Qwen3-Next-80B-A3B-Instruct', 'max_model_len': 32768, 'enforce_eager': True, 'served_model_name': ['Qwen3-235B-A22B-Instruct-2507'], 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.7}
(APIServer pid=1) INFO 11-22 08:07:41 [model.py:547] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 11-22 08:07:41 [model.py:1510] Using max model len 32768
(APIServer pid=1) INFO 11-22 08:07:41 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 11-22 08:07:41 [config.py:297] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=1) INFO 11-22 08:07:41 [config.py:308] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=1) INFO 11-22 08:07:42 [__init__.py:381] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 11-22 08:07:42 [platform.py:152] Compilation disabled, using eager mode by default
(APIServer pid=1) WARNING 11-22 08:07:42 [platform.py:275] If chunked prefill or prefix caching is enabled, block size must be set to 128.
(APIServer pid=1) WARNING 11-22 08:07:42 [platform.py:282] When running qwen3-next model, block_size needs to be restored to its original value.
INFO 11-22 08:07:51 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:07:51 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:07:51 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:07:51 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:07:57 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(EngineCore_DP0 pid=818) INFO 11-22 08:07:57 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(EngineCore_DP0 pid=818) INFO 11-22 08:07:57 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/data1/Qwen3-Next-80B-A3B-Instruct', speculative_config=None, tokenizer='/data1/Qwen3-Next-80B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-235B-A22B-Instruct-2507, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [multiproc_executor.py:720] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=818) INFO 11-22 08:07:57 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_3a4666fb'), local_subscribe_addr='ipc:///tmp/e6e0ef2f-fe7d-450c-bc80-071ba8c7a024', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:05 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:05 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:05 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:05 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:08:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:06 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:06 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:08:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:06 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:06 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:08:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:06 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:06 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:08:12 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
[... the same _custom_ops and registry.py "already registered" warnings repeat once per TP worker process ...]
INFO 11-22 08:08:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3272d29d'), local_subscribe_addr='ipc:///tmp/48684b2e-2bd0-4642-a7a8-f81f0a236647', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f587ee33'), local_subscribe_addr='ipc:///tmp/f932e878-740b-4f6a-b8ad-258fe274a215', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_dcacadee'), local_subscribe_addr='ipc:///tmp/7057c2f7-fe92-4a3a-afda-a3c7f5fd71b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a2636003'), local_subscribe_addr='ipc:///tmp/dfcfe164-3aee-4b47-9b40-70d97176bad4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_1fdfe493'), local_subscribe_addr='ipc:///tmp/68345f8f-cded-4c19-ba07-a1285752f245', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(Worker_TP0 pid=954) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
(Worker_TP3 pid=957) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
(Worker_TP1 pid=955) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
(Worker_TP2 pid=956) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/41 [00:00<?, ?it/s]
[... per-shard progress lines (2%–78%) omitted ...]
(Worker_TP2 pid=956) INFO 11-22 08:09:57 [default_loader.py:267] Loading weights took 82.99 seconds
Loading safetensors checkpoint shards: 80% Completed | 33/41 [01:24<00:20, 2.53s/it]
(Worker_TP2 pid=956) INFO 11-22 08:09:59 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
Loading safetensors checkpoint shards: 83% Completed | 34/41 [01:27<00:18, 2.60s/it]
Loading safetensors checkpoint shards: 85% Completed | 35/41 [01:30<00:15, 2.61s/it]
Loading safetensors checkpoint shards: 88% Completed | 36/41 [01:33<00:13, 2.68s/it]
Loading safetensors checkpoint shards: 90% Completed | 37/41 [01:36<00:11, 2.76s/it]
(Worker_TP3 pid=957) INFO 11-22 08:10:13 [default_loader.py:267] Loading weights took 99.08 seconds
Loading safetensors checkpoint shards: 93% Completed | 38/41 [01:39<00:08, 2.94s/it]
(Worker_TP3 pid=957) INFO 11-22 08:10:15 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
Loading safetensors checkpoint shards: 95% Completed | 39/41 [01:42<00:06, 3.04s/it]
Loading safetensors checkpoint shards: 98% Completed | 40/41 [01:45<00:03, 3.11s/it]
Loading safetensors checkpoint shards: 100% Completed | 41/41 [01:49<00:00, 3.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 41/41 [01:49<00:00, 2.66s/it]
(Worker_TP0 pid=954)
(Worker_TP0 pid=954) INFO 11-22 08:10:23 [default_loader.py:267] Loading weights took 109.16 seconds
(Worker_TP0 pid=954) INFO 11-22 08:10:24 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
(Worker_TP1 pid=955) INFO 11-22 08:11:13 [default_loader.py:267] Loading weights took 158.22 seconds
(Worker_TP1 pid=955) INFO 11-22 08:11:13 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
(Worker_TP1 pid=955) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP0 pid=954) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP3 pid=957) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP2 pid=956) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP0 pid=954) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4062010368, total memory: 65452113920
(Worker_TP1 pid=955) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4068736000, total memory: 65452113920
(Worker_TP3 pid=957) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4071652352, total memory: 65452113920
(Worker_TP2 pid=956) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4066769920, total memory: 65452113920
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.66x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.69x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.67x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.69x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [core.py:210] init engine (profile, create kv cache, warmup model) took 3.75 seconds
(EngineCore_DP0 pid=818) INFO 11-22 08:11:18 [__init__.py:381] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=818) INFO 11-22 08:11:18 [platform.py:152] Compilation disabled, using eager mode by default
(EngineCore_DP0 pid=818) WARNING 11-22 08:11:18 [platform.py:275] If chunked prefill or prefix caching is enabled, block size must be set to 128.
(EngineCore_DP0 pid=818) WARNING 11-22 08:11:18 [platform.py:282] When running qwen3-next model, block_size needs to be restored to its original value.
(APIServer pid=1) INFO 11-22 08:11:18 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 860
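As a sanity check on the two cache figures above (an inference from the log, not a documented relationship):

```python
# 860 GPU blocks producing an 82,560-token KV cache implies 96 tokens per block.
num_gpu_blocks = 860      # from the cache_config_info log line
kv_cache_tokens = 82_560  # from the kv_cache_utils log lines
block_size = kv_cache_tokens // num_gpu_blocks
print(block_size)  # -> 96
```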
(APIServer pid=1) INFO 11-22 08:11:19 [api_server.py:1634] Supported_tasks: ['generate']
(APIServer pid=1) WARNING 11-22 08:11:19 [model.py:1389] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 11-22 08:11:19 [serving_responses.py:137] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 11-22 08:11:19 [serving_chat.py:139] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 11-22 08:11:19 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 11-22 08:11:19 [api_server.py:1912] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:34] Available routes are:
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /health, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /load, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /ping, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /ping, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /version, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /classify, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /score, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) ERROR 11-22 08:11:43 [serving_chat.py:178] Error with model error=ErrorInfo(message='The model `Qwen3-Next-80B` does not exist.', type='NotFoundError', param=None, code=404)
(APIServer pid=1) INFO: 172.17.0.1:47806 - "POST /v1/chat/completions HTTP/1.0" 404 Not Found
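(Note: the 404 above is separate from the throughput problem. The request used the short name `Qwen3-Next-80B`, but `vllm serve` registers the model under the id it was launched with, here the path `/data1/Qwen3-Next-80B-A3B-Instruct`, unless `--served-model-name` is passed. A minimal sketch of building a request with the served id, assuming the host/port from the `docker run` command above:)

```shell
# Served id matches the path given to `vllm serve` (or --served-model-name, if set).
SERVED_ID="/data1/Qwen3-Next-80B-A3B-Instruct"
BODY=$(printf '{"model": "%s", "messages": [{"role": "user", "content": "hi"}]}' "$SERVED_ID")
echo "$BODY"
# List the ids the server actually registered:
#   curl -s http://localhost:8000/v1/models
# Send the request with the exact served id:
#   curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d "$BODY"
```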
(APIServer pid=1) INFO 11-22 08:12:04 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:12:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:12:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:12:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:12:22 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(APIServer pid=1) INFO 11-22 08:13:09 [loggers.py:127] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:19 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:39 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:49 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:59 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:47808 - "POST /v1/chat/completions HTTP/1.0" 200 OK
(APIServer pid=1) INFO 11-22 08:14:09 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:14:19 [loggers.py:127] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 4.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:14:29 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:47810 - "POST /v1/chat/completions HTTP/1.0" 200 OK
(APIServer pid=1) INFO 11-22 08:14:39 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:14:49 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%