Labels: bug (Something isn't working)
Description
Your current environment
Avg generation throughput: 5.3 tokens/s, Running: 1 reqs — far too slow for this hardware.
🐛 Describe the bug
Use quay.io/ascend/vllm-ascend:v0.11.0rc2 as the base image:
FROM quay.io/ascend/vllm-ascend:v0.11.0rc2
RUN wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run
RUN chmod a+x Ascend-BiSheng-toolkit_aarch64.run
RUN ./Ascend-BiSheng-toolkit_aarch64.run --install
# `source` in its own RUN layer does not persist to later layers or the container;
# write it into the shell profile instead
RUN echo "source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh" >> /root/.bashrc
RUN wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
RUN pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
Build the image:
docker build -t quay.io/ascend/vllm-ascend:v0.11.0rc2.next -f Dockerfile_vllm_11_0_next .
Deploy Qwen3-Next-80B:
docker run -d --name qwen3-Next-80B \
--privileged \
--shm-size=200g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /usr/local/dcmi:/usr/local/dcmi:ro \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
-v /etc/ascend_install.info:/etc/ascend_install.info:ro \
-v /data/models/Qwen/Qwen3-Next-80B-A3B-Instruct:/data1/Qwen3-Next-80B-A3B-Instruct:ro \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:32 \
-e VLLM_USE_MODELSCOPE=true \
-p 8000:8000 \
quay.io/ascend/vllm-ascend:v0.11.0rc2.next \
vllm serve /data1/Qwen3-Next-80B-A3B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.7 \
--enforce-eager \
--served-model-name Qwen3-Next-80B-A3B-Instruct
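For context on why only ~4 GiB per NPU ends up available for the KV cache later in the log, here is a back-of-envelope budget under the flags above. The 60.96 GiB total and 37.45 GiB weight figures are taken from the worker log below; the simple split is an assumption, not vLLM's exact accounting:

```python
# Rough memory budget per NPU for the serve command above (a sketch, not
# vLLM's actual bookkeeping). Figures come from the log further down.
GIB = 1024 ** 3

total_hbm_gib = 65_452_113_920 / GIB   # "total memory" reported per worker
weights_gib = 37.4525                   # "Loading model weights took 37.4525 GB"
gpu_memory_utilization = 0.7            # value passed to vllm serve

# vLLM only claims util * total HBM; weights come out of that budget first.
budget_gib = gpu_memory_utilization * total_hbm_gib
kv_cache_gib = budget_gib - weights_gib

print(f"budget ~{budget_gib:.1f} GiB, KV cache ~{kv_cache_gib:.1f} GiB")
```

The ~5 GiB estimate lines up (minus activation overhead) with the ~3.8 GiB "Available memory" each worker reports below, which explains the small 82,560-token KV cache.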
Request payload:
{
"model": "Qwen3-Next-80B-A3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant!"},
{"role": "user", "content": "what can you do?"}
],
"stream": false
}
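The payload above can be built and serialized like this (a minimal sketch; note that the "model" field must exactly match one of the names passed via --served-model-name):

```python
import json

# Chat-completions request body; "model" must match a --served-model-name value.
payload = {
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant!"},
        {"role": "user", "content": "what can you do?"},
    ],
    "stream": False,
}
body = json.dumps(payload)  # POST this to /v1/chat/completions
```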
INFO 11-22 08:07:16 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:07:16 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:07:16 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:07:16 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:07:22 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
WARNING 11-22 08:07:24 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(APIServer pid=1) INFO 11-22 08:07:24 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 11-22 08:07:24 [utils.py:233] non-default args: {'model_tag': '/data1/Qwen3-Next-80B-A3B-Instruct', 'api_key': ['9c8f3c6abb8eff91356e0056d3e43df0'], 'model': '/data1/Qwen3-Next-80B-A3B-Instruct', 'max_model_len': 32768, 'enforce_eager': True, 'served_model_name': ['Qwen3-235B-A22B-Instruct-2507'], 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.7}
(APIServer pid=1) INFO 11-22 08:07:41 [model.py:547] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 11-22 08:07:41 [model.py:1510] Using max model len 32768
(APIServer pid=1) INFO 11-22 08:07:41 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 11-22 08:07:41 [config.py:297] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=1) INFO 11-22 08:07:41 [config.py:308] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=1) INFO 11-22 08:07:42 [__init__.py:381] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 11-22 08:07:42 [platform.py:152] Compilation disabled, using eager mode by default
(APIServer pid=1) WARNING 11-22 08:07:42 [platform.py:275] If chunked prefill or prefix caching is enabled, block size must be set to 128.
(APIServer pid=1) WARNING 11-22 08:07:42 [platform.py:282] When running qwen3-next model, block_size needs to be restored to its original value.
INFO 11-22 08:07:51 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:07:51 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:07:51 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:07:51 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:07:57 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(EngineCore_DP0 pid=818) INFO 11-22 08:07:57 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(EngineCore_DP0 pid=818) INFO 11-22 08:07:57 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/data1/Qwen3-Next-80B-A3B-Instruct', speculative_config=None, tokenizer='/data1/Qwen3-Next-80B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-235B-A22B-Instruct-2507, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=818) WARNING 11-22 08:07:57 [multiproc_executor.py:720] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=818) INFO 11-22 08:07:57 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_3a4666fb'), local_subscribe_addr='ipc:///tmp/e6e0ef2f-fe7d-450c-bc80-071ba8c7a024', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:05 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:05 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:05 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:05 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:08:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:06 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:06 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:08:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:06 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:06 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:08:06 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:08:06 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:08:06 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:08:06 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:08:12 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
[... the same _custom_ops and registry.py "already registered" warnings repeat once per TP worker process ...]
INFO 11-22 08:08:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3272d29d'), local_subscribe_addr='ipc:///tmp/48684b2e-2bd0-4642-a7a8-f81f0a236647', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f587ee33'), local_subscribe_addr='ipc:///tmp/f932e878-740b-4f6a-b8ad-258fe274a215', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_dcacadee'), local_subscribe_addr='ipc:///tmp/7057c2f7-fe92-4a3a-afda-a3c7f5fd71b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a2636003'), local_subscribe_addr='ipc:///tmp/dfcfe164-3aee-4b47-9b40-70d97176bad4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_1fdfe493'), local_subscribe_addr='ipc:///tmp/68345f8f-cded-4c19-ba07-a1285752f245', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-22 08:08:30 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(Worker_TP0 pid=954) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
(Worker_TP3 pid=957) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
(Worker_TP1 pid=955) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
(Worker_TP2 pid=956) INFO 11-22 08:08:31 [model_runner_v1.py:2641] Starting to load model /data1/Qwen3-Next-80B-A3B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/41 [00:00<?, ?it/s]
[... per-shard progress lines (2%–78%) omitted ...]
(Worker_TP2 pid=956) INFO 11-22 08:09:57 [default_loader.py:267] Loading weights took 82.99 seconds
Loading safetensors checkpoint shards: 80% Completed | 33/41 [01:24<00:20, 2.53s/it]
(Worker_TP2 pid=956) INFO 11-22 08:09:59 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
Loading safetensors checkpoint shards: 83% Completed | 34/41 [01:27<00:18, 2.60s/it]
Loading safetensors checkpoint shards: 85% Completed | 35/41 [01:30<00:15, 2.61s/it]
Loading safetensors checkpoint shards: 88% Completed | 36/41 [01:33<00:13, 2.68s/it]
Loading safetensors checkpoint shards: 90% Completed | 37/41 [01:36<00:11, 2.76s/it]
(Worker_TP3 pid=957) INFO 11-22 08:10:13 [default_loader.py:267] Loading weights took 99.08 seconds
Loading safetensors checkpoint shards: 93% Completed | 38/41 [01:39<00:08, 2.94s/it]
(Worker_TP3 pid=957) INFO 11-22 08:10:15 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
Loading safetensors checkpoint shards: 95% Completed | 39/41 [01:42<00:06, 3.04s/it]
Loading safetensors checkpoint shards: 98% Completed | 40/41 [01:45<00:03, 3.11s/it]
Loading safetensors checkpoint shards: 100% Completed | 41/41 [01:49<00:00, 3.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 41/41 [01:49<00:00, 2.66s/it]
(Worker_TP0 pid=954)
(Worker_TP0 pid=954) INFO 11-22 08:10:23 [default_loader.py:267] Loading weights took 109.16 seconds
(Worker_TP0 pid=954) INFO 11-22 08:10:24 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
(Worker_TP1 pid=955) INFO 11-22 08:11:13 [default_loader.py:267] Loading weights took 158.22 seconds
(Worker_TP1 pid=955) INFO 11-22 08:11:13 [model_runner_v1.py:2667] Loading model weights took 37.4525 GB
(Worker_TP1 pid=955) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP0 pid=954) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP3 pid=957) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP2 pid=956) WARNING 11-22 08:11:14 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(Worker_TP0 pid=954) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4062010368, total memory: 65452113920
(Worker_TP1 pid=955) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4068736000, total memory: 65452113920
(Worker_TP3 pid=957) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4071652352, total memory: 65452113920
(Worker_TP2 pid=956) INFO 11-22 08:11:17 [worker_v1.py:256] Available memory: 4066769920, total memory: 65452113920
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.66x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.69x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.67x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1087] GPU KV cache size: 82,560 tokens
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 9.69x
(EngineCore_DP0 pid=818) INFO 11-22 08:11:17 [core.py:210] init engine (profile, create kv cache, warmup model) took 3.75 seconds
(EngineCore_DP0 pid=818) INFO 11-22 08:11:18 [__init__.py:381] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=818) INFO 11-22 08:11:18 [platform.py:152] Compilation disabled, using eager mode by default
(EngineCore_DP0 pid=818) WARNING 11-22 08:11:18 [platform.py:275] If chunked prefill or prefix caching is enabled, block size must be set to 128.
(EngineCore_DP0 pid=818) WARNING 11-22 08:11:18 [platform.py:282] When running qwen3-next model, block_size needs to be restored to its original value.
(APIServer pid=1) INFO 11-22 08:11:18 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 860
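As a sanity check on the two cache figures above (an inference from the log, not a documented relationship):

```python
# 860 GPU blocks producing an 82,560-token KV cache implies 96 tokens per block.
num_gpu_blocks = 860      # from the cache_config_info log line
kv_cache_tokens = 82_560  # from the kv_cache_utils log lines
block_size = kv_cache_tokens // num_gpu_blocks
print(block_size)  # -> 96
```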
(APIServer pid=1) INFO 11-22 08:11:19 [api_server.py:1634] Supported_tasks: ['generate']
(APIServer pid=1) WARNING 11-22 08:11:19 [model.py:1389] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 11-22 08:11:19 [serving_responses.py:137] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 11-22 08:11:19 [serving_chat.py:139] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 11-22 08:11:19 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 11-22 08:11:19 [api_server.py:1912] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:34] Available routes are:
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /health, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /load, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /ping, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /ping, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /version, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /classify, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /score, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 11-22 08:11:19 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) ERROR 11-22 08:11:43 [serving_chat.py:178] Error with model error=ErrorInfo(message='The model `Qwen3-Next-80B` does not exist.', type='NotFoundError', param=None, code=404)
(APIServer pid=1) INFO: 172.17.0.1:47806 - "POST /v1/chat/completions HTTP/1.0" 404 Not Found
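(Note: the 404 above is separate from the throughput problem. The request used the short name `Qwen3-Next-80B`, but `vllm serve` registers the model under the id it was launched with, here the path `/data1/Qwen3-Next-80B-A3B-Instruct`, unless `--served-model-name` is passed. A minimal sketch of building a request with the served id, assuming the host/port from the `docker run` command above:)

```shell
# Served id matches the path given to `vllm serve` (or --served-model-name, if set).
SERVED_ID="/data1/Qwen3-Next-80B-A3B-Instruct"
BODY=$(printf '{"model": "%s", "messages": [{"role": "user", "content": "hi"}]}' "$SERVED_ID")
echo "$BODY"
# List the ids the server actually registered:
#   curl -s http://localhost:8000/v1/models
# Send the request with the exact served id:
#   curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d "$BODY"
```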
(APIServer pid=1) INFO 11-22 08:12:04 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
INFO 11-22 08:12:15 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-22 08:12:15 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-22 08:12:15 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-22 08:12:15 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-22 08:12:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:12:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:12:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-22 08:12:22 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(APIServer pid=1) INFO 11-22 08:13:09 [loggers.py:127] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:19 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:39 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:49 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:13:59 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:47808 - "POST /v1/chat/completions HTTP/1.0" 200 OK
(APIServer pid=1) INFO 11-22 08:14:09 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:14:19 [loggers.py:127] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 4.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:14:29 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:47810 - "POST /v1/chat/completions HTTP/1.0" 200 OK
(APIServer pid=1) INFO 11-22 08:14:39 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-22 08:14:49 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%