[Feature]: Model Support: Qwen3 Token Classification #30107

@bd2lcco

Description

🚀 The feature, motivation and pitch

Would it be possible to support Qwen3ForTokenClassification (https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3/modeling_qwen3.py#L541) for token classification? There is currently no such class in vLLM's model registry (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py#L240), so the architecture cannot be selected even when it is set explicitly via --hf-overrides.

For example:

vllm serve /tmp/sg_mount/mlop_pii_svc/pii_model_60K --tensor-parallel-size 1 --runner pooling --host 0.0.0.0 --hf-overrides '{"architectures": ["Qwen3ForTokenClassification"]}' --port 8080 --task classify
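For context, the same checkpoint works as a token-classification model under plain Transformers, which is the behavior being requested from vLLM. A minimal sketch, assuming the checkpoint is a standard Hugging Face token-classification fine-tune (the sample sentence is illustrative, not from the original report):

```python
# Sanity check under plain Transformers: one label prediction per input token.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_path = "/tmp/sg_mount/mlop_pii_svc/pii_model_60K"  # path from the command above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path, torch_dtype=torch.float16)
model.eval()

inputs = tokenizer("Alice lives in Singapore.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (batch, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print([(tok, model.config.id2label[i]) for tok, i in zip(tokens, pred_ids)])
```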

Full logs:

bash-5.1# vllm serve /tmp/sg_mount/mlop_pii_svc/pii_model_60K --tensor-parallel-size 1 --runner pooling --host 0.0.0.0 --hf-overrides '{"architectures": ["Qwen3ForTokenClassification"]}' --port 8080 --task classify
INFO 12-05 04:20:43 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 12-05 04:20:43 [argparse_utils.py:90] argument 'task' is deprecated
(APIServer pid=10278) INFO 12-05 04:20:43 [api_server.py:1977] vLLM API server version 0.11.2
(APIServer pid=10278) INFO 12-05 04:20:43 [utils.py:253] non-default args: {'model_tag': '/tmp/sg_mount/mlop_pii_svc/pii_model_60K', 'host': '0.0.0.0', 'port': 8080, 'model': '/tmp/sg_mount/mlop_pii_svc/pii_model_60K', 'runner': 'pooling', 'task': 'classify', 'hf_overrides': {'architectures': ['Qwen3ForTokenClassification']}}
(APIServer pid=10278) INFO 12-05 04:20:48 [model.py:631] Resolved architecture: TransformersForCausalLM
(APIServer pid=10278) INFO 12-05 04:20:48 [model.py:1968] Downcasting torch.float32 to torch.float16.
(APIServer pid=10278) INFO 12-05 04:20:48 [model.py:1745] Using max model len 40960
(APIServer pid=10278) INFO 12-05 04:20:48 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=10278) WARNING 12-05 04:20:48 [vllm.py:486] Pooling models do not support full cudagraphs. Overriding cudagraph_mode to PIECEWISE.
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:53 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/tmp/sg_mount/mlop_pii_svc/pii_model_60K', speculative_config=None, tokenizer='/tmp/sg_mount/mlop_pii_svc/pii_model_60K', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/tmp/sg_mount/mlop_pii_svc/pii_model_60K, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=None, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.42.18.139:46997 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=10543) WARNING 12-05 04:20:54 [utils.py:177] TransformersForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [gpu_model_runner.py:3259] Starting to load model /tmp/sg_mount/mlop_pii_svc/pii_model_60K...
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [base.py:121] Using Transformers modeling backend.
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [cuda.py:377] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore_DP0 pid=10543) [2025-12-05 04:20:54] INFO _optional_torch_c_dlpack.py:119: JIT-compiling torch-c-dlpack-ext to cache...
(EngineCore_DP0 pid=10543) /usr/local/lib/python3.12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:161: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled.
(EngineCore_DP0 pid=10543) We recommend installing via `pip install torch-c-dlpack-ext`
(EngineCore_DP0 pid=10543)   warnings.warn(
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] EngineCore failed to start.
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] Traceback (most recent call last):
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     super().__init__(
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     self._init_executor()
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     self.driver_worker.load_model()
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 292, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     loaded_weights = model.load_weights(
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 336, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     return super().load_weights(weights)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 225, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     return orig_cls.load_weights(self, weights)  # type: ignore
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 454, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 332, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 317, in _load_module
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842]     raise ValueError(msg)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ValueError: There is no module or parameter named 'classifier' in TransformersForSequenceClassification
(EngineCore_DP0 pid=10543) Process EngineCore_DP0:
(EngineCore_DP0 pid=10543) Traceback (most recent call last):
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=10543)     self.run()
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=10543)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=10543)     raise e
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=10543)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=10543)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=10543)     super().__init__(
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=10543)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=10543)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=10543)     self._init_executor()
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=10543)     self.driver_worker.load_model()
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=10543)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(EngineCore_DP0 pid=10543)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=10543)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=10543)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 292, in load_weights
(EngineCore_DP0 pid=10543)     loaded_weights = model.load_weights(
(EngineCore_DP0 pid=10543)                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 336, in load_weights
(EngineCore_DP0 pid=10543)     return super().load_weights(weights)
(EngineCore_DP0 pid=10543)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 225, in load_weights
(EngineCore_DP0 pid=10543)     return orig_cls.load_weights(self, weights)  # type: ignore
(EngineCore_DP0 pid=10543)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 454, in load_weights
(EngineCore_DP0 pid=10543)     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=10543)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 332, in load_weights
(EngineCore_DP0 pid=10543)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=10543)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543)   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 317, in _load_module
(EngineCore_DP0 pid=10543)     raise ValueError(msg)
(EngineCore_DP0 pid=10543) ValueError: There is no module or parameter named 'classifier' in TransformersForSequenceClassification
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:02<?, ?it/s]
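What the trace shows: because Qwen3ForTokenClassification is not a registered architecture, vLLM resolves the model to TransformersForCausalLM, the pooling runner wraps that fallback as TransformersForSequenceClassification, and weight loading then fails because the checkpoint's token-classification head (classifier) has no counterpart in the sequence-classification wrapper. The missing registration can be checked directly; a hedged sketch (the expected outcomes in the comments are my assumptions for v0.11.2, not verified output):

```python
# List the architectures vLLM registers and check for the token-classification variant.
from vllm.model_executor.models import ModelRegistry

archs = ModelRegistry.get_supported_archs()
print("Qwen3ForSequenceClassification" in archs)  # presumably True (reranker support)
print("Qwen3ForTokenClassification" in archs)     # False today, hence this request
```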

Alternatives

No response

Additional context

No response

