Status: Closed
Labels: feature request
🚀 The feature, motivation and pitch
Could support be added for Qwen3ForTokenClassification (https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3/modeling_qwen3.py#L541) for token classification? There is currently no such class registered in vLLM (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py#L240), so the architecture cannot be selected via --hf-overrides.
For example:
vllm serve /tmp/sg_mount/mlop_pii_svc/pii_model_60K --tensor-parallel-size 1 --runner pooling --host 0.0.0.0 --hf-overrides '{"architectures": ["Qwen3ForTokenClassification"]}' --port 8080 --task classify
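For context, token classification produces one label per input token, rather than one label per sequence as the TransformersForSequenceClassification fallback in the logs below assumes. A minimal, self-contained sketch of the post-processing such a head implies for a PII use case; the label set, tokens, and logits here are invented for illustration and are not from the actual model:

```python
def decode_token_labels(logits, id2label):
    """Pick the highest-scoring label for each token (argmax over logits)."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in logits]

def group_spans(tokens, labels, outside="O"):
    """Group consecutive identically-labelled tokens into (label, text) spans."""
    spans = []
    for token, label in zip(tokens, labels):
        if label == outside:
            continue
        if spans and spans[-1][0] == label:
            spans[-1] = (label, spans[-1][1] + " " + token)
        else:
            spans.append((label, token))
    return spans

id2label = {0: "O", 1: "NAME", 2: "EMAIL"}  # hypothetical PII label set
tokens = ["Contact", "Alice", "at", "alice@example.com"]
logits = [                # one row of per-class scores per token (made up)
    [2.0, 0.1, 0.1],      # -> O
    [0.1, 3.0, 0.2],      # -> NAME
    [1.5, 0.3, 0.1],      # -> O
    [0.1, 0.2, 4.0],      # -> EMAIL
]
labels = decode_token_labels(logits, id2label)
print(labels)                        # ['O', 'NAME', 'O', 'EMAIL']
print(group_spans(tokens, labels))   # [('NAME', 'Alice'), ('EMAIL', 'alice@example.com')]
```

This is why the per-sequence pooling path fails to load the checkpoint: the token-classification head stores its weights under `classifier`, which the sequence-classification adapter does not expect.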
Full logs:
bash-5.1# vllm serve /tmp/sg_mount/mlop_pii_svc/pii_model_60K --tensor-parallel-size 1 --runner pooling --host 0.0.0.0 --hf-overrides '{"architectures": ["Qwen3ForTokenClassification"]}' --port 8080 --task classify
INFO 12-05 04:20:43 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 12-05 04:20:43 [argparse_utils.py:90] argument 'task' is deprecated
(APIServer pid=10278) INFO 12-05 04:20:43 [api_server.py:1977] vLLM API server version 0.11.2
(APIServer pid=10278) INFO 12-05 04:20:43 [utils.py:253] non-default args: {'model_tag': '/tmp/sg_mount/mlop_pii_svc/pii_model_60K', 'host': '0.0.0.0', 'port': 8080, 'model': '/tmp/sg_mount/mlop_pii_svc/pii_model_60K', 'runner': 'pooling', 'task': 'classify', 'hf_overrides': {'architectures': ['Qwen3ForTokenClassification']}}
(APIServer pid=10278) INFO 12-05 04:20:48 [model.py:631] Resolved architecture: TransformersForCausalLM
(APIServer pid=10278) INFO 12-05 04:20:48 [model.py:1968] Downcasting torch.float32 to torch.float16.
(APIServer pid=10278) INFO 12-05 04:20:48 [model.py:1745] Using max model len 40960
(APIServer pid=10278) INFO 12-05 04:20:48 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=10278) WARNING 12-05 04:20:48 [vllm.py:486] Pooling models do not support full cudagraphs. Overriding cudagraph_mode to PIECEWISE.
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:53 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/tmp/sg_mount/mlop_pii_svc/pii_model_60K', speculative_config=None, tokenizer='/tmp/sg_mount/mlop_pii_svc/pii_model_60K', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/tmp/sg_mount/mlop_pii_svc/pii_model_60K, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=None, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 
'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.42.18.139:46997 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=10543) WARNING 12-05 04:20:54 [utils.py:177] TransformersForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [gpu_model_runner.py:3259] Starting to load model /tmp/sg_mount/mlop_pii_svc/pii_model_60K...
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [base.py:121] Using Transformers modeling backend.
(EngineCore_DP0 pid=10543) INFO 12-05 04:20:54 [cuda.py:377] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore_DP0 pid=10543) [2025-12-05 04:20:54] INFO _optional_torch_c_dlpack.py:119: JIT-compiling torch-c-dlpack-ext to cache...
(EngineCore_DP0 pid=10543) /usr/local/lib/python3.12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:161: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled.
(EngineCore_DP0 pid=10543) We recommend installing via `pip install torch-c-dlpack-ext`
(EngineCore_DP0 pid=10543) warnings.warn(
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] EngineCore failed to start.
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] Traceback (most recent call last):
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] super().__init__(
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] self._init_executor()
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] self.driver_worker.load_model()
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] self.model = model_loader.load_model(
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] self.load_weights(model, model_config)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 292, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] loaded_weights = model.load_weights(
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 336, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] return super().load_weights(weights)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 225, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] return orig_cls.load_weights(self, weights) # type: ignore
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 454, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 332, in load_weights
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 317, in _load_module
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] raise ValueError(msg)
(EngineCore_DP0 pid=10543) ERROR 12-05 04:20:59 [core.py:842] ValueError: There is no module or parameter named 'classifier' in TransformersForSequenceClassification
(EngineCore_DP0 pid=10543) Process EngineCore_DP0:
(EngineCore_DP0 pid=10543) Traceback (most recent call last):
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=10543) self.run()
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=10543) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=10543) raise e
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=10543) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=10543) super().__init__(
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=10543) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=10543) self._init_executor()
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=10543) self.driver_worker.load_model()
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=10543) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(EngineCore_DP0 pid=10543) self.model = model_loader.load_model(
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=10543) self.load_weights(model, model_config)
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 292, in load_weights
(EngineCore_DP0 pid=10543) loaded_weights = model.load_weights(
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 336, in load_weights
(EngineCore_DP0 pid=10543) return super().load_weights(weights)
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/adapters.py", line 225, in load_weights
(EngineCore_DP0 pid=10543) return orig_cls.load_weights(self, weights) # type: ignore
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 454, in load_weights
(EngineCore_DP0 pid=10543) return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 332, in load_weights
(EngineCore_DP0 pid=10543) autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=10543) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10543) File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 317, in _load_module
(EngineCore_DP0 pid=10543) raise ValueError(msg)
(EngineCore_DP0 pid=10543) ValueError: There is no module or parameter named 'classifier' in TransformersForSequenceClassification
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:02<?, ?it/s]
Alternatives
No response
Additional context
No response