
Commit 5f3cd7f

[Docs] Update the name of Transformers backend -> Transformers modeling backend (vllm-project#28725)
Signed-off-by: Harry Mellor <[email protected]>
1 parent: c934cae

16 files changed (+46, -43 lines)


.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
 /tests/v1/kv_connector @ApostaC
 /tests/v1/offloading @ApostaC

-# Transformers backend
+# Transformers modeling backend
 /vllm/model_executor/models/transformers @hmellor
 /tests/models/test_transformers.py @hmellor

docs/contributing/model/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Summary

 !!! important
-    Many decoder language models can now be automatically loaded using the [Transformers backend](../../models/supported_models.md#transformers) without having to implement them in vLLM. See if `vllm serve <model>` works first!
+    Many decoder language models can now be automatically loaded using the [Transformers modeling backend](../../models/supported_models.md#transformers) without having to implement them in vLLM. See if `vllm serve <model>` works first!

 vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/README.md#compatibility-matrix) to optimize their performance.

docs/deployment/frameworks/hf_inference_endpoints.md

Lines changed: 2 additions & 2 deletions
@@ -156,7 +156,7 @@ In this guide, we demonstrate manual deployment using the [`rednote-hilab/dots.o

 ## Advanced Deployment Details

-With the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications.
+With the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications.

 Hugging Face Inference Endpoints provides a fully managed environment for serving models via vLLM. You can deploy models without configuring servers, installing dependencies, or managing clusters. Endpoints also support deployment across multiple cloud providers (AWS, Azure, GCP) without the need for separate accounts.

@@ -167,4 +167,4 @@ The platform integrates seamlessly with the Hugging Face Hub, allowing you to de
 - Explore the [Inference Endpoints](https://endpoints.huggingface.co/catalog) model catalog
 - Read the Inference Endpoints [documentation](https://huggingface.co/docs/inference-endpoints/en/index)
 - Learn about [Inference Endpoints engines](https://huggingface.co/docs/inference-endpoints/en/engines/vllm)
-- Understand the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
+- Understand the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)

docs/models/supported_models.md

Lines changed: 13 additions & 13 deletions
@@ -15,17 +15,17 @@ These models are what we list in [supported text models](#list-of-text-only-lang

 ### Transformers

-vLLM also supports model implementations that are available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within <5% of the performance of a dedicated vLLM model implementation. We call this feature the "Transformers backend".
+vLLM also supports model implementations that are available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within <5% of the performance of a dedicated vLLM model implementation. We call this feature the "Transformers modeling backend".

-Currently, the Transformers backend works for the following:
+Currently, the Transformers modeling backend works for the following:

 - Modalities: embedding models, language models and vision-language models*
 - Architectures: encoder-only, decoder-only, mixture-of-experts
 - Attention types: full attention and/or sliding attention

 _*Vision-language models currently accept only image inputs. Support for video inputs will be added in a future release._

-If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers backend, it will be compatible with the following features of vLLM:
+If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers modeling backend, it will be compatible with the following features of vLLM:

 - All the features listed in the [compatibility matrix](../features/README.md#feature-x-feature)
 - Any combination of the following vLLM parallelisation schemes:
@@ -44,7 +44,7 @@ llm.apply_model(lambda model: print(type(model)))

 If the printed type starts with `Transformers...` then it's using the Transformers model implementation!

-If a model has a vLLM implementation but you would prefer to use the Transformers implementation via the Transformers backend, set `model_impl="transformers"` for [offline inference](../serving/offline_inference.md) or `--model-impl transformers` for the [online serving](../serving/openai_compatible_server.md).
+If a model has a vLLM implementation but you would prefer to use the Transformers implementation via the Transformers modeling backend, set `model_impl="transformers"` for [offline inference](../serving/offline_inference.md) or `--model-impl transformers` for the [online serving](../serving/openai_compatible_server.md).

 !!! note
     For vision-language models, if you are loading with `dtype="auto"`, vLLM loads the whole model with config's `dtype` if it exists. In contrast the native Transformers will respect the `dtype` attribute of each backbone in the model. That might cause a slight difference in performance.
@@ -53,26 +53,26 @@ If a model has a vLLM implementation but you would prefer to use the Transformer

 If a model is neither supported natively by vLLM nor Transformers, it can still be used in vLLM!

-For a model to be compatible with the Transformers backend for vLLM it must:
+For a model to be compatible with the Transformers modeling backend for vLLM it must:

 - be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
     - The model directory must have the correct structure (e.g. `config.json` is present).
     - `config.json` must contain `auto_map.AutoModel`.
-- be a Transformers backend for vLLM compatible model (see [Writing custom models](#writing-custom-models)):
+- be a Transformers modeling backend for vLLM compatible model (see [Writing custom models](#writing-custom-models)):
     - Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).

 If the compatible model is:

 - on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
 - in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).

-This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
+This means that, with the Transformers modeling backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!

 #### Writing custom models

-This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
+This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers modeling backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).

-To make your model compatible with the Transformers backend, it needs:
+To make your model compatible with the Transformers modeling backend, it needs:

 1. `kwargs` passed down through all modules from `MyModel` to `MyAttention`.
     - If your model is encoder-only:
@@ -134,7 +134,7 @@ Here is what happens in the background when this model is loaded:

 1. The config is loaded.
 2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
-3. `MyModel` is loaded into one of the Transformers backend classes in [vllm/model_executor/models/transformers](../../vllm/model_executor/models/transformers) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
+3. `MyModel` is loaded into one of the Transformers modeling backend classes in [vllm/model_executor/models/transformers](../../vllm/model_executor/models/transformers) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.

 That's it!

@@ -182,7 +182,7 @@ To determine whether a given model is natively supported, you can check the `con
 If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.

 Models do not _need_ to be natively supported to be used in vLLM.
-The [Transformers backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
+The [Transformers modeling backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).

 !!! tip
     The easiest way to check if your model is really supported at runtime is to run the program below:
@@ -451,7 +451,7 @@ th {
 | `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | |
 | `LongcatFlashForCausalLM` | LongCat-Flash | `meituan-longcat/LongCat-Flash-Chat`, `meituan-longcat/LongCat-Flash-Chat-FP8` | ✅︎ | ✅︎ |

-Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
+Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!

 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
 |--------------|--------|-------------------|----------------------|---------------------------|
@@ -720,7 +720,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `TarsierForConditionalGeneration` | Tarsier | T + I<sup>E+</sup> | `omni-search/Tarsier-7b`, `omni-search/Tarsier-34b` | | ✅︎ |
 | `Tarsier2ForConditionalGeneration`<sup>^</sup> | Tarsier2 | T + I<sup>E+</sup> + V<sup>E+</sup> | `omni-research/Tarsier2-Recap-7b`, `omni-research/Tarsier2-7b-0115` | | ✅︎ |

-Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
+Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!

 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
 |--------------|--------|--------|-------------------|-----------------------------|-----------------------------------------|

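For quick reference, a minimal offline-inference sketch of the options documented in this file: forcing the Transformers modeling backend with `model_impl="transformers"` and checking which implementation was loaded via `apply_model`. The model names below are illustrative, not taken from the diff.

```python
from vllm import LLM

# Force the Transformers modeling backend even if a native vLLM
# implementation exists (illustrative model name).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")

# If the printed class name starts with "Transformers...", the Transformers
# modeling backend is in use.
llm.apply_model(lambda model: print(type(model)))

# For a Transformers-compatible custom model on the Hub, remote code must be
# trusted so its auto_map classes can be imported (hypothetical repo name):
# llm = LLM(model="my-org/my-custom-model", trust_remote_code=True)
```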
tests/models/test_transformers.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""Test the functionality of the Transformers backend."""
+"""Test the functionality of the Transformers modeling backend."""

 from typing import Any

@@ -85,7 +85,7 @@ def test_models(
     required = Version("5.0.0.dev")
     if model == "allenai/OLMoE-1B-7B-0924" and installed < required:
         pytest.skip(
-            "MoE models with the Transformers backend require "
+            "MoE models with the Transformers modeling backend require "
             f"transformers>={required}, but got {installed}"
         )

vllm/config/model.py

Lines changed: 4 additions & 4 deletions
@@ -732,7 +732,7 @@ def validate_model_config_after(self: "ModelConfig") -> "ModelConfig":
         return self

     def _get_transformers_backend_cls(self) -> str:
-        """Determine which Transformers backend class will be used if
+        """Determine which Transformers modeling backend class will be used if
         `model_impl` is set to `transformers` or `auto`."""
         cls = "Transformers"
         # If 'hf_config != hf_text_config' it's a nested config, i.e. multimodal
@@ -746,8 +746,8 @@ def _get_transformers_backend_cls(self) -> str:
         # User specified value take precedence
         if self.runner != "auto":
             runner = self.runner
-        # Only consider Transformers backend pooling classes if we're wrapping an
-        # architecture that defaults to pooling. Otherwise, we return the LM class
+        # Only consider Transformers modeling backend pooling classes if we're wrapping
+        # an architecture that defaults to pooling. Otherwise, we return the LM class
         # and use adapters.
         if runner == "pooling" and task in {"embed", "classify"}:
             if task == "embed":
@@ -759,7 +759,7 @@ def _get_transformers_backend_cls(self) -> str:
         return cls

     def using_transformers_backend(self) -> bool:
-        """Check if the model is using the Transformers backend class."""
+        """Check if the model is using the Transformers modeling backend class."""
         used_cls = self._model_info.architecture
         transformers_backend_cls = self._get_transformers_backend_cls()
         return used_cls == transformers_backend_cls

vllm/lora/layers/base_linear.py

Lines changed: 1 addition & 1 deletion
@@ -121,7 +121,7 @@ def set_lora(
     def apply(self, x: torch.Tensor, bias: torch.Tensor | None = None) -> torch.Tensor:
         output = self.base_layer.quant_method.apply(self.base_layer, x, bias)

-        # In transformers backend, x and output have extra batch dimension like
+        # In Transformers modeling backend, x and output have extra batch dimension like
         # (1, seq_len, hidden_dim), while punica expects (seq_len, hidden_dim),
         # therefore we need to flatten the batch dimensions.
         if x.ndim == 3 and output.ndim == 3:

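A standalone illustration of the shape handling this comment describes; the tensor sizes are made up and this is not the vLLM LoRA kernel itself:

```python
import torch

# Transformers modeling backend hands the LoRA layer (1, seq_len, hidden_dim);
# punica kernels expect (seq_len, hidden_dim).
x = torch.randn(1, 8, 4096)
output = torch.randn(1, 8, 4096)

if x.ndim == 3 and output.ndim == 3:
    x_flat = x.flatten(0, 1)            # (seq_len, hidden_dim)
    output_flat = output.flatten(0, 1)  # (seq_len, hidden_dim)
    # ... punica LoRA kernels would operate on the flattened tensors here ...
    output = output_flat.view_as(output)  # restore (1, seq_len, hidden_dim)

print(x_flat.shape, output.shape)  # torch.Size([8, 4096]) torch.Size([1, 8, 4096])
```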
vllm/model_executor/models/adapters.py

Lines changed: 2 additions & 2 deletions
@@ -429,7 +429,7 @@ def load_weights_using_from_2_way_softmax(
     if text_config.tie_word_embeddings:
         # embed_tokens is the assumed name for input embeddings. If the model does not
         # have this attribute, we fallback to get_input_embeddings(), which is used by
-        # the Transformers backend.
+        # the Transformers modeling backend.
         embed_tokens = (
             model.model.embed_tokens
             if hasattr(model.model, "embed_tokens")
@@ -487,7 +487,7 @@ def load_weights_no_post_processing(model, weights: Iterable[tuple[str, torch.Te
     if text_config.tie_word_embeddings:
         # embed_tokens is the assumed name for input embeddings. If the model does not
         # have this attribute, we fallback to get_input_embeddings(), which is used by
-        # the Transformers backend.
+        # the Transformers modeling backend.
         embed_tokens = (
             model.model.embed_tokens
             if hasattr(model.model, "embed_tokens")

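The fallback these comments describe, written as a hedged standalone helper; the `else` branch follows the comment's mention of `get_input_embeddings()` and is not shown in the hunks themselves:

```python
import torch.nn as nn

def resolve_embed_tokens(model: nn.Module) -> nn.Module:
    """Return the input-embedding module, preferring the conventional
    `embed_tokens` attribute and falling back to `get_input_embeddings()`,
    which is the path used by the Transformers modeling backend."""
    inner = model.model
    return (
        inner.embed_tokens
        if hasattr(inner, "embed_tokens")
        else inner.get_input_embeddings()
    )
```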
vllm/model_executor/models/transformers/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -120,8 +120,8 @@ def __getattr__(name: str):
     """Handle imports of non-existent classes with a helpful error message."""
     if name not in globals():
         raise AttributeError(
-            "The Transformers backend does not currently have a class to handle "
-            f"the requested model type: {name}. Please open an issue at "
+            "The Transformers modeling backend does not currently have a class to "
+            f"handle the requested model type: {name}. Please open an issue at "
             "https://github.com/vllm-project/vllm/issues/new"
         )
     return globals()[name]

vllm/model_executor/models/transformers/base.py

Lines changed: 5 additions & 4 deletions
@@ -14,7 +14,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Transformers backend base class."""
+"""Transformers modeling backend base class."""

 from collections.abc import Iterable
 from typing import TYPE_CHECKING
@@ -118,7 +118,7 @@ def __init_subclass__(cls, *args, **kwargs):

     def __init__(self, *, vllm_config: "VllmConfig", prefix: str = ""):
         super().__init__()
-        logger.info("Using Transformers backend.")
+        logger.info("Using Transformers modeling backend.")

         self.config = vllm_config.model_config.hf_config
         self.text_config = self.config.get_text_config()
@@ -147,7 +147,8 @@ def __init__(self, *, vllm_config: "VllmConfig", prefix: str = ""):
         # Check for unsupported quantization methods.
         if quant_method_name == "mxfp4":
             raise NotImplementedError(
-                "Transformers backend does not support MXFP4 quantization yet."
+                "Transformers modeling backend does "
+                "not support MXFP4 quantization yet."
             )
         # Skip loading extra bias for GPTQ models.
         if "gptq" in quant_method_name:
@@ -458,6 +459,6 @@ def check_version(min_version: str, feature: str):
     required = Version(min_version)
     if installed < required:
         raise ImportError(
-            f"Transformers backend requires transformers>={required} "
+            f"Transformers modeling backend requires transformers>={required} "
             f"for {feature}, but got {installed}"
         )

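A self-contained sketch of the version gate that `check_version` implements; reading the installed version from package metadata is an assumption, since the hunk does not show how `installed` is obtained:

```python
from importlib.metadata import version
from packaging.version import Version

def check_version(min_version: str, feature: str) -> None:
    # Raise if the installed transformers release is older than required.
    installed = Version(version("transformers"))
    required = Version(min_version)
    if installed < required:
        raise ImportError(
            f"Transformers modeling backend requires transformers>={required} "
            f"for {feature}, but got {installed}"
        )

# Example gate, mirroring the MoE requirement from tests/models/test_transformers.py:
try:
    check_version("5.0.0.dev", "MoE models")
except ImportError as e:
    print(e)
```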