docs/contributing/model/README.md (+1 -1)
@@ -1,7 +1,7 @@
# Summary

!!! important
- Many decoder language models can now be automatically loaded using the [Transformers backend](../../models/supported_models.md#transformers) without having to implement them in vLLM. See if `vllm serve <model>` works first!
+ Many decoder language models can now be automatically loaded using the [Transformers modeling backend](../../models/supported_models.md#transformers) without having to implement them in vLLM. See if `vllm serve <model>` works first!
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/README.md#compatibility-matrix) to optimize their performance.
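As a quick illustration of the "try `vllm serve <model>` first" advice in the changed line above, a minimal offline check might look like the sketch below. The model ID is a placeholder; running `vllm serve <model>` on the command line is the equivalent online check.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID purely for illustration; substitute your own.
llm = LLM(model="my-org/my-decoder-model")

# If this generates text without errors, the model already loads in vLLM
# (possibly via the Transformers modeling backend) and a dedicated vLLM
# implementation may not be needed.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```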
docs/deployment/frameworks/hf_inference_endpoints.md (+2 -2)
@@ -156,7 +156,7 @@ In this guide, we demonstrate manual deployment using the [`rednote-hilab/dots.o
## Advanced Deployment Details

- With the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications.
+ With the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications.
Hugging Face Inference Endpoints provides a fully managed environment for serving models via vLLM. You can deploy models without configuring servers, installing dependencies, or managing clusters. Endpoints also support deployment across multiple cloud providers (AWS, Azure, GCP) without the need for separate accounts.
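Once such an endpoint is running, it exposes vLLM's OpenAI-compatible API. The sketch below shows what a client query might look like; the endpoint URL, token, and served model name are placeholders, not values from this guide.

```python
from openai import OpenAI

# Placeholder endpoint URL and token; use the values shown for your own
# Inference Endpoints deployment.
client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1",
    api_key="hf_xxx",
)

response = client.chat.completions.create(
    model="<served-model-name>",  # placeholder for the deployed model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```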
@@ -167,4 +167,4 @@ The platform integrates seamlessly with the Hugging Face Hub, allowing you to de
- Explore the [Inference Endpoints](https://endpoints.huggingface.co/catalog) model catalog
- Read the Inference Endpoints [documentation](https://huggingface.co/docs/inference-endpoints/en/index)
- Learn about [Inference Endpoints engines](https://huggingface.co/docs/inference-endpoints/en/engines/vllm)
- - Understand the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
+ - Understand the [Transformers modeling backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
docs/models/supported_models.md (+13 -13)
@@ -15,17 +15,17 @@ These models are what we list in [supported text models](#list-of-text-only-lang
### Transformers

- vLLM also supports model implementations that are available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within <5% of the performance of a dedicated vLLM model implementation. We call this feature the "Transformers backend".
+ vLLM also supports model implementations that are available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within <5% of the performance of a dedicated vLLM model implementation. We call this feature the "Transformers modeling backend".

- Currently, the Transformers backend works for the following:
+ Currently, the Transformers modeling backend works for the following:
- Modalities: embedding models, language models and vision-language models*
- Attention types: full attention and/or sliding attention
_*Vision-language models currently accept only image inputs. Support for video inputs will be added in a future release._
- If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers backend, it will be compatible with the following features of vLLM:
+ If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers modeling backend, it will be compatible with the following features of vLLM:
- All the features listed in the [compatibility matrix](../features/README.md#feature-x-feature)
- Any combination of the following vLLM parallelisation schemes:

If the printed type starts with `Transformers...` then it's using the Transformers model implementation!
- If a model has a vLLM implementation but you would prefer to use the Transformers implementation via the Transformers backend, set `model_impl="transformers"` for [offline inference](../serving/offline_inference.md) or `--model-impl transformers` for the [online serving](../serving/openai_compatible_server.md).
+ If a model has a vLLM implementation but you would prefer to use the Transformers implementation via the Transformers modeling backend, set `model_impl="transformers"` for [offline inference](../serving/offline_inference.md) or `--model-impl transformers` for the [online serving](../serving/openai_compatible_server.md).

!!! note
For vision-language models, if you are loading with `dtype="auto"`, vLLM loads the whole model with config's `dtype` if it exists. In contrast the native Transformers will respect the `dtype` attribute of each backbone in the model. That might cause a slight difference in performance.
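The `model_impl` option and the printed-type check mentioned above can be combined in a short offline snippet. This is a sketch with a placeholder model ID, and it assumes `LLM.apply_model` is available (it is in recent vLLM releases).

```python
from vllm import LLM

# Placeholder model ID; pick a model that has both a vLLM and a
# Transformers implementation.
llm = LLM(model="my-org/my-model", model_impl="transformers")

# Print the runtime model class. If the class name starts with
# "Transformers...", the Transformers modeling backend is in use.
llm.apply_model(lambda model: print(type(model)))
```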
@@ -53,26 +53,26 @@ If a model has a vLLM implementation but you would prefer to use the Transformer
If a model is neither supported natively by vLLM nor Transformers, it can still be used in vLLM!

- For a model to be compatible with the Transformers backend for vLLM it must:
+ For a model to be compatible with the Transformers modeling backend for vLLM it must:
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
- The model directory must have the correct structure (e.g. `config.json` is present).
- `config.json` must contain `auto_map.AutoModel`.
- - be a Transformers backend for vLLM compatible model (see [Writing custom models](#writing-custom-models)):
+ - be a Transformers modeling backend for vLLM compatible model (see [Writing custom models](#writing-custom-models)):
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
If the compatible model is:
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
+ This means that, with the Transformers modeling backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
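To make the two options above concrete, here is a hypothetical offline-inference sketch (the Hub ID and local path are placeholders); the online-serving equivalents are `vllm serve <model> --trust-remote-code` and `vllm serve <MODEL_DIR>`.

```python
from vllm import LLM

# Custom model on the Hugging Face Hub (placeholder ID). trust_remote_code=True
# lets Transformers load the modeling code referenced by auto_map in config.json.
llm = LLM(model="my-org/my-custom-model", trust_remote_code=True)

# For a model stored in a local directory, point at the path instead, e.g.:
# llm = LLM(model="/path/to/my-custom-model")

print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```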
#### Writing custom models
- This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
+ This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers modeling backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
- To make your model compatible with the Transformers backend, it needs:
+ To make your model compatible with the Transformers modeling backend, it needs:
1. `kwargs` passed down through all modules from `MyModel` to `MyAttention`.
- If your model is encoder-only:
@@ -134,7 +134,7 @@ Here is what happens in the background when this model is loaded:
1. The config is loaded.
2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
- 3. `MyModel` is loaded into one of the Transformers backend classes in [vllm/model_executor/models/transformers](../../vllm/model_executor/models/transformers) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
+ 3. `MyModel` is loaded into one of the Transformers modeling backend classes in [vllm/model_executor/models/transformers](../../vllm/model_executor/models/transformers) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
That's it!
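As a rough illustration of requirement 1 above (`kwargs` passed down from `MyModel` to `MyAttention`), a compatible custom model might be wired along these lines. This is a simplified, hypothetical sketch that omits the other requirements and the real attention math.

```python
import torch
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel


class MyAttention(nn.Module):
    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # A real model computes attention here; what matters for the backend
        # is that **kwargs reaches this module untouched, so vLLM can inject
        # whatever its attention layer needs.
        return hidden_states


class MyLayer(nn.Module):
    def __init__(self, config: PretrainedConfig):
        super().__init__()
        self.attn = MyAttention()

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        return self.attn(hidden_states, **kwargs)


class MyModel(PreTrainedModel):
    config_class = PretrainedConfig

    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        # Two layers for illustration only; a real model reads this from config.
        self.layers = nn.ModuleList([MyLayer(config) for _ in range(2)])

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # Customisation lives in the base model, and kwargs flow all the way
        # down from MyModel to MyAttention, as required above.
        for layer in self.layers:
            hidden_states = layer(hidden_states, **kwargs)
        return hidden_states
```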
@@ -182,7 +182,7 @@ To determine whether a given model is natively supported, you can check the `con
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
- The [Transformers backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
+ The [Transformers modeling backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
!!! tip
The easiest way to check if your model is really supported at runtime is to run the program below:
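The program referred to above falls outside this diff's context, but a minimal runtime check might look like the following sketch (placeholder model ID; a generative model is assumed, and a pooling model would use the corresponding pooling API instead).

```python
from vllm import LLM

try:
    # Placeholder model ID; replace with the model you want to check.
    llm = LLM(model="my-org/my-model")
    outputs = llm.generate(["Hello, my name is"])
    print(outputs[0].outputs[0].text)
except Exception as exc:
    # Unsupported architectures typically fail here, at load time.
    print(f"Model does not appear to be supported: {exc}")
```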
@@ -451,7 +451,7 @@ th {
|`Zamba2ForCausalLM`| Zamba2 |`Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. |||
- Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
+ Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
| Architecture | Models | Example HF Models |[LoRA](../features/lora.md)|[PP](../serving/parallelism_scaling.md)|
- Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
+ Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!