
Commit c680b7b

Merge branch 'main' into fix/validate-tool-requests-29432
2 parents 7909101 + 29f7d97 commit c680b7b

File tree: 26 files changed, +335 -213 lines changed


docs/deployment/docker.md

Lines changed: 20 additions & 1 deletion
@@ -82,7 +82,7 @@ DOCKER_BUILDKIT=1 docker build . \
 
 ## Building for Arm64/aarch64
 
-A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
+A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper and Grace-Blackwell. Using the flag `--platform "linux/arm64"` will build for arm64.
 
 !!! note
     Multiple modules must be compiled, so this process can take a while. Recommend using `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
@@ -104,6 +104,25 @@ A docker container can be built for aarch64 systems such as the Nvidia Grace-Hop
         --build-arg RUN_WHEEL_CHECK=false
     ```
 
+For (G)B300, we recommend using CUDA 13, as shown in the following command.
+
+??? console "Command"
+
+    ```bash
+    DOCKER_BUILDKIT=1 docker build \
+        --build-arg CUDA_VERSION=13.0.1 \
+        --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
+        --build-arg max_jobs=256 \
+        --build-arg nvcc_threads=2 \
+        --build-arg RUN_WHEEL_CHECK=false \
+        --build-arg torch_cuda_arch_list='9.0 10.0+PTX' \
+        --platform "linux/arm64" \
+        --tag vllm/vllm-gb300-openai:latest \
+        --target vllm-openai \
+        -f docker/Dockerfile \
+        .
+    ```
+
 !!! note
     If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.

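To illustrate the QEMU requirement in the note above, here is a minimal host-side sketch. It assumes the common `qemu-aarch64` binfmt_misc entry name, which depends on how QEMU emulation was installed on your machine, so treat the path as an assumption rather than a guarantee.

```python
from pathlib import Path

# Hypothetical pre-flight check before cross-building linux/arm64 on x86_64:
# binfmt_misc should expose an enabled QEMU aarch64 handler. The entry name
# "qemu-aarch64" is the usual one but can differ per setup.
entry = Path("/proc/sys/fs/binfmt_misc/qemu-aarch64")

if entry.exists() and "enabled" in entry.read_text():
    print("QEMU aarch64 emulation is registered; cross-building should work.")
else:
    print("QEMU aarch64 emulation not found; set up binfmt/QEMU before building.")
```
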
docs/serving/data_parallel_deployment.md

Lines changed: 2 additions & 2 deletions
@@ -8,11 +8,11 @@ For MoE models, particularly those like DeepSeek that employ MLA (Multi-head Lat
 
 In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even when there are fewer requests to be processed than DP ranks.
 
-The expert layers will by default form a (DP x TP) sized tensor parallel group. To enable expert parallelism, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case).
+By default, expert layers form a tensor parallel group of size `DP × TP`. To use expert parallelism instead, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case). See [Expert Parallel Deployment](expert_parallel_deployment.md) for details on how attention and expert layers behave differently with EP enabled.
 
 In vLLM, each DP rank is deployed as a separate "core engine" process that communicates with front-end process(es) via ZMQ sockets. Data Parallel attention can be combined with Tensor Parallel attention, in which case each DP engine owns a number of per-GPU worker processes equal to the configured TP size.
 
-For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
+For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form a group of size `DP × TP` (using either tensor parallelism by default, or expert parallelism if `--enable-expert-parallel` is set).
 
 In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.

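A conceptual sketch of the "dummy" forward-pass idea from the paragraph above (this is not vLLM's actual DP Coordinator; it only illustrates why idle ranks still step so that the synchronized expert layers see the same number of collective calls on every rank):

```python
# Conceptual sketch only: when any DP rank has work, every rank must run a
# forward pass; ranks with no scheduled requests run a "dummy" batch instead.

def dp_step(per_rank_batches: list[list[str]]) -> list[str]:
    any_work = any(per_rank_batches)  # in practice decided via a collective
    actions = []
    for rank, batch in enumerate(per_rank_batches):
        if not any_work:
            actions.append(f"rank {rank}: idle, no forward pass (all ranks paused)")
        elif batch:
            actions.append(f"rank {rank}: forward pass on {len(batch)} request(s)")
        else:
            actions.append(f"rank {rank}: dummy forward pass to keep collectives aligned")
    return actions


for line in dp_step([["req-1", "req-2"], [], ["req-3"], []]):
    print(line)
```
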
docs/serving/expert_parallel_deployment.md

Lines changed: 21 additions & 1 deletion
@@ -44,7 +44,27 @@ Where:
 - `DP_SIZE`: Data parallel size
 - `EP_SIZE`: Expert parallel size (computed automatically)
 
-When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
+### Layer Behavior with EP Enabled
+
+When EP is enabled, different layers in MoE models behave differently:
+
+| Layer Type | Behavior | Parallelism Used |
+|------------|----------|------------------|
+| **Expert (MoE) Layers** | Sharded across all EP ranks | Expert Parallel (EP) of size `TP × DP` |
+| **Attention Layers** | Behavior depends on TP size | See below |
+
+**Attention layer parallelism:**
+
+- **When `TP = 1`**: Attention weights are **replicated** across all DP ranks (data parallelism)
+- **When `TP > 1`**: Attention weights are **sharded** using tensor parallelism across TP ranks within each DP group
+
+For example, with `TP=2, DP=4` (8 GPUs total):
+
+- Expert layers form an EP group of size 8, with experts distributed across all GPUs
+- Attention layers use TP=2 within each of the 4 DP groups
+
+!!! note "Key Difference from Data Parallel Deployment"
+    Without `--enable-expert-parallel`, MoE layers would use tensor parallelism (forming a TP group of size `TP × DP`), similar to dense models. With EP enabled, expert layers switch to expert parallelism, which can provide better efficiency and locality for MoE models.
 
 ### Example Command

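To make the `TP=2, DP=4` illustration above concrete, here is a toy Python sketch of splitting experts across an EP group of size `TP × DP`. The contiguous split and the expert count are assumptions for illustration; vLLM's actual expert placement may differ.

```python
# Toy illustration: the EP group spans all TP * DP = 8 GPUs, and experts are
# split evenly across it. The contiguous split below is illustrative only.

TP, DP, NUM_EXPERTS = 2, 4, 64
ep_size = TP * DP

experts_per_rank = NUM_EXPERTS // ep_size
for ep_rank in range(ep_size):
    start = ep_rank * experts_per_rank
    end = start + experts_per_rank
    dp_group, tp_rank = divmod(ep_rank, TP)
    print(
        f"GPU {ep_rank} (DP group {dp_group}, TP rank {tp_rank}): "
        f"experts {start}..{end - 1}"
    )
```
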
docs/serving/parallelism_scaling.md

Lines changed: 23 additions & 1 deletion
@@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul
 
 ### What is Ray?
 
-Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine.
+Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
 
 vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
 
@@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
     --distributed-executor-backend ray
 ```
 
+### Running vLLM with Multiprocessing
+
+Besides Ray, multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here's an example that deploys a model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
+
+Choose one node as the head node and run:
+
+```bash
+vllm serve /path/to/the/model/in/the/container \
+    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
+    --nnodes 2 --node-rank 0 \
+    --master-addr <HEAD_NODE_IP>
+```
+
+On the other worker node, run:
+
+```bash
+vllm serve /path/to/the/model/in/the/container \
+    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
+    --nnodes 2 --node-rank 1 \
+    --master-addr <HEAD_NODE_IP> --headless
+```
+
 ## Optimizing network communication for tensor parallelism
 
 Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.

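A small sanity-check sketch for the 2-node example in the added docs above: the world size must equal `TP × PP`, and with 8 GPUs per node each node naturally hosts one pipeline stage. The rank-to-node mapping shown here is illustrative only; the actual placement is decided by vLLM.

```python
# Illustrative check for the tp_size=8, pp_size=2, 2-node example.
TP, PP = 8, 2
NNODES, GPUS_PER_NODE = 2, 8
assert TP * PP == NNODES * GPUS_PER_NODE

for global_rank in range(TP * PP):
    pp_stage, tp_rank = divmod(global_rank, TP)
    node = global_rank // GPUS_PER_NODE
    print(f"rank {global_rank}: node {node}, pipeline stage {pp_stage}, TP rank {tp_rank}")
```
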
tests/test_inputs.py

Lines changed: 9 additions & 2 deletions
@@ -7,7 +7,7 @@
 from vllm.inputs import zip_enc_dec_prompts
 from vllm.inputs.parse import parse_raw_prompts
 from vllm.inputs.preprocess import InputPreprocessor
-from vllm.tokenizers import init_tokenizer_from_config
+from vllm.tokenizers import cached_tokenizer_from_config
 
 pytestmark = pytest.mark.cpu_test
 
@@ -34,6 +34,13 @@
 ]
 
 
+# Test that a nested mixed-type list of lists raises a TypeError.
+@pytest.mark.parametrize("invalid_input", [[[1, 2], ["foo", "bar"]]])
+def test_invalid_input_raise_type_error(invalid_input):
+    with pytest.raises(TypeError):
+        parse_raw_prompts(invalid_input)
+
+
 def test_parse_raw_single_batch_empty():
     with pytest.raises(ValueError, match="at least one prompt"):
         parse_raw_prompts([])
@@ -108,7 +115,7 @@ def test_zip_enc_dec_prompts(mm_processor_kwargs, expected_mm_kwargs):
 )
 def test_preprocessor_always_mm_code_path(model_id, prompt):
     model_config = ModelConfig(model=model_id)
-    tokenizer = init_tokenizer_from_config(model_config)
+    tokenizer = cached_tokenizer_from_config(model_config)
     input_preprocessor = InputPreprocessor(model_config, tokenizer)
 
     # HF processor adds sep token

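The new test above exercises the invariant that a batch of token-ID prompts must contain ints only. The following toy validator (a hypothetical helper, not vLLM's `parse_raw_prompts`) shows why the nested mixed-type input is rejected:

```python
# Toy validator illustrating the invariant behind the new test: each token-ID
# prompt in a batch must be a list of ints, never a mix with strings.

def validate_token_batches(batches: list[list[object]]) -> None:
    for i, batch in enumerate(batches):
        if not all(isinstance(tok, int) for tok in batch):
            raise TypeError(f"prompt {i} mixes token IDs with non-int values: {batch!r}")


validate_token_batches([[1, 2], [3, 4]])  # fine
try:
    validate_token_batches([[1, 2], ["foo", "bar"]])  # the case from the test
except TypeError as exc:
    print(exc)
```
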
tests/tokenizers_/test_basic.py

Lines changed: 24 additions & 23 deletions
@@ -3,38 +3,39 @@
 from typing import _get_protocol_attrs  # type: ignore
 
 import pytest
-from transformers import PreTrainedTokenizerBase
+from transformers import (
+    PreTrainedTokenizer,
+    PreTrainedTokenizerBase,
+    PreTrainedTokenizerFast,
+)
 
 from vllm.tokenizers import TokenizerLike, get_tokenizer
+from vllm.tokenizers.mistral import MistralTokenizer
 
 
 def _get_missing_attrs(obj: object, target: type):
     return [k for k in _get_protocol_attrs(target) if not hasattr(obj, k)]
 
 
+def _assert_tokenizer_like(tokenizer: object):
+    missing_attrs = _get_missing_attrs(tokenizer, TokenizerLike)
+    assert not missing_attrs, f"Missing attrs: {missing_attrs}"
+
+
 def test_tokenizer_like_protocol():
-    assert not (
-        missing_attrs := _get_missing_attrs(
-            get_tokenizer("gpt2", use_fast=False),
-            TokenizerLike,
-        )
-    ), f"Missing attrs: {missing_attrs}"
-
-    assert not (
-        missing_attrs := _get_missing_attrs(
-            get_tokenizer("gpt2", use_fast=True),
-            TokenizerLike,
-        )
-    ), f"Missing attrs: {missing_attrs}"
-
-    assert not (
-        missing_attrs := _get_missing_attrs(
-            get_tokenizer(
-                "mistralai/Mistral-7B-Instruct-v0.3", tokenizer_mode="mistral"
-            ),
-            TokenizerLike,
-        )
-    ), f"Missing attrs: {missing_attrs}"
+    tokenizer = get_tokenizer("gpt2", use_fast=False)
+    assert isinstance(tokenizer, PreTrainedTokenizer)
+    _assert_tokenizer_like(tokenizer)
+
+    tokenizer = get_tokenizer("gpt2", use_fast=True)
+    assert isinstance(tokenizer, PreTrainedTokenizerFast)
+    _assert_tokenizer_like(tokenizer)
+
+    tokenizer = get_tokenizer(
+        "mistralai/Mistral-7B-Instruct-v0.3", tokenizer_mode="mistral"
+    )
+    assert isinstance(tokenizer, MistralTokenizer)
+    _assert_tokenizer_like(tokenizer)
 
 
 @pytest.mark.parametrize("tokenizer_name", ["facebook/opt-125m", "gpt2"])

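The helper above checks structural conformance to the `TokenizerLike` protocol by listing missing attributes. A related standard-library pattern, shown in this self-contained sketch with toy names, uses `runtime_checkable` so `isinstance()` performs the same kind of attribute-presence check:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsEncode(Protocol):
    """Toy stand-in for a tokenizer-like protocol (hypothetical)."""

    def encode(self, text: str) -> list[int]: ...


class ToyTokenizer:
    def encode(self, text: str) -> list[int]:
        # Not a real tokenizer: map each character to its code point.
        return [ord(c) for c in text]


# isinstance() against a runtime_checkable Protocol only verifies that the
# required attributes exist, much like the _get_missing_attrs helper above.
assert isinstance(ToyTokenizer(), SupportsEncode)
```
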
tests/tokenizers_/test_registry.py

Lines changed: 21 additions & 2 deletions
@@ -2,7 +2,14 @@
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 from pathlib import Path
 
-from vllm.tokenizers import TokenizerLike, TokenizerRegistry, get_tokenizer
+import pytest
+
+from vllm.tokenizers import TokenizerLike
+from vllm.tokenizers.registry import (
+    TokenizerRegistry,
+    get_tokenizer,
+    resolve_tokenizer_args,
+)
 
 
 class TestTokenizer(TokenizerLike):
@@ -40,10 +47,22 @@ def is_fast(self) -> bool:
         return True
 
 
+@pytest.mark.parametrize("runner_type", ["generate", "pooling"])
+def test_resolve_tokenizer_args_idempotent(runner_type):
+    tokenizer_mode, tokenizer_name, args, kwargs = resolve_tokenizer_args(
+        "facebook/opt-125m",
+        runner_type=runner_type,
+    )
+
+    assert (tokenizer_mode, tokenizer_name, args, kwargs) == resolve_tokenizer_args(
+        tokenizer_name, *args, **kwargs
+    )
+
+
 def test_customized_tokenizer():
     TokenizerRegistry.register("test_tokenizer", __name__, TestTokenizer.__name__)
 
-    tokenizer = TokenizerRegistry.get_tokenizer("test_tokenizer", "abc")
+    tokenizer = TokenizerRegistry.load_tokenizer("test_tokenizer", "abc")
     assert isinstance(tokenizer, TestTokenizer)
     assert tokenizer.path_or_repo_id == "abc"
     assert tokenizer.bos_token_id == 0

vllm/compilation/decorators.py

Lines changed: 22 additions & 3 deletions
@@ -28,7 +28,7 @@
 from vllm.logger import init_logger
 from vllm.sequence import IntermediateTensors
 from vllm.utils.import_utils import resolve_obj_by_qualname
-from vllm.utils.torch_utils import supports_dynamo
+from vllm.utils.torch_utils import is_torch_equal_or_newer, supports_dynamo
 
 from .monitor import start_monitoring_torch_compile
 
@@ -316,7 +316,13 @@ def __init__(
     def _mark_dynamic_inputs(mod, type, *args, **kwargs):
         def mark_dynamic(arg, dims):
             if type == DynamicShapesType.UNBACKED:
-                torch._dynamo.decorators.mark_unbacked(arg, dims)
+                if is_torch_equal_or_newer("2.10.0.dev"):
+                    for dim in dims:
+                        torch._dynamo.decorators.mark_unbacked(
+                            arg, dim, hint_override=arg.size()[dim]
+                        )
+                else:
+                    torch._dynamo.decorators.mark_unbacked(arg, dims)
             else:
                 torch._dynamo.mark_dynamic(arg, dims)
 
@@ -350,7 +356,13 @@ def mark_dynamic(arg, dims):
             if isinstance(arg, torch.Tensor):
                 # In case dims is specified with negative indexing
                 dims = [arg.ndim + dim if dim < 0 else dim for dim in dims]
-                torch._dynamo.decorators.mark_unbacked(arg, dims)
+                if is_torch_equal_or_newer("2.10.0.dev"):
+                    for dim in dims:
+                        torch._dynamo.decorators.mark_unbacked(
+                            arg, dim, hint_override=arg.size()[dim]
+                        )
+                else:
+                    torch._dynamo.decorators.mark_unbacked(arg, dims)
 
     def __call__(self, *args, **kwargs):
         # torch.compiler.is_compiling() means we are inside the compilation
@@ -488,6 +500,12 @@ def patched_inline_call(self_):
         if ds_type == DynamicShapesType.BACKED_SIZE_OBLIVIOUS:
             fx_config_patches["backed_size_oblivious"] = True
 
+        # Prepare inductor config patches
+        # assume_32bit_indexing is only available in torch 2.10.0.dev+
+        inductor_config_patches = {}
+        if is_torch_equal_or_newer("2.10.0.dev"):
+            inductor_config_patches["assume_32bit_indexing"] = True
+
         with (
             patch.object(
                 InliningInstructionTranslator, "inline_call_", patched_inline_call
@@ -496,6 +514,7 @@ def patched_inline_call(self_):
             maybe_use_cudagraph_partition_wrapper(self.vllm_config),
             torch.fx.experimental._config.patch(**fx_config_patches),
             _torch27_patch_tensor_subclasses(),
+            torch._inductor.config.patch(**inductor_config_patches),
         ):
            if envs.VLLM_USE_AOT_COMPILE:
                self.aot_compiled_fn = self.aot_compile(*args, **kwargs)

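Both hunks above gate newer torch behavior on `is_torch_equal_or_newer("2.10.0.dev")`. A minimal sketch of that version-gating idea, using `packaging` to compare PEP 440 strings (this mirrors the spirit of the helper, not necessarily its exact implementation; the keyword name in the comment is hypothetical):

```python
from packaging import version


def is_equal_or_newer(installed: str, minimum: str) -> bool:
    """Compare PEP 440 version strings, treating dev builds correctly."""
    return version.parse(installed) >= version.parse(minimum)


# A 2.10 dev build qualifies for a 2.10-only keyword argument; 2.9.x does not,
# so the caller falls back to the older call signature.
assert is_equal_or_newer("2.10.0.dev20250101", "2.10.0.dev")
assert not is_equal_or_newer("2.9.1", "2.10.0.dev")
```
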
vllm/distributed/kv_transfer/kv_connector/v1/metrics.py

Lines changed: 0 additions & 3 deletions
@@ -7,7 +7,6 @@
 
 from vllm.config import KVTransferConfig, VllmConfig
 from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory
-from vllm.distributed.kv_transfer.kv_transfer_state import has_kv_transfer_group
 from vllm.logger import init_logger
 
 PromMetric: TypeAlias = Gauge | Counter | Histogram
@@ -53,8 +52,6 @@ def is_empty(self) -> bool:
 
 class KVConnectorLogging:
     def __init__(self, kv_transfer_config: KVTransferConfig | None):
-        # This should be called on frontend process.
-        assert not has_kv_transfer_group()
         # Instantiate the connector's stats class.
         if kv_transfer_config and kv_transfer_config.kv_connector:
             self.connector_cls = KVConnectorFactory.get_connector_class(

vllm/entrypoints/chat_utils.py

Lines changed: 3 additions & 2 deletions
@@ -50,7 +50,6 @@
 from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalDataDict, MultiModalUUIDDict
 from vllm.multimodal.utils import MEDIA_CONNECTOR_REGISTRY, MediaConnector
 from vllm.tokenizers import TokenizerLike
-from vllm.tokenizers.mistral import MistralTokenizer
 from vllm.transformers_utils.chat_templates import get_chat_template_fallback_path
 from vllm.transformers_utils.processor import cached_get_processor
 from vllm.utils import random_uuid
@@ -60,6 +59,8 @@
 
 if TYPE_CHECKING:
     import torch
+
+    from vllm.tokenizers.mistral import MistralTokenizer
 else:
     torch = LazyLoader("torch", globals(), "torch")
 
@@ -1832,7 +1833,7 @@ def apply_hf_chat_template(
 
 
 def apply_mistral_chat_template(
-    tokenizer: MistralTokenizer,
+    tokenizer: "MistralTokenizer",
     messages: list[ChatCompletionMessageParam],
     chat_template: str | None,
     tools: list[dict[str, Any]] | None,

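The chat_utils.py change applies the standard pattern of deferring an import to type-checking time and quoting the annotation. A self-contained sketch of that pattern, using `decimal.Decimal` as a stand-in for a heavyweight or cycle-prone module:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only while type checking, not at runtime, which avoids the
    # import cost (and any circular-import risk) on the hot path.
    from decimal import Decimal  # stand-in for a heavyweight module


def as_float(value: "Decimal") -> float:
    # The quoted annotation keeps this valid even though Decimal is not
    # imported at runtime.
    return float(value)
```
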