
Commit 4101040

Merge branch 'add_mamba1_apc_support' of github.com:Josephasafg/vllm into add_mamba1_apc_support
2 parents bb2819d + 6819c74

File tree: 35 files changed (+1040 −547 lines)

.buildkite/test-pipeline.yaml

Lines changed: 10 additions & 0 deletions
@@ -340,6 +340,16 @@ steps:
   commands:
     - pytest -v -s v1/attention
 
+- label: V1 Test attention (B200) # 10min
+  timeout_in_minutes: 30
+  gpu: b200
+  source_file_dependencies:
+    - vllm/v1/attention
+    - tests/v1/attention
+  commands:
+    - export VLLM_DISABLE_FLASHINFER_PREFILL=1 # TODO: FI prefill is bugged and causes incorrectness, fix this
+    - pytest -v -s v1/attention
+
 - label: V1 Test others (CPU) # 5 mins
   source_file_dependencies:
     - vllm/

docs/configuration/tpu.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU wo
 
 ## Get started
 
-Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md).
+Looking for setup and installation instructions? Find them [here](https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/).
 
 ### TPU workload sizing

docs/features/batch_invariance.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@

# Batch Invariance

!!! note
    Batch invariance is currently in beta. Some features are still under active development.
    Track progress and planned improvements at <https://github.com/vllm-project/vllm/issues/27433>

This document shows how to enable batch invariance in vLLM. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch.

## Motivation

Batch invariance is crucial for several use cases:

- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.

## Hardware Requirements

Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher:

- **H-series**: H100, H200
- **B-series**: B100, B200
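Editor's aside (not part of the committed file): assuming PyTorch with CUDA is available, one quick way to confirm a GPU meets the compute capability requirement above is:

```python
import torch

# Compute capability as a (major, minor) tuple, e.g. (9, 0) on H100/H200.
major, minor = torch.cuda.get_device_capability()

if (major, minor) >= (9, 0):
    print(f"Compute capability {major}.{minor}: meets the >= 9.0 requirement")
else:
    print(f"Compute capability {major}.{minor}: below 9.0, batch invariance unsupported")
```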
## Enabling Batch Invariance

Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:

```bash
export VLLM_BATCH_INVARIANT=1
```

### Online Inference (Server Mode)

To start a vLLM server with batch invariance enabled:

```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Then use the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)

print(response.choices[0].text)
```

### Offline Inference

For offline batch inference with batch invariance:

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```

## Tested Models

Batch invariance has been tested and verified on the following models:

- **DeepSeek series**: `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- **Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
- **Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- **Llama 3**: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).

## Implementation Details

When batch invariance is enabled, vLLM:

1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism (such as custom all-reduce operations in tensor parallel mode)

!!! note
    Enabling batch invariance may impact performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.
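Editor's sketch (not part of the committed file): one way to exercise the guarantee above is to generate the same prompt alone and inside a larger batch, then compare the completions. This reuses the offline API from the earlier example; the prompts and comparison are illustrative only.

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, seed=42, max_tokens=64)

target = "The future of AI is"
filler = ["Machine learning enables", "Deep learning models can"]

# Generate the target prompt on its own, then again inside a larger batch.
alone = llm.generate([target], params)[0].outputs[0].text
batched = llm.generate([target] + filler, params)[0].outputs[0].text

# With batch invariance enabled, the two completions should match exactly.
assert alone == batched, "outputs differ across batch sizes"
print("Identical output for batch size 1 and batch size 3")
```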
## Future Improvements

The batch invariance feature is under active development. Planned improvements include:

- Support for additional GPU architectures
- Expanded model coverage
- Performance optimizations
- Additional testing and validation

For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).

docs/features/nixl_connector_usage.md

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
   - Default: 5600
   - **Required for both prefiller and decoder instances**
   - Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
-  - For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank * tp_size + tp_rank (e.g., with `--tensor-parallel-size=4` and base_port=5600, tp_rank 0..3 use ports 5600, 5601, 5602, 5603 on that node).
+  - For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank (e.g., with `--data-parallel-size=2` and base_port=5600, dp_rank 0..1 use ports 5600 and 5601 on that node).
   - Used for the initial NIXL handshake between the prefiller and the decoder
 
 - `VLLM_NIXL_SIDE_CHANNEL_HOST`: Host for side channel communication
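Editor's note: a minimal sketch of the revised port scheme in the changed line above; the helper function is hypothetical and not part of vLLM.

```python
def side_channel_port(base_port: int, dp_rank: int) -> int:
    # New scheme: one side-channel port per data-parallel rank on a node.
    return base_port + dp_rank

# e.g. --data-parallel-size=2 with the default base_port=5600
print([side_channel_port(5600, rank) for rank in range(2)])  # [5600, 5601]
```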

docs/getting_started/installation/.nav.yml

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@ nav:
   - README.md
   - gpu.md
   - cpu.md
-  - google_tpu.md
+  - TPU: https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/

docs/getting_started/installation/README.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,6 @@ vLLM supports the following hardware platforms:
 - [ARM AArch64](cpu.md#arm-aarch64)
 - [Apple silicon](cpu.md#apple-silicon)
 - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
-- [Google TPU](google_tpu.md)
 
 ## Hardware Plugins
 
@@ -20,6 +19,7 @@ The backends below live **outside** the main `vllm` repository and follow the
 
 | Accelerator | PyPI / package | Repository |
 |-------------|----------------|------------|
+| Google TPU | `tpu-inference` | <https://github.com/vllm-project/tpu-inference> |
 | Ascend NPU | `vllm-ascend` | <https://github.com/vllm-project/vllm-ascend> |
 | Intel Gaudi (HPU) | N/A, install from source | <https://github.com/vllm-project/vllm-gaudi> |
 | MetaX MACA GPU | N/A, install from source | <https://github.com/MetaX-MACA/vLLM-metax> |

docs/getting_started/installation/google_tpu.md

Lines changed: 0 additions & 193 deletions
This file was deleted.

0 commit comments
