File: docs/contributing/benchmarks.md (67 additions, 0 deletions)
@@ -321,6 +321,73 @@ The following arguments can be used to control the ramp-up:
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.

##### Load Pattern Configuration

vLLM's benchmark serving script simulates load patterns through three parameters that control request generation and concurrency behavior:

###### Load Pattern Control Parameters

- `--request-rate`: Target request generation rate in requests per second. Set to `inf` for maximum throughput testing, or to a finite value for controlled load simulation.
- `--burstiness`: Traffic variability, modeled with a Gamma distribution (must be > 0). Values below 1 create burstier traffic; values above 1 create more uniform traffic.
- `--max-concurrency`: Maximum number of concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.

These parameters work together to shape the load. `--request-rate` defaults to `inf`, which sends all requests immediately for maximum throughput testing. With a finite rate, request timing follows either a Poisson process (the default, `--burstiness=1.0`) or, for other burstiness values, a Gamma distribution. `--burstiness` only takes effect when `--request-rate` is finite: a value of 1.0 yields Poisson traffic, lower values (0.1-0.5) yield bursty patterns, and higher values (2.0-5.0) yield more uniform spacing.

`--max-concurrency` defaults to `None` (unlimited) but can be set to simulate real-world constraints, such as a load balancer or API gateway that limits concurrent connections. Combined, these parameters cover everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.

Mathematically, `--burstiness` controls request arrival patterns through a Gamma distribution where:

- Shape parameter: `burstiness` value
- Coefficient of Variation (CV): $\frac{1}{\sqrt{\text{burstiness}}}$

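
The relationship between `--burstiness`, the Gamma shape, and the CV can be checked numerically. The sketch below is an illustration only, not the benchmark script's actual code. It assumes inter-arrival gaps are drawn from a Gamma distribution with shape `burstiness` and scale `1 / (request_rate * burstiness)`, so that the mean gap stays at `1 / request_rate`. The function names `sample_intervals` and `coefficient_of_variation` are hypothetical.

```python
import math
import random
import statistics


def sample_intervals(request_rate, burstiness, n, seed=0):
    """Draw n inter-arrival gaps (seconds) from a Gamma distribution.

    Assumption: shape = burstiness, scale = 1 / (request_rate * burstiness),
    which keeps the mean gap at 1 / request_rate for any burstiness.
    """
    if burstiness <= 0:
        raise ValueError("burstiness must be > 0")
    if math.isinf(request_rate):
        # Infinite rate: all requests are sent immediately, no gaps.
        return [0.0] * n
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


if __name__ == "__main__":
    for burstiness in (0.25, 1.0, 4.0):
        gaps = sample_intervals(request_rate=10.0, burstiness=burstiness, n=20000)
        print(
            f"burstiness={burstiness}: "
            f"mean gap={statistics.fmean(gaps):.4f}s, "
            f"measured CV={coefficient_of_variation(gaps):.3f}, "
            f"expected CV={1 / math.sqrt(burstiness):.3f}"
        )
```

Under these assumptions the measured CV should approach $\frac{1}{\sqrt{\text{burstiness}}}$ as the sample grows, while the mean gap is unaffected by the burstiness value.
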

*Figure: Load pattern examples for each use case. Top row: Request arrival timelines showing cumulative requests over time. Bottom row: Inter-arrival time distributions showing traffic variability patterns. Each column represents a different use case with its specific parameter settings and resulting traffic characteristics.*

Load Pattern Recommendations by Use Case:

| Use Case | Burstiness | Request Rate | Max Concurrency | Description |
| --- | --- | --- | --- | --- |
| Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |

These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.

The **Maximum Throughput** pattern (`--request-rate=inf --max-concurrency=<limit>`) is the most commonly used configuration for production benchmarking. This simulates real-world deployment architectures where:

- Users send requests as fast as they can (infinite rate)
- A load balancer or API gateway controls the maximum concurrent connections
- The system operates at its concurrency limit, revealing true throughput capacity
- `--burstiness` has no effect since request timing is not controlled when rate is infinite

This pattern helps determine optimal concurrency settings for your production load balancer configuration.

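
The interaction between an infinite request rate and a concurrency cap can be sketched with a semaphore. This is a simplified illustration, not the benchmark script itself: `run_closed_loop` is a hypothetical helper, and `asyncio.sleep` stands in for the real HTTP request to the server.

```python
import asyncio


async def run_closed_loop(num_requests, max_concurrency, service_time=0.01):
    """Launch all requests at once, but let only max_concurrency run together.

    Returns the peak number of requests that were in flight simultaneously.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def one_request():
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(service_time)  # placeholder for the real request
            in_flight -= 1

    # "Infinite rate": every request is created immediately, with no pacing.
    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    return peak


if __name__ == "__main__":
    peak = asyncio.run(run_closed_loop(num_requests=50, max_concurrency=8))
    print(f"peak concurrent requests: {peak}")
```

Because every request is ready from the start, the system sits at the concurrency cap until the queue drains, which is why arrival timing (and therefore `--burstiness`) plays no role in this pattern.
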

To configure load patterns effectively, especially for **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports KV cache configuration that directly impacts your load testing parameters:

```text
GPU KV cache size: 15,728,640 tokens
Maximum concurrency for 8,192 tokens per request: 1920
```

Where:

- GPU KV cache size: Total tokens that can be cached across all concurrent requests
- Maximum concurrency: Theoretical maximum concurrent requests for the given `max_model_len`

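
The two numbers in the log are consistent with a plain division of the cache size by the per-request token budget. The sketch below reproduces that arithmetic. It assumes integer division, which matches the log output above; the real computation inside vLLM may account for details not shown here. `theoretical_max_concurrency` is a hypothetical helper name.

```python
def theoretical_max_concurrency(kv_cache_tokens, tokens_per_request):
    """Number of requests of a given token length that fit in the KV cache.

    Assumption: simple integer division of total cached tokens by the
    per-request token budget.
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return kv_cache_tokens // tokens_per_request


if __name__ == "__main__":
    # Values taken from the startup log shown above.
    print(theoretical_max_concurrency(15_728_640, 8_192))  # 1920
```

A value like this can serve as an upper bound when choosing `--max-concurrency`; requests shorter than `max_model_len` occupy fewer cached tokens, so the practical limit depends on your workload.
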