Commit 130aa8c

Add load pattern configuration guide to benchmarks (#26886)
Signed-off-by: Matvei Pashkovskii <[email protected]>
Signed-off-by: Matvei Pashkovskii <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
1 parent e3d8186 commit 130aa8c

2 files changed: +67 -0 lines changed

docs/contributing/benchmarks.md

Lines changed: 67 additions & 0 deletions
@@ -321,6 +321,73 @@ The following arguments can be used to control the ramp-up:
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.

##### Load Pattern Configuration

vLLM's benchmark serving script simulates load patterns through three parameters that control request generation and concurrency:

###### Load Pattern Control Parameters

- `--request-rate`: Target request generation rate in requests per second. Set to `inf` for maximum-throughput testing, or to a finite value for controlled load simulation.
- `--burstiness`: Traffic variability, modeled with a Gamma distribution (must be > 0). Lower values create burstier traffic; higher values create more uniform traffic.
- `--max-concurrency`: Maximum number of concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.

The three parameters work together. `--request-rate` defaults to `inf`, which sends all requests immediately for maximum-throughput testing. With a finite rate, inter-arrival times are drawn from a Gamma distribution; at the default `--burstiness=1.0` this reduces to a Poisson process. `--burstiness` only takes effect when `--request-rate` is finite: a value of 1.0 gives natural Poisson traffic, lower values (0.1-0.5) give bursty patterns, and higher values (2.0-5.0) give more uniform spacing.

`--max-concurrency` defaults to `None` (unlimited) and can be set to simulate real-world constraints where a load balancer or API gateway limits concurrent connections. Combined, these parameters cover everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.

The `--burstiness` parameter controls request arrival patterns through the Gamma distribution of inter-arrival times, where:

- Shape parameter: `burstiness` value
- Coefficient of Variation (CV): $\frac{1}{\sqrt{\text{burstiness}}}$
- Traffic characteristics:
    - `burstiness = 0.1`: Highly bursty traffic (CV ≈ 3.16) - stress testing
    - `burstiness = 1.0`: Natural Poisson traffic (CV = 1.0) - realistic simulation
    - `burstiness = 5.0`: Near-uniform traffic (CV ≈ 0.45) - controlled load testing

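The relationship between `burstiness` and the coefficient of variation can be checked numerically. The following is a minimal sketch, not the benchmark script's actual code: it assumes inter-arrival times are drawn from a Gamma distribution with shape `burstiness` and a scale chosen so that the mean interval stays at `1 / request_rate`. The function names are hypothetical.

```python
import math
import random
import statistics


def theoretical_cv(burstiness: float) -> float:
    """CV of a Gamma distribution with shape `burstiness`: 1 / sqrt(shape)."""
    return 1.0 / math.sqrt(burstiness)


def sample_intervals(request_rate: float, burstiness: float, n: int, seed: int = 0) -> list[float]:
    """Draw n inter-arrival times (seconds) with mean 1 / request_rate.

    Assumption: scale = 1 / (request_rate * burstiness), so that
    mean = shape * scale = 1 / request_rate regardless of burstiness.
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


def empirical_cv(intervals: list[float]) -> float:
    """Population standard deviation divided by the mean of the intervals."""
    return statistics.pstdev(intervals) / statistics.fmean(intervals)


if __name__ == "__main__":
    for b in (0.1, 1.0, 5.0):
        intervals = sample_intervals(request_rate=10.0, burstiness=b, n=50_000)
        print(f"burstiness={b}: theoretical CV={theoretical_cv(b):.2f}, "
              f"empirical CV={empirical_cv(intervals):.2f}")
```

Because the scale shrinks as the shape grows, changing `burstiness` in this sketch alters only the variability of the intervals, not the average request rate.
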
![Load Pattern Examples](../assets/contributing/load-pattern-examples.png)

*Figure: Load pattern examples for each use case. Top row: request arrival timelines showing cumulative requests over time. Bottom row: inter-arrival time distributions showing traffic variability. Each column represents a different use case with its parameter settings and resulting traffic characteristics.*

Load Pattern Recommendations by Use Case:

| Use Case | Burstiness | Request Rate (req/s) | Max Concurrency | Description |
| --- | --- | --- | --- | --- |
| Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
| Realistic Testing | 1.0 | Moderate (5-20) | Unlimited | Natural Poisson traffic patterns for baseline performance |
| Stress Testing | 0.1-0.5 | High (20-100) | Unlimited | Challenging burst patterns to test resilience |
| Latency Profiling | 2.0-5.0 | Low (1-10) | Unlimited | Uniform load for consistent timing analysis |
| Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
| SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |

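The table rows can be captured as reusable flag sets. The sketch below is illustrative only: the `PATTERNS` dictionary, the `build_flags` helper, and the specific values (picked from within the ranges in the table, with an arbitrary concurrency limit of 64) are assumptions, not part of vLLM. It assembles only the three flags discussed here; the rest of the benchmark command line is left to you.

```python
from typing import Optional

# Hypothetical presets: (burstiness, request_rate, max_concurrency).
# None means "do not pass the flag"; values are picked from the table's ranges.
PATTERNS: dict[str, tuple[Optional[float], float, Optional[int]]] = {
    "maximum_throughput": (None, float("inf"), 64),  # 64 is an arbitrary example limit
    "realistic_testing": (1.0, 10.0, None),
    "stress_testing": (0.1, 50.0, None),
    "latency_profiling": (5.0, 2.0, None),
}


def build_flags(pattern: str) -> list[str]:
    """Return the load-pattern flags for a preset as a list of CLI arguments."""
    burstiness, request_rate, max_concurrency = PATTERNS[pattern]
    flags = [f"--request-rate={request_rate}"]  # float("inf") formats as "inf"
    if burstiness is not None:
        flags.append(f"--burstiness={burstiness}")
    if max_concurrency is not None:
        flags.append(f"--max-concurrency={max_concurrency}")
    return flags


if __name__ == "__main__":
    for name in PATTERNS:
        print(name, " ".join(build_flags(name)))
```
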
These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.

The **Maximum Throughput** pattern (`--request-rate=inf --max-concurrency=<limit>`) is the most commonly used configuration for production benchmarking. It simulates real-world deployment architectures where:

- Users send requests as fast as they can (infinite rate)
- A load balancer or API gateway caps the number of concurrent connections
- The system operates at its concurrency limit, revealing its throughput capacity
- `--burstiness` has no effect, since request timing is not controlled when the rate is infinite

This pattern helps determine the optimal concurrency setting for your production load balancer.

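One way to picture this pattern is a semaphore in front of the server: all requests are issued at once, but only a fixed number run at a time. The sketch below is a self-contained simulation, not vLLM's benchmark code; `run_closed_loop`, `fake_request`, the sleep-based service time, and the limit of 4 are hypothetical stand-ins.

```python
import asyncio


async def run_closed_loop(num_requests: int, max_concurrency: int,
                          service_time: float = 0.01) -> int:
    """Issue all requests immediately, cap in-flight requests, return the observed peak."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once max_concurrency requests are in flight
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(service_time)  # stand-in for the server handling the request
            in_flight -= 1

    # "Infinite" request rate: every request is created up front with no pacing.
    await asyncio.gather(*(fake_request() for _ in range(num_requests)))
    return peak


if __name__ == "__main__":
    peak = asyncio.run(run_closed_loop(num_requests=20, max_concurrency=4))
    print(f"peak concurrent requests: {peak}")
```

In this simulation the number of in-flight requests never exceeds the limit, so the server stays saturated at exactly the configured concurrency for as long as requests remain queued.
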
To configure load patterns effectively, especially for the **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports the KV cache configuration, which directly informs your load testing parameters:

```text
GPU KV cache size: 15,728,640 tokens
Maximum concurrency for 8,192 tokens per request: 1920
```

Where:

- GPU KV cache size: Total tokens that can be cached across all concurrent requests
- Maximum concurrency: Theoretical maximum number of concurrent requests for the given `max_model_len`
- Calculation: `max_concurrency = kv_cache_size / max_model_len`

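As a quick check of the calculation, the values in the sample log can be reproduced with a single division. The function name is hypothetical; the numbers come from the example log.

```python
def max_concurrency(kv_cache_size_tokens: int, max_model_len: int) -> float:
    """Theoretical number of full-length requests that fit in the KV cache."""
    return kv_cache_size_tokens / max_model_len


if __name__ == "__main__":
    # 15,728,640 cached tokens at 8,192 tokens per request
    print(max_concurrency(15_728_640, 8_192))  # 1920.0
```
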
Using KV cache metrics for load pattern configuration:

- For Capacity Planning: Set `--max-concurrency` to 80-90% of the reported maximum to test realistic resource constraints
- For SLA Validation: Use the reported maximum as your SLA limit so that compliance testing matches production capacity
- For Realistic Testing: Monitor memory usage when approaching the theoretical limit to understand sustainable request rates
- Request rate guidance: Use the KV cache size to estimate sustainable request rates for your specific workload and sequence lengths

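The capacity-planning guidance translates into a simple range, and a rough request-rate estimate follows from Little's law (average concurrency = arrival rate × average time in the system). The sketch below is an approximation under stated assumptions: the helper names are hypothetical, the 8-second average request latency is invented for illustration, and real sustainable rates depend on your workload's actual sequence lengths.

```python
def capacity_planning_range(reported_max: int, low_pct: int = 80,
                            high_pct: int = 90) -> tuple[int, int]:
    """Return the (low, high) --max-concurrency values at 80-90% of the reported maximum."""
    return reported_max * low_pct // 100, reported_max * high_pct // 100


def estimated_sustainable_rate(concurrency: int, avg_latency_s: float) -> float:
    """Little's law rearranged: arrival rate = concurrency / average latency."""
    return concurrency / avg_latency_s


if __name__ == "__main__":
    low, high = capacity_planning_range(1920)
    print(f"--max-concurrency between {low} and {high}")
    # Assumed average end-to-end latency of 8 seconds per request (illustrative only).
    print(f"approximate request rate at {low} concurrent: "
          f"{estimated_sustainable_rate(low, 8.0)} req/s")
```
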
</details>

#### 📈 Offline Throughput Benchmark
