Commit 518ec6b

[Docs] Clean up README_TUNING.md (#28088)

Signed-off-by: windsonsea <[email protected]>
1 parent 802748b, 1 file changed: +41 −41 lines

# Multi-LoRA Tuning

**Note**: The LoRA configuration folder should be specified by exporting `VLLM_TUNED_CONFIG_FOLDER=/path/to/configs`.
Without this, the shrink/expand kernels will use default configurations.
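
For example, a minimal sketch of setting the variable before vLLM starts (the path is a placeholder; an `export` in the shell works equally well):

```python
import os

# Point the shrink/expand kernels at tuned configs; set this before vLLM
# initializes so the kernels can find it. "/path/to/configs" is a placeholder.
os.environ["VLLM_TUNED_CONFIG_FOLDER"] = "/path/to/configs"

from vllm import LLM  # imported after setting the variable so vLLM sees it
```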

## Tuning Process

Multi-LoRA shrink/expand Triton kernel tuning follows a methodology similar to
[Triton MoE tuning](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py).

1. Define the search space. Here is an example search space (the full grid is enumerated in the sketch after this list):

    ```python
    block_m_range = [16, 32, 64, 128, 256]
    block_n_range = [32, 64, 128, 256]
    block_k_range = [32, 64, 128, 256]
    num_warps_range = [4, 8]
    num_stage_range = [2, 3, 4, 5]
    num_ctas_range = [1]
    split_k_range = [4, 8, 16, 32, 64]
    ```

2. Get all hidden_state sizes and num_slices that the target model uses for a specific TP size.

    For example, you can acquire the info by simply checking
    [add_lora_linear](https://github.com/vllm-project/vllm/blob/main/vllm/lora/punica_wrapper/punica_gpu.py#L181):

    ```python
    print(f"x_shape: {x.view(-1, x.shape[-1]).shape}")
    print(f"num_slices: {len(output_slices)}")
    for i in range(len(output_slices)):
        print(f"a{i} shape: {lora_a_stacked[i].shape}")
        print(f"b{i} shape: {lora_b_stacked[i].shape}")
    print("y_shape", y.shape)
    ```

3. Benchmark the shrink/expand kernel runtime with each kernel configuration generated from the pre-defined search space,
   performing a grid search to find the optimal kernel configuration; see the grid-search skeleton after this list.
   vLLM's [benchmark_lora.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_lora.py)
   can be used to search for configurations for different shapes.
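
For intuition, the grid search simply enumerates every combination of the ranges from step 1. A minimal sketch (the dictionary keys are illustrative, not vLLM's exact config field names):

```python
import itertools

# Ranges from the example search space in step 1.
block_m_range = [16, 32, 64, 128, 256]
block_n_range = [32, 64, 128, 256]
block_k_range = [32, 64, 128, 256]
num_warps_range = [4, 8]
num_stage_range = [2, 3, 4, 5]
num_ctas_range = [1]
split_k_range = [4, 8, 16, 32, 64]

# Each candidate kernel configuration is one point in the grid.
keys = ("block_m", "block_n", "block_k", "num_warps",
        "num_stages", "num_ctas", "split_k")
candidate_configs = [
    dict(zip(keys, combo))
    for combo in itertools.product(
        block_m_range, block_n_range, block_k_range, num_warps_range,
        num_stage_range, num_ctas_range, split_k_range,
    )
]
print(len(candidate_configs))  # 5 * 4 * 4 * 2 * 4 * 1 * 5 = 3200
```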

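A grid-search skeleton for step 3 (a sketch, not benchmark_lora.py's actual logic; `run_kernel` is a hypothetical callable that launches the shrink or expand kernel with a given configuration for one shape):

```python
import triton.testing

def find_best_config(run_kernel, candidate_configs):
    """Time every candidate config and keep the fastest (a sketch)."""
    best_cfg, best_ms = None, float("inf")
    for cfg in candidate_configs:
        # triton.testing.do_bench returns the measured runtime in ms.
        ms = triton.testing.do_bench(lambda: run_kernel(**cfg))
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```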

## Config Files

### File Naming

| Kernel Type | File Name Template | Example |
|---------------------------|---------------------------------------------|----------------------------------------------|
| shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
| expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
| fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` |
| fused_moe_lora_w13_expand | `{gpu_name}_FUSED_MOE_LORA_W13_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_EXPAND.json` |
| fused_moe_lora_w2_shrink | `{gpu_name}_FUSED_MOE_LORA_W2_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_SHRINK.json` |
| fused_moe_lora_w2_expand | `{gpu_name}_FUSED_MOE_LORA_W2_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_EXPAND.json` |

The `gpu_name` can be automatically detected by calling `torch.cuda.get_device_name()`.
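
For example, a sketch of deriving a config file name for the current GPU (the space-to-underscore normalization mirrors the examples above and is illustrative):

```python
import torch

# Normalize the device name to match the file names above (illustrative),
# e.g. "NVIDIA H200" -> "NVIDIA_H200".
gpu_name = torch.cuda.get_device_name().replace(" ", "_")
print(f"{gpu_name}_SHRINK.json")  # e.g. NVIDIA_H200_SHRINK.json
```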

### JSON Structure

Optimal kernel configuration files are saved as JSON files with the structure `config_data[max_loras][num_slices][m][k][n][i]`,
where `i` is an optional dimension in the `fused_moe_lora` configuration, representing the intermediate size of the MoE layer.
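
For instance, a sketch of looking up one tuned configuration (the file name, dimension values, and returned fields are hypothetical; real files come from the tuning process above):

```python
import json

# Load a tuned shrink config and index it by the nested dimensions
# config_data[max_loras][num_slices][m][k][n]. JSON keys are strings.
with open("NVIDIA_H200_SHRINK.json") as f:
    config_data = json.load(f)

kernel_config = config_data["1"]["2"]["16"]["4096"]["11008"]
print(kernel_config)  # hypothetically: {"block_m": 16, "block_n": 64, ...}
```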
