# Multi-LoRA Tuning

**Note**: The LoRA configuration folder should be specified by exporting `VLLM_TUNED_CONFIG_FOLDER=/path/to/configs`.
Without this, the shrink/expand kernels will use default configurations.

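For example, a launcher script can set the variable before vLLM is initialized (a minimal sketch; the model name is just illustrative):

```python
import os

# VLLM_TUNED_CONFIG_FOLDER must be visible to vLLM before the engine
# is created, equivalent to `export VLLM_TUNED_CONFIG_FOLDER=...`.
os.environ["VLLM_TUNED_CONFIG_FOLDER"] = "/path/to/configs"

from vllm import LLM  # noqa: E402

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
```
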
## Tuning Process

Multi-LoRA shrink/expand Triton kernel tuning follows a methodology similar to
[Triton MoE tuning](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py).

1. Define the search space. An example search space:

    ```python
    block_m_range = [16, 32, 64, 128, 256]
    block_n_range = [32, 64, 128, 256]
    block_k_range = [32, 64, 128, 256]
    num_warps_range = [4, 8]
    num_stage_range = [2, 3, 4, 5]
    num_ctas_range = [1]
    split_k_range = [4, 8, 16, 32, 64]
    ```
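
    The set of candidate configurations is the Cartesian product of these ranges. A minimal sketch of enumerating it (the dict keys are illustrative, not necessarily vLLM's exact parameter names):

    ```python
    import itertools

    # One dict per combination of the tuning parameters defined above.
    candidate_configs = [
        {
            "BLOCK_M": bm, "BLOCK_N": bn, "BLOCK_K": bk,
            "num_warps": nw, "num_stages": ns,
            "num_ctas": nc, "SPLIT_K": sk,
        }
        for bm, bn, bk, nw, ns, nc, sk in itertools.product(
            block_m_range, block_n_range, block_k_range,
            num_warps_range, num_stage_range,
            num_ctas_range, split_k_range,
        )
    ]
    ```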
2. Get all `hidden_state` sizes and `num_slices` that the target model uses for a specific TP size.

    For example, you can acquire this information by instrumenting
    [add_lora_linear](https://github.com/vllm-project/vllm/blob/main/vllm/lora/punica_wrapper/punica_gpu.py#L181)
    with a few prints:

    ```python
    print(f"x_shape: {x.view(-1, x.shape[-1]).shape}")
    print(f"num_slices: {len(output_slices)}")
    for i in range(len(output_slices)):
        print(f"a{i} shape: {lora_a_stacked[i].shape}")
        print(f"b{i} shape: {lora_b_stacked[i].shape}")
    print("y_shape", y.shape)
    ```
3. Benchmark the shrink/expand kernel runtime with each kernel configuration generated from the predefined search space,
    performing a grid search to find the optimal configuration.
    vLLM's [benchmark_lora.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_lora.py)
    can be used to search for configurations for different shapes.
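
    A minimal sketch of the grid search itself, assuming a hypothetical `bench_shrink(shape, config)` helper that returns the kernel latency in milliseconds, plus the `shapes` collected in step 2 and the `candidate_configs` enumerated in step 1 (in practice, benchmark_lora.py plays this role):

    ```python
    best = {}  # fastest configuration found for each problem shape
    for shape in shapes:  # (m, k, n) tuples collected in step 2
        latencies = [
            (bench_shrink(shape, config), config)
            for config in candidate_configs  # enumerated in step 1
        ]
        best[shape] = min(latencies, key=lambda t: t[0])[1]
    ```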
## Config Files

### File Naming

| Kernel Type | File Name Template | Example |
|---------------------------|--------------------------------------------|---------------------------------------------|
| shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
| expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
| fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` |
| fused_moe_lora_w13_expand | `{gpu_name}_FUSED_MOE_LORA_W13_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_EXPAND.json` |
| fused_moe_lora_w2_shrink | `{gpu_name}_FUSED_MOE_LORA_W2_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_SHRINK.json` |
| fused_moe_lora_w2_expand | `{gpu_name}_FUSED_MOE_LORA_W2_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_EXPAND.json` |

The `gpu_name` can be automatically detected by calling `torch.cuda.get_device_name()`.

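The examples above suggest that spaces in the reported device name become underscores in the file name. A small sketch of deriving the expected names (the `replace` step is an assumption based on those examples):

```python
import torch

# torch.cuda.get_device_name() returns e.g. "NVIDIA H200".
gpu_name = torch.cuda.get_device_name().replace(" ", "_")

shrink_config = f"{gpu_name}_SHRINK.json"       # NVIDIA_H200_SHRINK.json
expand_config = f"{gpu_name}_EXPAND_TRUE.json"  # add_input=True variant
```
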
### JSON Structure

Optimal kernel configuration files are saved as JSON files with the structure `config_data[max_loras][num_slices][m][k][n][i]`,
where `i` is an optional dimension in the `fused_moe_lora` configuration, representing the intermediate size of the MoE layer.
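
For illustration, loading a shrink config and looking up one shape might look like this (a sketch; the string keys and the inner parameter dict are assumptions about the on-disk layout):

```python
import json

with open("NVIDIA_H200_SHRINK.json") as f:
    config_data = json.load(f)

# config_data[max_loras][num_slices][m][k][n] -> kernel parameters for
# that shape (the optional trailing `i` level only appears for
# fused_moe_lora configs).
kernel_config = config_data["1"]["2"]["16"]["4096"]["16"]
print(kernel_config)  # e.g. {"BLOCK_M": 16, "BLOCK_N": 64, ...}
```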