 |`torchair_graph_config`| dict |`{}`| Configuration options for torchair graph mode |
+|`ascend_scheduler_config`| dict |`{}`| Configuration options for ascend scheduler |
 |`weight_prefetch_config`| dict |`{}`| Configuration options for weight prefetch |
 |`refresh`| bool |`false`| Whether to refresh the global Ascend configuration content. This is usually used by RLHF or UT/e2e test cases. |
 |`expert_map_path`| str |`None`| When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
@@ -60,6 +61,18 @@ The details of each configuration option are as follows:
 |`enable_kv_nz`| bool |`False`| Whether to enable KV Cache NZ layout. This option only takes effect on models using MLA (for example, DeepSeek). |
 |`enable_super_kernel`| bool |`False`| Whether to enable super kernel to fuse operators in DeepSeek MoE layers. This option only takes effect on MoE models using dynamic w8a8 quantization. |
 
+**ascend_scheduler_config**
+
+| Name | Type | Default | Description |
+| ---- | ---- | ------- | ----------- |
+|`enabled`| bool |`False`| Whether to enable the ascend scheduler for the V1 engine. |
+|`enable_pd_transfer`| bool |`False`| Whether to enable P-D transfer. When enabled, decode starts only after prefill of all requests is done. This option only takes effect on offline inference. |
+|`decode_max_num_seqs`| int |`0`| Overrides `max_num_seqs` for the decode phase when P-D transfer is enabled. This option only takes effect when `enable_pd_transfer` is True. |
+|`max_long_partial_prefills`| Union[int, float] |`float('inf')`| The maximum number of prompts longer than `long_prefill_token_threshold` that will be prefilled concurrently. |
+|`long_prefill_token_threshold`| Union[int, float] |`float('inf')`| A request is considered long if its prompt is longer than this number of tokens. |
+
+`ascend_scheduler_config` also supports the options from [vLLM scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to `ascend_scheduler_config` as well.
+
 **weight_prefetch_config**
 
 | Name | Type | Default | Description |
@@ -80,6 +93,12 @@ An example of additional configuration is as follows:
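As a reading aid for the `ascend_scheduler_config` options added in this diff, here is a minimal sketch of how such a configuration might be assembled. It is not taken from the PR: the option names come from the table above, but every value, the `check_scheduler_config` helper, and the assumption that the additional configuration can be supplied as a dict or a JSON string are illustrative only.

```python
import json

# Option names come from the ascend_scheduler_config table in this diff;
# every value below is a hypothetical choice for illustration only.
ascend_scheduler_config = {
    "enabled": True,
    "enable_pd_transfer": True,
    "decode_max_num_seqs": 32,
    "long_prefill_token_threshold": 4096,
    "max_long_partial_prefills": 1,
    # An option borrowed from the vLLM scheduler config, as the diff describes.
    "enable_chunked_prefill": True,
}

additional_config = {"ascend_scheduler_config": ascend_scheduler_config}


def check_scheduler_config(cfg):
    """Return warnings for the one dependency the table states explicitly.

    Hypothetical helper, not part of vllm-ascend.
    """
    warnings = []
    if cfg.get("decode_max_num_seqs", 0) and not cfg.get("enable_pd_transfer", False):
        warnings.append(
            "decode_max_num_seqs only takes effect when enable_pd_transfer is True"
        )
    return warnings


# The float('inf') defaults are left unset here: json.dumps would emit them
# as Infinity, which is not standard JSON.
serialized = json.dumps(additional_config)

if __name__ == "__main__":
    print(check_scheduler_config(ascend_scheduler_config))
    print(serialized)
```

The helper only encodes the dependency the table spells out (`decode_max_num_seqs` requires `enable_pd_transfer`). Any other interaction between options would need to be confirmed against the actual implementation.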