### What this PR does / why we need it?
Bugfix for MTP (multi-token prediction) in full graph mode.
### Does this PR introduce _any_ user-facing change?
no
---------
Signed-off-by: zouyida2052 <[email protected]>
docs/source/community/versioning_policy.md (1 addition, 1 deletion)
@@ -74,7 +74,7 @@ vLLM Ascend includes two branches: main and dev.
Commits should typically be merged into the main branch first, and only then backported to the dev branch, to reduce maintenance costs as much as possible.
docs/source/developer_guide/feature_guide/ModelRunner_prepare_inputs.md (13 additions, 13 deletions)
@@ -92,7 +92,7 @@ As the maximum number of tokens that can be scheduled is 10, the scheduled token
##### 1. Get token positions:
First, determine which request each token belongs to: tokens 0–2 are assigned to **request_0**, tokens 3–4 to **request_1**, and tokens 5–9 to **request_2**. We represent this mapping with `request indices`; in this example, `request indices` = `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.
For each request, use **the number of computed tokens** + **the relative position of the currently scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1, ..., 0 + 4]`), then concatenate them together: `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`.
Note: the actual code uses a more efficient way (based on `request indices`) to create the positions.
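For illustration, here is a minimal NumPy sketch of how `request indices` and `positions` could be built for the toy example above (the variable names are illustrative and are not taken from the actual ModelRunner code):

```python
import numpy as np

# Toy inputs from the example: 3 requests with 3, 2 and 5 scheduled tokens,
# and no computed tokens yet (all requests are in the prefill stage).
num_scheduled_tokens = np.array([3, 2, 5])
num_computed_tokens = np.array([0, 0, 0])

# request indices: which request each scheduled token belongs to
request_indices = np.repeat(np.arange(len(num_scheduled_tokens)),
                            num_scheduled_tokens)
# -> [0 0 0 1 1 2 2 2 2 2]

# relative position of each scheduled token within its own request
relative_positions = np.concatenate(
    [np.arange(n) for n in num_scheduled_tokens])
# -> [0 1 2 0 1 0 1 2 3 4]

# positions = computed tokens of the owning request + relative position
positions = num_computed_tokens[request_indices] + relative_positions
# -> [0 1 2 0 1 0 1 2 3 4]
```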
@@ -152,33 +152,33 @@ The KV cache block in the device memory is like:
Let's say `K = max model len / block size = 6`, and then we can get each token's `device block number`.
The workflow for building the `slot mapping`:
1. Get `block table indices` using `K`, `positions` and `request indices`.
Purpose: for each token, these indices are used to select its `device block number` from the `block table`.
2. Get `device block number` using `block table indices`.
Purpose: `device block number` indicates which device block each token belongs to.
3. Get `block offsets` using `positions` and `block size`.
Purpose: `block offsets` indicates the offset of each token within its block.
4. Construct `slot mapping` using `device block number` and `block offsets`.
Purpose: we can use `slot mapping` to store token IDs into the corresponding token slots.
Details:
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size` (`/` here denotes integer division). So it is equal to `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. These indices are used to select the `device block number` from the `block table`.
2. (**Token level**) Use `block table indices` to select the `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`.
3. (**Token level**) `block offsets` can be computed as `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
4. Finally, use `device block number` and `block offsets` to create the `slot mapping`: `device block number * block size + block offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`, as shown in the sketch below.
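A small NumPy sketch of the four token-level steps above, using the toy values from this example (the block table contents and names are illustrative, not the actual ModelRunner data structures):

```python
import numpy as np

block_size = 2
K = 6  # max model len / block size

# Toy block table: row i holds the device block numbers owned by request i,
# padded with 0 for unused entries.
block_table = np.zeros((3, K), dtype=np.int64)
block_table[0, :2] = [1, 2]      # request_0 -> device blocks 1, 2
block_table[1, :1] = [3]         # request_1 -> device block 3
block_table[2, :3] = [4, 5, 6]   # request_2 -> device blocks 4, 5, 6

request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])

# Step 1: per-token index into the flattened block table
block_table_indices = request_indices * K + positions // block_size
# -> [0 0 1 6 6 12 12 13 13 14]

# Step 2: gather each token's device block number
block_numbers = block_table.reshape(-1)[block_table_indices]
# -> [1 1 2 3 3 4 4 5 5 6]

# Step 3: offset of each token within its block
block_offsets = positions % block_size
# -> [0 1 0 0 1 0 1 0 1 0]

# Step 4: final slot mapping
slot_mapping = block_numbers * block_size + block_offsets
# -> [2 3 4 6 7 8 9 10 11 12]
```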
(**Request level**) As we know, the scheduled token counts are `[3, 2, 5]`:
- (**Request level**) Use a prefix sum to calculate `query start location`: `[0, 3, 5, 10]`.
- (**Request level**) All tokens in step 1 are in the prefill stage and the computed token count is 0, so `sequence length` = `[3, 2, 5]`.
- (**Request level**) As mentioned above, `number of computed tokens` are all 0s: `[0, 0, 0]`.
- `number of requests`: `3`
- (**Request level**) `number of tokens`: `[3, 2, 5]`
- `max query len`: `5`
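The request-level quantities above can be derived in a similar way; here is a short sketch under the same toy assumptions (illustrative names only):

```python
import numpy as np

num_scheduled_tokens = np.array([3, 2, 5])
num_computed_tokens = np.array([0, 0, 0])  # all requests are still in prefill

# query start location: prefix sum with a leading 0
query_start_loc = np.concatenate(([0], np.cumsum(num_scheduled_tokens)))
# -> [0 3 5 10]

# sequence length: computed tokens + newly scheduled tokens
seq_lens = num_computed_tokens + num_scheduled_tokens
# -> [3 2 5]

num_reqs = len(num_scheduled_tokens)             # 3
max_query_len = int(num_scheduled_tokens.max())  # 5
```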
@@ -235,7 +235,7 @@ KV cache block in the device memory: