
Conversation


@shiyuan680 shiyuan680 commented Nov 8, 2025

What this PR does / why we need it?

qwen3-next support triton chunk_gated_delta_rule ops

Does this PR introduce any user-facing change?

How was this patch tested?

TTFT reduced by more than half.

co-owners

@OsirisDuan


github-actions bot commented Nov 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds Triton kernel implementations for chunk_gated_delta_rule operations, seemingly for use with Huawei Ascend NPUs within the vLLM framework. The implementation is a substantial port from the fla library, introducing several new files for forward and backward passes. While the effort to optimize these operations is commendable, the current implementation has several critical issues that will prevent it from running correctly. My review has identified missing Python imports, undefined variables causing NameErrors, inconsistent and likely incorrect use of chunk_size, missing parameters in a kernel launch, and a leftover debugging statement. These issues must be addressed to ensure the correctness and performance of the new operations.

Comment on lines 3 to 14
import warnings

import torch

critical

The type hint Optional is used in this file (e.g., on line 188), but it is not imported from the typing module. This will cause a NameError at runtime.

Suggested change
 import warnings
+from typing import Optional
 
 import torch

output_final_state: bool,
cu_seqlens: torch.LongTensor | None = None,
):
g = chunk_local_cumsum(g, chunk_size=64, cu_seqlens=cu_seqlens)

critical

There is an inconsistency in the chunk_size used. Here, chunk_local_cumsum is called with chunk_size=64, but all subsequent chunked operations in both the forward and backward passes use chunk_size=16. Since chunk_local_cumsum performs a cumulative sum within chunks, this discrepancy will likely lead to incorrect calculations in later stages that expect data to be processed in chunks of 16. To ensure correctness, the chunk size should be consistent across all related operations.

Suggested change
-g = chunk_local_cumsum(g, chunk_size=64, cu_seqlens=cu_seqlens)
+g = chunk_local_cumsum(g, chunk_size=16, cu_seqlens=cu_seqlens)
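One way to keep this consistent is to hoist the chunk size into a single module-level constant and pass it to every chunked op (a sketch; the constant name is hypothetical):

# Hypothetical module-level constant so the cumsum and all later
# chunked operations share the same block length.
CHUNK_SIZE = 16

g = chunk_local_cumsum(g, chunk_size=CHUNK_SIZE, cu_seqlens=cu_seqlens)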

g += bos * H + i_h
p_g = tl.make_block_ptr(g, (T,), (H,), (i_t * BT,), (BT,), (0,))
b_g = tl.load(p_g, boundary_check=(0,))
b_o = b_o * exp(b_g)[:, None]

critical

The function exp is used within this Triton kernel, but it is not a standard tl function and has not been imported. This will result in a NameError. In other files within this PR, a custom exp function is imported from fla.ops.utils.op. The same import is needed here. Please add from fla.ops.utils.op import exp to the imports at the top of the file.
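A minimal sketch of the fix, mirroring the import used by the other files in this PR:

# At the top of the file, next to the existing triton imports:
from fla.ops.utils.op import exp  # Triton-compatible elementwise exp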

for num_stages in [2, 3, 4]
],
key=['H', 'K', 'V', 'BT', 'BK', 'BV', 'USE_G', 'USE_G_GAMMA', 'USE_DW'],
**autotune_cache_kwargs,

critical

The variable autotune_cache_kwargs is used in the @triton.autotune decorator, but it is not defined anywhere in this file. This will cause a NameError during module loading. You should define it at the top of the file, for example, by adapting the definition from the fla library.
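A hedged sketch of such a definition (the exact gating used in fla may differ; the signature check below is one safe way to guard it):

import inspect

import triton

# Newer Triton releases let @triton.autotune cache tuning results via a
# `cache_results` keyword; older releases reject it, so only pass the
# keyword when the decorator's signature actually accepts it.
autotune_cache_kwargs = (
    {'cache_results': True}
    if 'cache_results' in inspect.signature(triton.autotune).parameters
    else {}
)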

Comment on lines 477 to 168
def chunk_fwd_o(
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
h: torch.Tensor,
g: torch.Tensor | None = None,
g_gamma: torch.Tensor | None = None,
scale: float | None = None,
cu_seqlens: torch.LongTensor | None = None,
chunk_size: int = 64,
) -> torch.Tensor:
B, T, H, K, V = *q.shape, v.shape[-1]
BT = chunk_size
chunk_indices = prepare_chunk_indices(cu_seqlens, BT) if cu_seqlens is not None else None
NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
if scale is None:
scale = k.shape[-1] ** -0.5

o = torch.empty_like(v)
def grid(meta): return (triton.cdiv(V, meta['BV']), NT, B * H)
chunk_fwd_kernel_o[grid](
q=q,
k=k,
v=v,
h=h,
g=g,
g_gamma=g_gamma,
o=o,
cu_seqlens=cu_seqlens,
chunk_indices=chunk_indices,
scale=scale,
T=T,
H=H,
K=K,
V=V,
BT=BT,
)
return o

critical

The chunk_fwd_kernel_o kernel is not autotuned and requires BK and BV to be passed as constexpr arguments. However, these are missing from the kernel launch call, which will lead to a runtime error. You should define BK and BV and pass them to the kernel, similar to how it's done in other wrapper functions in this file.

def chunk_fwd_o(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    h: torch.Tensor,
    g: torch.Tensor | None = None,
    g_gamma: torch.Tensor | None = None,
    scale: float | None = None,
    cu_seqlens: torch.LongTensor | None = None,
    chunk_size: int = 64,
) -> torch.Tensor:
    B, T, H, K, V = *q.shape, v.shape[-1]
    BT = chunk_size
    chunk_indices = prepare_chunk_indices(cu_seqlens, BT) if cu_seqlens is not None else None
    NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
    if scale is None:
        scale = k.shape[-1] ** -0.5

    if check_shared_mem('hopper', k.device.index):
        CONST_TILING = 128
    # Note: a bare `check_shared_mem` reference here would always be truthy;
    # it presumably should be called, e.g. with the 'ampere' tier.
    elif check_shared_mem('ampere', k.device.index):
        CONST_TILING = 64
    else:
        CONST_TILING = 32
    BK = min(max(triton.next_power_of_2(K), 16), CONST_TILING)
    BV = min(max(triton.next_power_of_2(V), 16), CONST_TILING)

    o = torch.empty_like(v)
    def grid(meta): return (triton.cdiv(V, meta['BV']), NT, B * H)
    chunk_fwd_kernel_o[grid](
        q=q,
        k=k,
        v=v,
        h=h,
        g=g,
        g_gamma=g_gamma,
        o=o,
        cu_seqlens=cu_seqlens,
        chunk_indices=chunk_indices,
        scale=scale,
        T=T,
        H=H,
        K=K,
        V=V,
        BT=BT,
        BK=BK,
        BV=BV
    )
    return o
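For reference, a hypothetical smoke test of chunk_fwd_o under assumed shapes (B=1, T=128, H=8, K=V=128, so NT=2 chunks of BT=64); it presumes an NPU or other Triton-capable device:

# q/k/v follow the (B, T, H, D) layout the wrapper unpacks; h holds one
# (K, V) state per chunk, i.e. (B, NT, H, K, V).
q = torch.randn(1, 128, 8, 128, dtype=torch.bfloat16, device='npu')
k = torch.randn_like(q)
v = torch.randn_like(q)
h = torch.randn(1, 2, 8, 128, 128, dtype=torch.bfloat16, device='npu')
o = chunk_fwd_o(q, k, v, h, chunk_size=64)
assert o.shape == v.shape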

p_dw = tl.make_block_ptr(dw, (T, K), (H*K, 1), (i_t * BT, i_k * BK), (BT, BK), (1, 0))
tl.store(p_dw, -b_dw.to(p_dw.dtype.element_ty), boundary_check=(0, 1))

tl.debug_barrier()

high

A tl.debug_barrier() is present here. This is typically used for debugging and should be removed from production code as it forces synchronization and can negatively impact performance.

@shiyuan680 shiyuan680 force-pushed the triton branch 3 times, most recently from fe2a876 to 315ec77 on November 12, 2025 09:39
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shiyuan680 shiyuan680 changed the title 【Draft】support triton chunk_gated_delta_rule ops 【OPS】qwen3-next support triton chunk_gated_delta_rule ops Nov 13, 2025
@shiyuan680 shiyuan680 force-pushed the triton branch 4 times, most recently from 352e435 to 7b4db5a on November 13, 2025 09:08
@shiyuan680 shiyuan680 force-pushed the triton branch 5 times, most recently from 48a8164 to abdaca1 on November 13, 2025 12:35
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 14, 2025
@shiyuan680 shiyuan680 force-pushed the triton branch 3 times, most recently from a4074c5 to a2c834e on November 18, 2025 11:04
assert last_recurrent_state.shape == (3, 8, 128, 128)


if __name__ == '__main__':
Collaborator

Remove these two lines.

Contributor Author

done

@pytest.fixture
def mock_moe_env():

    with patch("torch_npu.npu_moe_finalize_routing",
Collaborator

why patch it?

Contributor Author

done

import pytest
import torch

from tests.ut.base import PytestBase
Collaborator

Is this a UT or an e2e test? You created this file in the e2e module but import the UT base?

Contributor Author

This test uses triton_npu, and the UT environment does not install that package.

@@ -0,0 +1,51 @@
import unittest
Collaborator

If this is an e2e test, you should enable it in .github/workflows as well.

Contributor Author

This test uses triton_npu, and the UT environment does not install that package.

@@ -0,0 +1,226 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# SPDX-FileCopyrightText: Songlin Yang, Yu Zhang
Collaborator

So this file is copied from somewhere else? Where from? It would be better to add a link to the origin as well.

Contributor Author

Copied from the original vLLM file.

# The original source code was licensed under the MIT license and included
# the following copyright notice:
# Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
# ruff: noqa: E501
Collaborator

Why are the ruff and mypy checks skipped?

Contributor Author

The Triton files produce check errors; I found that the other Triton files in the project also skip these checks.

@shiyuan680 shiyuan680 force-pushed the triton branch 2 times, most recently from 110ea91 to f569a9f on November 20, 2025 02:15
@shiyuan680 shiyuan680 force-pushed the triton branch 3 times, most recently from 20d201e to 2fee3b1 on November 21, 2025 03:46
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


run: |
  . /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
-  python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl"
+  python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl"
Collaborator

triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27.whl

@wangxiyuan wangxiyuan merged commit 1c4a046 into vllm-project:main Nov 28, 2025
21 of 22 checks passed
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
…ct#4070)

### What this PR does / why we need it?
qwen3-next support triton chunk_gated_delta_rule ops

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…ct#4070)

### What this PR does / why we need it?
qwen3-next support triton chunk_gated_delta_rule ops

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <[email protected]>
Signed-off-by: Che Ruan <[email protected]>

Labels

module:ops, module:tests, ready (read for review), ready-for-test (start test by label for PR)
