Conversation

@whx-sjtu (Collaborator) commented Dec 7, 2025

What this PR does / why we need it?

This PR adds back paged attention (pa) for small-batch-size scenarios for performance reasons. We will remove pa once fia performs better than pa in all scenarios.
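
For illustration, a minimal sketch of what such a batch-size-based dispatch could look like; the threshold value and helper name below are assumptions for illustration, not taken from this PR:

    # Hypothetical sketch: SMALL_BATCH_THRESHOLD and select_attention_impl are
    # illustrative assumptions, not the actual vllm-ascend implementation.
    SMALL_BATCH_THRESHOLD = 16

    def select_attention_impl(num_tokens: int) -> str:
        # Use paged attention (pa) for small batches, where it currently
        # performs better; otherwise use fia.
        if num_tokens <= SMALL_BATCH_THRESHOLD:
            return "pa"
        return "fia"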

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI passed with existing tests.

@whx-sjtu whx-sjtu requested a review from yiz-liu December 7, 2025 10:53
github-actions bot commented Dec 7, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@whx-sjtu whx-sjtu requested a review from weijinqian0 December 7, 2025 10:53
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a temporary performance optimization by using paged attention for small batch sizes during graph capture. The changes are logical and well-contained. I've identified one critical issue in the implementation of the new attention function that could lead to runtime errors under certain conditions. My feedback includes a code suggestion to resolve this issue.

Comment on lines +510 to +551
if forward_context.capturing:
# Get workspace from cache or calculate it if not present.
workspace = graph_params.workspaces.get(num_tokens)
if workspace is None:
workspace = torch_npu._npu_paged_attention_get_workspace(
query=query,
key_cache=self.key_cache,
value_cache=self.value_cache,
num_kv_heads=self.num_kv_heads,
num_heads=self.num_heads,
scale_value=self.scale,
block_table=attn_metadata.block_tables,
context_lens=attn_metadata.seq_lens,
out=output)
update_graph_params_workspaces(num_tokens,
weak_ref_tensors(workspace))

# Handle graph capturing mode
stream = torch_npu.npu.current_stream()

event = torch.npu.ExternalEvent()
event.wait(stream)
event.reset(stream)
graph_params.events[num_tokens].append(event)
graph_params.attn_params[num_tokens].append((
weak_ref_tensors(query),
weak_ref_tensors(self.key_cache),
weak_ref_tensors(self.value_cache),
self.num_kv_heads,
self.num_heads,
self.scale,
attn_metadata.block_tables,
attn_metadata.seq_lens,
weak_ref_tensors(output),
))

torch.npu.graph_task_group_begin(stream)
torch_npu._npu_paged_attention(
query=query,
key_cache=self.key_cache,
value_cache=self.value_cache,
num_kv_heads=self.num_kv_heads,
num_heads=self.num_heads,
scale_value=self.scale,
block_table=attn_metadata.block_tables,
context_lens=attn_metadata.seq_lens,
out=output,
workspace=workspace)
handle = torch.npu.graph_task_group_end(stream)
graph_params.handles[num_tokens].append(handle)
return output

critical

The if forward_context.capturing: check is problematic. If this function is ever called in a non-capturing context, it will implicitly return None, but the caller expects a torch.Tensor. This would cause a runtime error.

The call site in forward_impl already ensures this function is only called when forward_context.capturing is true, making this check redundant. For consistency with the similar full_graph_attention function and to prevent potential bugs, this check should be removed. The function's logic should assume it is running in a capturing context.

        # Get workspace from cache or calculate it if not present.
        workspace = graph_params.workspaces.get(num_tokens)
        if workspace is None:
            workspace = torch_npu._npu_paged_attention_get_workspace(
                query=query,
                key_cache=self.key_cache,
                value_cache=self.value_cache,
                num_kv_heads=self.num_kv_heads,
                num_heads=self.num_heads,
                scale_value=self.scale,
                block_table=attn_metadata.block_tables,
                context_lens=attn_metadata.seq_lens,
                out=output)
            update_graph_params_workspaces(num_tokens,
                                           weak_ref_tensors(workspace))

        # Handle graph capturing mode
        stream = torch_npu.npu.current_stream()

        event = torch.npu.ExternalEvent()
        event.wait(stream)
        event.reset(stream)
        graph_params.events[num_tokens].append(event)
        graph_params.attn_params[num_tokens].append((
            weak_ref_tensors(query),
            weak_ref_tensors(self.key_cache),
            weak_ref_tensors(self.value_cache),
            self.num_kv_heads,
            self.num_heads,
            self.scale,
            attn_metadata.block_tables,
            attn_metadata.seq_lens,
            weak_ref_tensors(output),
        ))

        torch.npu.graph_task_group_begin(stream)
        torch_npu._npu_paged_attention(
            query=query,
            key_cache=self.key_cache,
            value_cache=self.value_cache,
            num_kv_heads=self.num_kv_heads,
            num_heads=self.num_heads,
            scale_value=self.scale,
            block_table=attn_metadata.block_tables,
            context_lens=attn_metadata.seq_lens,
            out=output,
            workspace=workspace)
        handle = torch.npu.graph_task_group_end(stream)
        graph_params.handles[num_tokens].append(handle)
        return output
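
For reference, the call-site guard the reviewer refers to could look roughly like the sketch below; the helper name _paged_attention_capture is a hypothetical stand-in, not the actual vllm-ascend code, and only forward_impl and forward_context.capturing are taken from the review above:

    # Hypothetical sketch of the dispatch in the caller.
    def forward_impl(self, query, attn_metadata, output, forward_context):
        # Only the capturing branch reaches the paged-attention capture
        # helper, so that helper can assume graph-capture mode.
        if forward_context.capturing:
            return self._paged_attention_capture(query, attn_metadata, output)
        # Non-capturing (eager) attention path, elided here.
        ...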

@whx-sjtu whx-sjtu changed the title [Attention] Temporarily add back pa in small batch sizes. [Attention] Temporarily add back pa for small batch sizes. Dec 7, 2025
@yiz-liu (Collaborator) left a comment

One last question, when will we finally remove this PA?

@whx-sjtu (Collaborator, Author) commented:

One last question, when will we finally remove this PA?

After 1230, fia will support flash decoding. Then all GQA models will be performance-tested across different scenarios with the new fia. If the results show no performance problems, I will finally remove pa.

github-actions bot commented:

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@wangxiyuan (Collaborator) left a comment

@whx-sjtu whx-sjtu force-pushed the add_back_pa_main branch 4 times, most recently from fb032fb to b44b7d7 Compare December 15, 2025 06:53
@weijinqian0 weijinqian0 merged commit a962585 into vllm-project:main Dec 15, 2025
12 of 14 checks passed
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
…ect#4765)

### What this PR does / why we need it?
This PR adds back pa in scenarios of small batch sizes due to
performance consideration. Will remove pa once fia performs better than
pa in all scenarios.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: whx-sjtu <[email protected]>
Co-authored-by: weijinqian0 <[email protected]>

Labels

ready (read for review), ready-for-test (start test by label for PR)

4 participants