[WIP][Feat] Eagle Proposer support FULL_DECODE_ONLY graph mode
#4530
base: main
Conversation
Signed-off-by: Yizhou Liu <[email protected]>
…raph capture and execution for the draft model's forward pass to improve performance. Signed-off-by: Yizhou Liu <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
is_draft_model=True):
    forward_context = get_forward_context()
    self.model(
        input_ids=self.input_ids[:num_tokens],
It is recommended to add length validation for slicing operations to prevent out-of-bounds errors, or to state that there is no risk of out-of-bounds access.
As you can see in vLLM's GPUModelRunner, we slice like this without further validation, since variables like self.input_ids are designed as persistent buffers for graph mode. If you look at the initialization, you'll see the buffer is allocated with max_tokens, so the slice cannot go out of bounds. But you are right that we may need to clean this up later; if you are able to step in, that would be much appreciated.
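For illustration, here is a minimal sketch of the persistent-buffer pattern described above. The class and parameter names (RunnerBuffersSketch, max_num_tokens) are illustrative stand-ins, not the actual vLLM or vllm-ascend code:

```python
import torch

class RunnerBuffersSketch:
    def __init__(self, max_num_tokens: int):
        # Allocated once at maximum capacity so that graph capture and
        # replay can reuse the same storage across steps.
        self.input_ids = torch.zeros(max_num_tokens, dtype=torch.int64)

    def prepare_input_ids(self, token_ids: torch.Tensor) -> torch.Tensor:
        num_tokens = token_ids.shape[0]
        # In-bounds by construction: the scheduler never batches more than
        # max_num_tokens tokens, so no extra length validation is needed.
        self.input_ids[:num_tokens].copy_(token_ids)
        return self.input_ids[:num_tokens]

# Usage:
buffers = RunnerBuffersSketch(max_num_tokens=16)
out = buffers.prepare_input_ids(torch.tensor([1, 2, 3]))
print(out)  # tensor([1, 2, 3])
```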
for layer_name in [self.attn_layer_name]:
    attn_metadata[layer_name] = attn_metadata_mtp
for i in range(self.num_speculative_tokens):
    if i > 0:
Here, since i > 0, aclgraph_runtime_mode has to be forcibly modified. It is recommended to add a comment explaining this, for better maintainability.
This logic is flawed, and I will deprecate it as soon as possible once we have a new design.
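For illustration, a sketch of the kind of comment suggested above, applied to a simplified stand-in for the loop. The enum and the forward-context field are assumptions modeled on the snippet, and per the reply above this logic is a temporary workaround slated for redesign:

```python
from dataclasses import dataclass
from enum import Enum

class AclGraphMode(Enum):  # hypothetical stand-in for the real runtime-mode enum
    FULL_DECODE_ONLY = 1
    NONE = 2

@dataclass
class ForwardContext:  # hypothetical stand-in for vLLM's forward context
    aclgraph_runtime_mode: AclGraphMode = AclGraphMode.FULL_DECODE_ONLY

def run_draft_steps(ctx: ForwardContext, num_speculative_tokens: int) -> list:
    modes = []
    for i in range(num_speculative_tokens):
        if i > 0:
            # Steps after the first cannot reuse the captured graph, so the
            # runtime mode is forcibly overridden here. Temporary workaround,
            # to be removed once the new design lands (see discussion above).
            ctx.aclgraph_runtime_mode = AclGraphMode.NONE
        modes.append(ctx.aclgraph_runtime_mode)  # draft forward would run here
    return modes

# First step keeps the captured-graph mode; later steps fall back to NONE.
print(run_draft_steps(ForwardContext(), 3))
```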
self.runner.input_batch.
num_computed_tokens_cpu_tensor[:num_reqs])
common_attn_metadata = AscendCommonAttentionMetadata(
    query_start_loc=self.runner.query_start_loc[:num_reqs + 1],
Please check whether num_reqs can go out of bounds here; if so, add validation.
Again, no need to worry here; it's the same situation as the previous question.
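For illustration, a hypothetical sketch of the defensive check requested above. As the reply notes, query_start_loc is a persistent buffer sized for the maximum request count, so such an assertion would document the invariant rather than guard a real failure mode; all names here are illustrative:

```python
import torch

def slice_query_start_loc(query_start_loc: torch.Tensor,
                          num_reqs: int) -> torch.Tensor:
    # num_reqs + 1 entries are needed: one start offset per request plus
    # the final end offset.
    assert num_reqs + 1 <= query_start_loc.shape[0], (
        f"num_reqs={num_reqs} exceeds buffer capacity "
        f"{query_start_loc.shape[0] - 1}")
    return query_start_loc[:num_reqs + 1]

# Usage: buffer allocated once for up to max_num_reqs requests.
max_num_reqs = 8
buf = torch.zeros(max_num_reqs + 1, dtype=torch.int32)
print(slice_query_start_loc(buf, num_reqs=3).shape)  # torch.Size([4])
```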
@Sparkheart Thank you for your review. This is a highly unstable version; I drafted it only so that others have something to start with. We will address the remaining issues in the next few days.
What this PR does / why we need it?
WIP
Does this PR introduce any user-facing change?
None
How was this patch tested?
None