Conversation

@anon189Ty (Contributor) commented Dec 4, 2025

What this PR does / why we need it?

There is a new option, "enforce_eager", in vLLM's SpeculativeConfig (see vllm/vllm/config/speculative.py). When it is enabled, the speculative model runs only in eager mode, without affecting the main model. This PR lets Eagle in vllm-ascend honor this option.

Does this PR introduce any user-facing change?

If you want the eagle model to always run in eager mode, add `"enforce_eager": true` to the speculative_config in your server start command.
For example:
```
vllm serve /weights/Qwen3 \
  --speculative-config '{"enforce_eager": true, "method": "eagle3", "model": "/weights/Eagle-Qwen3", "num_speculative_tokens": 1}'
```

How was this patch tested?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully integrates the speculative_config.enforce_eager flag into the EagleProposer, ensuring that CUDA graph usage is correctly disabled when eager mode is enforced. Key changes include updating the use_cuda_graph condition, passing the device to AttentionMaskBuilder, and initializing aclgraph_runtime_mode and batch_descriptor appropriately. However, there are a couple of areas where the code can be improved for clarity and to remove redundancy.
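A minimal sketch of the gating logic this summary describes (the function and parameter names here are illustrative assumptions, not the exact vllm-ascend code):

```python
from enum import Enum

class CUDAGraphMode(Enum):
    NONE = 0
    FULL = 1

def compute_use_cuda_graph(capture_enabled: bool, enforce_eager: bool) -> bool:
    """Init-time decision for the drafter: graphs are used only if capture is
    enabled globally AND speculative_config.enforce_eager is not set."""
    return capture_enabled and not enforce_eager

def select_runtime_mode(use_cuda_graph: bool, num_tokens: int,
                        capture_sizes: tuple[int, ...]):
    """Per-step selection: replay a captured graph only when graphs are
    enabled and this batch size was captured; otherwise run eager (NONE)
    with no batch descriptor."""
    if use_cuda_graph and num_tokens in capture_sizes:
        return CUDAGraphMode.FULL, num_tokens
    return CUDAGraphMode.NONE, None
```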

```python
self.positions[:num_tokens] = target_positions.to(device)
self.hidden_states[:num_tokens] = target_hidden_states
attn_metadata.block_tables = block_table.to(device)
# Make sure the uspeculative_config.enforce_eager validated to be compatible with eagle full graph.
```

Severity: high

There is a typo in this comment. It says "uspeculative_config.enforce_eager" instead of "speculative_config.enforce_eager". Please correct it for better readability.

Suggested change:
```diff
-# Make sure the uspeculative_config.enforce_eager validated to be compatible with eagle full graph.
+# Make sure the speculative_config.enforce_eager validated to be compatible with eagle full graph.
```

Comment on lines +477 to +479
```python
if not self.use_cuda_graph:
    aclgraph_runtime_mode = CUDAGraphMode.NONE
    batch_descriptor = None
```

Severity: high

This block of code is redundant. aclgraph_runtime_mode is already initialized to CUDAGraphMode.NONE on line 462, and batch_descriptor is initialized to None on line 463. If self.use_cuda_graph is false, these variables would already hold their intended values. Removing this redundant check will make the code cleaner and easier to maintain.
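The reviewer's point can be illustrated with a minimal before/after sketch (variable names mirror the quoted snippet; the surrounding initialization is assumed from the review's description):

```python
from enum import Enum

class CUDAGraphMode(Enum):
    NONE = 0
    FULL = 1

def pick_mode_redundant(use_cuda_graph: bool, padded_size):
    # Defaults already cover the eager case.
    aclgraph_runtime_mode = CUDAGraphMode.NONE
    batch_descriptor = None
    if use_cuda_graph and padded_size is not None:
        aclgraph_runtime_mode = CUDAGraphMode.FULL
        batch_descriptor = padded_size
    if not use_cuda_graph:  # redundant: merely re-assigns the defaults above
        aclgraph_runtime_mode = CUDAGraphMode.NONE
        batch_descriptor = None
    return aclgraph_runtime_mode, batch_descriptor

def pick_mode_clean(use_cuda_graph: bool, padded_size):
    # Same behavior without the dead branch.
    aclgraph_runtime_mode = CUDAGraphMode.NONE
    batch_descriptor = None
    if use_cuda_graph and padded_size is not None:
        aclgraph_runtime_mode = CUDAGraphMode.FULL
        batch_descriptor = padded_size
    return aclgraph_runtime_mode, batch_descriptor
```

Both functions return identical results for every input, which is why the extra `if not self.use_cuda_graph:` block can simply be dropped.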

github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
