-
Notifications
You must be signed in to change notification settings - Fork 624
[Feat] Eagle adapt vllm's speculative_config.enforce_eager #4698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: anon189Ty <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request successfully integrates the speculative_config.enforce_eager flag into the EagleProposer, ensuring that CUDA graph usage is correctly disabled when eager mode is enforced. Key changes include updating the use_cuda_graph condition, passing the device to AttentionMaskBuilder, and initializing aclgraph_runtime_mode and batch_descriptor appropriately. However, there are a couple of areas where the code can be improved for clarity and to remove redundancy.
| self.positions[:num_tokens] = target_positions.to(device) | ||
| self.hidden_states[:num_tokens] = target_hidden_states | ||
| attn_metadata.block_tables = block_table.to(device) | ||
| # Make sure the uspeculative_config.enforce_eager validated to be compatible with eagle full graph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a typo in this comment. It says "uspeculative_config.enforce_eager" instead of "speculative_config.enforce_eager". Please correct it for better readability.
| # Make sure the uspeculative_config.enforce_eager validated to be compatible with eagle full graph. | |
| # Make sure the speculative_config.enforce_eager validated to be compatible with eagle full graph. |
| if not self.use_cuda_graph: | ||
| aclgraph_runtime_mode = CUDAGraphMode.NONE | ||
| batch_descriptor = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code is redundant. aclgraph_runtime_mode is already initialized to CUDAGraphMode.NONE on line 462, and batch_descriptor is initialized to None on line 463. If self.use_cuda_graph is false, these variables would already hold their intended values. Removing this redundant check will make the code cleaner and easier to maintain.
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
What this PR does / why we need it?
There is a new command "enforce_eager" in vllm's SpeculativeConfig. When it open, spec model should only run in eager mode and no impact the main model. This PR will let the eagle in vllm-ascend can use this command.
You can see this command in vllm/vllm/config/speculative.py
Does this PR introduce any user-facing change?
If you want to make eagle model always run in eager mode, you should add the '"enforce_eager": true' command when you set the speculative_config in your start sever command.
For example:
"""
vllm server /weights/Qwen3
--speculative-config '{"enforce_eager": true, "method": "eagle3", "model": "/weights/Eagle-Qwen3", "num_speculative_tpkens": 1}'
"""
How was this patch tested?