[Spec Decode][Model] Add qwen2-eagle #24187
base: main
Conversation
Code Review
This pull request adds support for qwen2-eagle models in speculative decoding. The key changes involve normalizing EAGLE model architecture names, accommodating different KV head counts between target and draft models by aligning the KV cache space, and implementing the EagleQwen2ForCausalLM model. The approach is sound and the implementation aligns well with the project's existing patterns. I have one high-severity suggestion to improve error handling, making it more specific to avoid masking potential bugs.
vllm/config/__init__.py
Outdated
The except Exception: block is overly broad and can hide important errors, such as a misconfigured model path or typos in model configuration attributes. This makes debugging difficult because the root cause is suppressed, and the user only sees a generic warning. It is better to catch more specific exceptions that are expected during configuration loading (e.g., OSError, ValueError) and to log the actual exception for better diagnostics. This ensures that unexpected errors are not silenced.
- except Exception:
-     # If we can't determine, assume they're different for safety
-     logger.warning(
-         "Could not determine KV heads compatibility for EAGLE")
-     return True
+ except (OSError, ValueError) as e:
+     # If we can't determine, assume they're different for safety
+     logger.warning(
+         "Could not determine KV heads compatibility for EAGLE due to "
+         "an error when loading the draft model config: %s. "
+         "Assuming they are different for safety.", e)
+     return True
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wwl2755 <[email protected]>
  speculative_config=speculative_config,
  disable_log_stats=False,
- max_model_len=16384,
+ max_model_len=8192,
Is this change intended?
I can revert this right before merging. I just lowered it to avoid OOM on 40GB HBM (as I tested locally).
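For context, the per-sequence KV cache footprint grows linearly with max_model_len, so halving it noticeably reduces memory pressure. A back-of-the-envelope sketch with Qwen2.5-7B-like numbers (illustrative assumptions only, not values taken from this PR):

```python
# Back-of-the-envelope per-sequence KV cache footprint (illustrative only).
# The layer/head/dim numbers below are Qwen2.5-7B-like assumptions.
def kv_cache_bytes(max_model_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 accounts for storing both keys and values.
    return 2 * max_model_len * num_layers * num_kv_heads * head_dim * dtype_bytes


for seq_len in (16384, 8192):
    mib = kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4, head_dim=128) / 2**20
    print(f"max_model_len={seq_len}: ~{mib:.0f} MiB of KV cache per full-length sequence")
```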
logger = init_logger(__name__)


class Qwen2DecoderLayer(Qwen2DecoderLayer):
To avoid confusion, we should use a different name for this class
Agreed. I defined it this way to keep it consistent with the existing llama_eagle. Do you think we should update those as well? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama_eagle.py#L27
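For illustration, a minimal sketch of the rename being discussed (the new class name is hypothetical, not taken from this PR):

```python
# Hypothetical rename: keep the thin subclass but give it a distinct name so it
# no longer shadows the parent class in tracebacks, imports, or registry lookups.
from vllm.model_executor.models.qwen2 import Qwen2DecoderLayer


class EagleQwen2DecoderLayer(Qwen2DecoderLayer):
    """Decoder layer for the EAGLE draft head.

    Currently identical to Qwen2DecoderLayer; the distinct name only avoids
    shadowing the parent class.
    """
```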
@luccafong can you help review this in more detail? Thanks
Documentation preview: https://vllm--24187.org.readthedocs.build/en/24187/
Purpose
Add qwen2-eagle (and qwen2.5-eagle) support.
According to Slack, Qwen2 is not supported yet because the target and draft models have different numbers of KV heads. This PR addresses that by aligning the KV cache space with the larger of the two (though this can waste some space).
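To illustrate the alignment idea, a minimal sketch (hypothetical helper, not the PR's actual code):

```python
# Hypothetical helper illustrating the alignment idea: when the draft and target
# models disagree on the number of KV heads, size the shared KV cache blocks for
# the larger of the two, accepting some wasted space for the smaller model.
def aligned_num_kv_heads(target_num_kv_heads: int, draft_num_kv_heads: int) -> int:
    return max(target_num_kv_heads, draft_num_kv_heads)


# Illustrative numbers only: a target with 4 KV heads and a draft with 8
# would both get cache blocks sized for 8 heads.
assert aligned_num_kv_heads(4, 8) == 8
```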
Another thing this PR addresses is that some EAGLE models place "Eagle" at the end of the architecture name (e.g. Qwen2ForCausalLMEagle), which prevents the registry from recognizing them.
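A sketch of that normalization (hypothetical helper name; the actual implementation in this PR may differ):

```python
# Hypothetical helper: map architecture strings that append "Eagle"
# (e.g. "Qwen2ForCausalLMEagle") onto the "Eagle<...>" form the registry expects.
def normalize_eagle_architecture(arch: str) -> str:
    if arch.endswith("Eagle") and not arch.startswith("Eagle"):
        return "Eagle" + arch[: -len("Eagle")]
    return arch


assert normalize_eagle_architecture("Qwen2ForCausalLMEagle") == "EagleQwen2ForCausalLM"
assert normalize_eagle_architecture("EagleLlamaForCausalLM") == "EagleLlamaForCausalLM"
```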
This PR includes an important fix for test_spec_decode.py by @ekagra-ranjan in #23461. I can either wait that PR to be merged first or add @ekagra-ranjan as co-author in respect of his fixing.#23461 (#24257) is mergedcc: @WoosukKwon @LiuXiaoxuanPKU @mgoin
Test
Test Result
Detailed results
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.