[WIP][Feat] Eagle Proposer support FULL_DECODE_ONLY graph mode
#4530
base: main
Conversation
Signed-off-by: Yizhou Liu <[email protected]>
…raph capture and execution for the draft model's forward pass to improve performance. Signed-off-by: Yizhou Liu <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
is_draft_model=True):
    forward_context = get_forward_context()
    self.model(
        input_ids=self.input_ids[:num_tokens],
It is recommended to add length validation for slicing operations to prevent out-of-bounds errors, or to state that there is no risk of out-of-bounds access.
As you can see in vLLM's GPUModelRunner, we slice like this without further validation, since variables like self.input_ids are designed as persistent buffers for graph mode. If you look at the initialization, you'll see the buffer is allocated with max_tokens, so the slice cannot go out of bounds. But you are right that we may need to clean this up later; if you are able to step in, that would be much appreciated.
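For illustration, here is a minimal sketch of the persistent-buffer pattern described above. The class and parameter names (RunnerBuffersSketch, max_num_tokens) are illustrative stand-ins, not the actual vLLM or vllm-ascend code:

```python
import torch

class RunnerBuffersSketch:
    def __init__(self, max_num_tokens: int):
        # Allocated once at maximum capacity so that graph capture and
        # replay can reuse the same storage across steps.
        self.input_ids = torch.zeros(max_num_tokens, dtype=torch.int64)

    def prepare_input_ids(self, token_ids: torch.Tensor) -> torch.Tensor:
        num_tokens = token_ids.shape[0]
        # In-bounds by construction: the scheduler never batches more than
        # max_num_tokens tokens, so no extra length validation is needed.
        self.input_ids[:num_tokens].copy_(token_ids)
        return self.input_ids[:num_tokens]

# Usage:
buffers = RunnerBuffersSketch(max_num_tokens=16)
out = buffers.prepare_input_ids(torch.tensor([1, 2, 3]))
print(out)  # tensor([1, 2, 3])
```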
for layer_name in [self.attn_layer_name]:
    attn_metadata[layer_name] = attn_metadata_mtp
for i in range(self.num_speculative_tokens):
    if i > 0:
Here, since i > 0, aclgraph_runtime_mode has to be forcibly modified. It is recommended to add a comment explaining this, for better maintainability.
This logic is flawed, and I will deprecate it as soon as possible once we have a new design.
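For illustration, a sketch of the kind of comment suggested above, applied to a simplified stand-in for the loop. The enum and the forward-context field are assumptions modeled on the snippet, and per the reply above this logic is a temporary workaround slated for redesign:

```python
from dataclasses import dataclass
from enum import Enum

class AclGraphMode(Enum):  # hypothetical stand-in for the real runtime-mode enum
    FULL_DECODE_ONLY = 1
    NONE = 2

@dataclass
class ForwardContext:  # hypothetical stand-in for vLLM's forward context
    aclgraph_runtime_mode: AclGraphMode = AclGraphMode.FULL_DECODE_ONLY

def run_draft_steps(ctx: ForwardContext, num_speculative_tokens: int) -> list:
    modes = []
    for i in range(num_speculative_tokens):
        if i > 0:
            # Steps after the first cannot reuse the captured graph, so the
            # runtime mode is forcibly overridden here. Temporary workaround,
            # to be removed once the new design lands (see discussion above).
            ctx.aclgraph_runtime_mode = AclGraphMode.NONE
        modes.append(ctx.aclgraph_runtime_mode)  # draft forward would run here
    return modes

# First step keeps the captured-graph mode; later steps fall back to NONE.
print(run_draft_steps(ForwardContext(), 3))
```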
self.runner.input_batch.
num_computed_tokens_cpu_tensor[:num_reqs])
common_attn_metadata = AscendCommonAttentionMetadata(
    query_start_loc=self.runner.query_start_loc[:num_reqs + 1],
Please check whether num_reqs can go out of bounds here; if so, add validation.
Again, no need to worry here; it's the same situation as the previous question.
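For illustration, a hypothetical sketch of the defensive check requested above. As the reply notes, query_start_loc is a persistent buffer sized for the maximum request count, so such an assertion would document the invariant rather than guard a real failure mode; all names here are illustrative:

```python
import torch

def slice_query_start_loc(query_start_loc: torch.Tensor,
                          num_reqs: int) -> torch.Tensor:
    # num_reqs + 1 entries are needed: one start offset per request plus
    # the final end offset.
    assert num_reqs + 1 <= query_start_loc.shape[0], (
        f"num_reqs={num_reqs} exceeds buffer capacity "
        f"{query_start_loc.shape[0] - 1}")
    return query_start_loc[:num_reqs + 1]

# Usage: buffer allocated once for up to max_num_reqs requests.
max_num_reqs = 8
buf = torch.zeros(max_num_reqs + 1, dtype=torch.int32)
print(slice_query_start_loc(buf, num_reqs=3).shape)  # torch.Size([4])
```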
@Sparkheart Thank you for your review. This is a highly unstable version; I drafted it only so that others have something to start with. We will address the remaining issues in the next few days.
What this PR does / why we need it?
WIP
Does this PR introduce any user-facing change?
None
How was this patch tested?
None