
Conversation

@yiz-liu (Collaborator) commented Nov 28, 2025

What this PR does / why we need it?

WIP

Does this PR introduce any user-facing change?

None

How was this patch tested?

None

…raph capture and execution for the draft model's forward pass to improve performance.

Signed-off-by: Yizhou Liu <[email protected]>
@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions bot commented Dec 1, 2025

This pull request has conflicts; please resolve them so we can evaluate the pull request.

    is_draft_model=True):
        forward_context = get_forward_context()
        self.model(
            input_ids=self.input_ids[:num_tokens],


It is recommended to add length validation for slicing operations to prevent out-of-bounds errors, or to state that there is no risk of out-of-bounds access.

@yiz-liu (Collaborator, Author) replied:


As you can see in vLLM (GPUModelRunner), we slice like this as well, without further checks, since variables like self.input_ids are designed as persistent buffers for graph mode. Take a look at the initialization and you'll see the buffer is allocated with max_tokens. But you are right, we may need to clean this up later; if you are able to step in, that'll be much appreciated.
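
For context, a minimal sketch of why the slice is safe, assuming the buffer is pre-allocated for the worst case as in vLLM's GPUModelRunner (the names RunnerBuffersSketch, max_num_tokens, and prepare_input_ids here are illustrative, not the actual code):

    import torch

    class RunnerBuffersSketch:
        def __init__(self, max_num_tokens: int, device: torch.device):
            # Persistent buffer sized for the worst case, so graph capture
            # and replay always see the same underlying storage.
            self.input_ids = torch.zeros(max_num_tokens,
                                         dtype=torch.int32,
                                         device=device)

        def prepare_input_ids(self, num_tokens: int) -> torch.Tensor:
            # The scheduler guarantees num_tokens <= max_num_tokens, so
            # this slice cannot go out of bounds.
            return self.input_ids[:num_tokens]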

    for layer_name in [self.attn_layer_name]:
        attn_metadata[layer_name] = attn_metadata_mtp
    for i in range(self.num_speculative_tokens):
        if i > 0:


Since the i > 0 branch forcibly overrides aclgraph_runtime_mode here, it is recommended to add a comment explaining this for better maintainability.

@yiz-liu (Collaborator, Author) replied:


This logic is flawed, and I will deprecate it as soon as we have a new design.
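
The loop under discussion is roughly shaped like this (a sketch, not the actual diff; the GraphMode enum and run_step callback are stand-ins for the real runtime-mode type and per-step draft forward call):

    from enum import Enum

    class GraphMode(Enum):
        # Stand-in for the real ACL graph runtime-mode enum.
        FULL = "full"
        NONE = "none"

    def run_draft_steps(forward_context, num_speculative_tokens, run_step):
        # Only step 0 matches the shapes the ACL graph was captured with;
        # later speculative steps override the runtime mode so they
        # execute eagerly instead of replaying a mismatched graph.
        for i in range(num_speculative_tokens):
            if i > 0:
                forward_context.aclgraph_runtime_mode = GraphMode.NONE
            run_step(i)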

        self.runner.input_batch.num_computed_tokens_cpu_tensor[:num_reqs])
    common_attn_metadata = AscendCommonAttentionMetadata(
        query_start_loc=self.runner.query_start_loc[:num_reqs + 1],


Check whether num_reqs can go out of bounds here; if so, add validation.

@yiz-liu (Collaborator, Author) replied:


Again, no need to worry here; this is the same situation as the previous question.
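
The same persistent-buffer reasoning applies: a sketch assuming query_start_loc is pre-allocated for the maximum request count (QueryStartLocSketch and max_num_reqs are illustrative names):

    import torch

    class QueryStartLocSketch:
        def __init__(self, max_num_reqs: int, device: torch.device):
            # One extra slot: query_start_loc holds prefix sums, so
            # num_reqs requests need num_reqs + 1 boundary entries.
            self.query_start_loc = torch.zeros(max_num_reqs + 1,
                                               dtype=torch.int32,
                                               device=device)

        def slice_for(self, num_reqs: int) -> torch.Tensor:
            # The input batch guarantees num_reqs <= max_num_reqs, so
            # [:num_reqs + 1] stays within the allocation.
            return self.query_start_loc[:num_reqs + 1]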

@yiz-liu (Collaborator, Author) commented Dec 4, 2025

@Sparkheart Thank you for your review. This is a highly unstable version; I only drafted it so that others have something to start with. We may address the remaining issues in the next few days.
