
Conversation

@shen-shanshan
Contributor

@shen-shanshan shen-shanshan commented Dec 2, 2025

Purpose

  1. Some modeling files call the apply_rotary_emb function directly with a pre-computed cos/sin cache, e.g. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen2_5_vl.py#L383-L385. Because this is a plain function rather than an op, plugins (e.g., vllm-ascend) cannot override it. By extracting it into a CustomOp, a plugin can simply subclass it, implement forward_oot(), and register the subclass to replace the op with a device-specific implementation for an out-of-tree (OOT) device (see the sketch after this list).
  2. There are several dispatch functions for apply_rotary_emb, which can confuse users and developers, such as https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L56-L70 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L73-L97. This PR unifies the separate dispatch logic into one CustomOp to make it clearer.
  3. Some helper functions, such as rotate_half() and apply_rotary_emb_torch(), were defined redundantly across modeling files. This PR removes the duplicated definitions and keeps a single copy in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py.
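
To make point 1 concrete, here is a minimal sketch of the intended extension pattern, assuming vLLM's CustomOp registration decorator. The registration name and the npu_apply_rotary_emb kernel are illustrative placeholders, not code from this PR:

from vllm.model_executor.custom_op import CustomOp
from vllm.model_executor.layers.rotary_embedding.common import (
    apply_rotary_emb_torch,
)


@CustomOp.register("apply_rotary_emb")  # registration name assumed
class ApplyRotaryEmb(CustomOp):
    def forward_native(self, x, cos, sin, is_neox_style: bool = True):
        # Pure-PyTorch fallback shared by all backends.
        return apply_rotary_emb_torch(x, cos, sin, is_neox_style)


# In an out-of-tree plugin such as vllm-ascend, the op can then be
# swapped out by subclassing and implementing forward_oot():
class AscendApplyRotaryEmb(ApplyRotaryEmb):
    def forward_oot(self, x, cos, sin, is_neox_style: bool = True):
        # npu_apply_rotary_emb stands in for a hypothetical NPU kernel.
        return npu_apply_rotary_emb(x, cos, sin, is_neox_style)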

Test Plan

I have tested this PR on Ascend NPU together with vllm-project/vllm-ascend#4667.

Run:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2 \
--enforce-eager

Test Result

Output:

{"id":"chatcmpl-9ab4de23690c85aa","object":"chat.completion","created":1764748509,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image reads \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the qwen Related to Qwen models label Dec 2, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully refactors the apply_rotary_emb functionality into a CustomOp class, unifying dispatch logic and removing redundant definitions across several files. This is a positive step towards better modularity and extensibility. However, there are a few critical issues that need to be addressed to ensure correctness and prevent runtime errors.

Collaborator

@ProExpertProg ProExpertProg left a comment


Why is this necessary - there's already a RotaryEmbedding custom op class?

@shen-shanshan
Contributor Author

Why is this necessary - there's already a RotaryEmbedding custom op class?

@ProExpertProg

These are the three reasons from the Purpose section above: apply_rotary_emb is currently a plain function (not an op), so plugins like vllm-ascend cannot override it; its dispatch logic is split across several functions in common.py; and helpers such as rotate_half() and apply_rotary_emb_torch() were defined redundantly across modeling files.

@shen-shanshan
Contributor Author

shen-shanshan commented Dec 3, 2025

Why is this necessary - there's already a RotaryEmbedding custom op class?

I have also tested this PR together with vllm-project/vllm-ascend#4667 on Ascend NPU.
Maybe we need to add the ready label to see whether this PR breaks other backends, such as CUDA, ROCm, ...

@robertgshaw2-redhat
Collaborator

@ProExpertProg - mind following up on this?

Collaborator

@ProExpertProg ProExpertProg left a comment


Nice dispatch cleanup, approving so that it's not blocked while I'm gone for the next 2 weeks but please address comments!!

query_rot = query[..., :rotary_dim]
query_pass = query[..., rotary_dim:]
- query_rot = apply_rotary_emb_torch(query_rot, cos, sin, is_neox_style)
+ query_rot = ApplyRotaryEmb.forward_static(
Collaborator


use self.apply_rotary_emb?

Contributor Author

@shen-shanshan shen-shanshan Dec 8, 2025


I suppose self.apply_rotary_emb cannot be called from this static method.
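
For context, forward_static is a @staticmethod, so no instance is available inside it; a minimal sketch of the constraint (signature assumed from the diff above):

class ApplyRotaryEmb(CustomOp):
    @staticmethod
    def forward_static(x, cos, sin, is_neox_style):
        # A @staticmethod receives no self, so instance attributes
        # such as self.apply_rotary_emb are unreachable here.
        ...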

)


def _apply_rotary_emb_torch(
Collaborator


Can this just live in ApplyRotaryEmb.forward_static?

Contributor Author


Good idea!

# If torch.compile is not enabled, use the rotary embedding kernel from
# the flash_attn package; otherwise, use the naive PyTorch implementation,
# which is faster when torch.compile is enabled.
if not torch.compiler.is_compiling():
Collaborator


For follow-up: can we change this so the dispatch happens inside __init__ (check compilation_config.mode != CompilationMode.NONE)?
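
A rough sketch of that follow-up idea; the import path of get_current_vllm_config is an assumption, and CompilationMode.NONE follows the naming in the comment above:

from vllm.config import CompilationMode, get_current_vllm_config

def __init__(self) -> None:
    super().__init__()
    compilation_config = get_current_vllm_config().compilation_config
    # Choose the implementation once at construction time instead of
    # calling torch.compiler.is_compiling() on every forward pass.
    self.use_flash_attn = compilation_config.mode == CompilationMode.NONE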

Collaborator


Actually, we should just always use FA in forward_hip, as forward_native is used by default anyway when torch Inductor is used.

Contributor Author


Yeah, you are right. When using graph mode and backend = inductor, CustomOp will be disabled by default, and forward_hip won't be called in this case.
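
A hedged sketch of what "always use FA in forward_hip" could look like, assuming flash_attn's rotary kernel and its interleaved flag (not code from this PR):

def forward_hip(self, x, cos, sin, is_neox_style: bool = True):
    from flash_attn.layers.rotary import apply_rotary_emb
    # Use the flash_attn kernel unconditionally on ROCm: when Inductor
    # compilation is enabled, CustomOp dispatch falls back to
    # forward_native anyway, so no is_compiling() check is needed here.
    return apply_rotary_emb(x, cos, sin, interleaved=not is_neox_style)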

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 5, 2025
@shen-shanshan
Contributor Author

shen-shanshan commented Dec 6, 2025

Nice dispatch cleanup, approving so that it's not blocked while I'm gone for the next 2 weeks but please address comments!!

Thanks a lot for your review. I will address the comments and fix the CI errors soon.

@shen-shanshan shen-shanshan changed the title [CustomOp] Extract apply_rotary_emb as CustomOp and unify the dispatch logic [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic Dec 9, 2025
