
Conversation

@xiaofeihan1 (Contributor)

Description

When both `is_packed_qkv_` and `do_rotary_` are set, call a new `SplitPackedQKVWithRotaryEmbedding` program that fuses `SplitPackedQKV` with `FusedQKRotaryEmbedding`.

The dispatch size is `B*S*N*work_per_head`, where `work_per_head = head_size - half_rotary_embedding_dim`, which equals `half_rotary_embedding_dim + need_copy_dim`.

  • For the first `half_rotary_embedding_dim` elements, split the packed QKV, apply rotary embedding to the corresponding q/k pairs, and store v directly.
  • For the remaining `need_copy_dim` elements, split the packed QKV and store q/k/v directly (see the sketch after this list).
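As a worked example of the dispatch arithmetic, with hypothetical values `head_size = 64` and `rotary_embedding_dim = 32`: `half_rotary_embedding_dim = 16` and `need_copy_dim = 64 - 32 = 32`, so `work_per_head = 64 - 16 = 16 + 32 = 48`, and the kernel runs `B*S*N*48` invocations.

Below is a minimal WGSL sketch of how such a fused kernel could index the work. It is illustrative only: the uniform and binding names, the assumed packed layout `[B, S, N, 3, head_size]`, the non-interleaved rotary pairing `(h, h + half_rotary_dim)`, and cos/sin caches of shape `[S, half_rotary_dim]` indexed by sequence position are all assumptions for this sketch, not the PR's exact shader.

```wgsl
// Illustrative sketch only; names, bindings, and layout are assumptions.
// Assumed packed layout: qkv[B, S, N, 3, H]; outputs q/k/v[B, S, N, H].
// work_per_head = half_rotary_dim + need_copy_dim = H - half_rotary_dim.

struct Uniforms {
  batch_size : u32,       // B
  sequence_length : u32,  // S
  num_heads : u32,        // N
  head_size : u32,        // H
  half_rotary_dim : u32,  // rotary_embedding_dim / 2
  work_per_head : u32,    // H - half_rotary_dim
}

@group(0) @binding(0) var<storage, read> packed_qkv : array<f32>;
@group(0) @binding(1) var<storage, read> cos_cache : array<f32>;  // [S, half_rotary_dim] assumed
@group(0) @binding(2) var<storage, read> sin_cache : array<f32>;  // [S, half_rotary_dim] assumed
@group(0) @binding(3) var<storage, read_write> q_out : array<f32>;
@group(0) @binding(4) var<storage, read_write> k_out : array<f32>;
@group(0) @binding(5) var<storage, read_write> v_out : array<f32>;
@group(0) @binding(6) var<uniform> uniforms : Uniforms;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let idx = gid.x;  // one invocation per element of B*S*N*work_per_head
  let total = uniforms.batch_size * uniforms.sequence_length
            * uniforms.num_heads * uniforms.work_per_head;
  if (idx >= total) { return; }

  let H = uniforms.head_size;
  let h = idx % uniforms.work_per_head;    // offset within this head's work range
  let bsn = idx / uniforms.work_per_head;  // flattened (b, s, n)
  let s = (bsn / uniforms.num_heads) % uniforms.sequence_length;

  let packed_base = bsn * 3u * H;  // start of this head's q slice in packed QKV
  let out_base = bsn * H;          // start of this head's slice in each output

  if (h < uniforms.half_rotary_dim) {
    // Rotary range: rotate the non-interleaved pair (h, h + half_rotary_dim)
    // of q and k, and copy the two matching v elements. The position id is
    // assumed to equal s (no past KV) for this sketch.
    let j = h + uniforms.half_rotary_dim;
    let c = cos_cache[s * uniforms.half_rotary_dim + h];
    let sn = sin_cache[s * uniforms.half_rotary_dim + h];

    let qx = packed_qkv[packed_base + h];
    let qy = packed_qkv[packed_base + j];
    q_out[out_base + h] = qx * c - qy * sn;
    q_out[out_base + j] = qx * sn + qy * c;

    let kx = packed_qkv[packed_base + H + h];
    let ky = packed_qkv[packed_base + H + j];
    k_out[out_base + h] = kx * c - ky * sn;
    k_out[out_base + j] = kx * sn + ky * c;

    v_out[out_base + h] = packed_qkv[packed_base + 2u * H + h];
    v_out[out_base + j] = packed_qkv[packed_base + 2u * H + j];
  } else {
    // Copy range: indices in [rotary_embedding_dim, H) are split out and
    // stored unchanged for q, k, and v.
    let d = h + uniforms.half_rotary_dim;
    q_out[out_base + d] = packed_qkv[packed_base + d];
    k_out[out_base + d] = packed_qkv[packed_base + H + d];
    v_out[out_base + d] = packed_qkv[packed_base + 2u * H + d];
  }
}
```

Note that each rotary-range invocation writes two output elements per tensor, so the two ranges together cover every element of q, k, and v exactly once; this is why `work_per_head` invocations per head suffice instead of `head_size`.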

Motivation and Context

On NV5080, the token generation speed improves by ~3%.

| Device | Before (tps) | After (tps) |
| --- | --- | --- |
| NV5080 | 129 | 133 |
| Intel | 15.4 | 15.5 |
| Mac | 69.0 | 71.0 |

@xiaofeihan1 xiaofeihan1 added the ep:WebGPU ort-web webgpu provider label Oct 30, 2025
@xiaofeihan1 xiaofeihan1 requested a review from qjia7 October 30, 2025 05:10
qjia7
qjia7 previously approved these changes Nov 7, 2025
@qjia7 qjia7 requested review from fs-eire and guschmue November 7, 2025 09:53
@xiaofeihan1 xiaofeihan1 merged commit cf8476b into microsoft:main Nov 11, 2025
91 of 92 checks passed