@emlin emlin commented Nov 4, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2090

With feature score eviction, TBE calls the backend to update feature score metadata separately in the forward pass.
This update is designed to run asynchronously without blocking the forward/backward pass. However, the CPU-blocking operation blocked the main stream, so all2all could not start immediately after get_cuda.
From a dummy profile, we can see this trace:
{F1983224804}

The set metadata operation becomes a blocker on the critical path, taking 217ms.

With this change, we can see the trace is updated to:
{F1983224830}

Overall prefetch is reduced to less than 70ms, and get_cuda is now followed immediately by all2all, with no extra waiting or stream sync.
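
The description above does not include the code change itself. As a rough illustration of the general pattern only (moving a blocking CPU-side update off the caller's critical path), here is a minimal sketch. It is not FBGEMM's implementation: `set_feature_score_metadata`, `forward_blocking`, and `forward_async` are hypothetical names, the backend is a plain dict, and the slow work is simulated with `time.sleep`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def set_feature_score_metadata(store, ids, scores):
    """Hypothetical stand-in for a blocking backend call (not an FBGEMM API)."""
    time.sleep(0.2)  # simulate slow CPU-side work
    for i, s in zip(ids, scores):
        store[i] = s


def forward_blocking(store, ids, scores):
    """Runs the metadata update inline, so the caller waits for it to finish."""
    start = time.perf_counter()
    set_feature_score_metadata(store, ids, scores)
    return time.perf_counter() - start


def forward_async(executor, store, ids, scores):
    """Queues the metadata update on a worker thread and returns right away."""
    start = time.perf_counter()
    future = executor.submit(set_feature_score_metadata, store, ids, scores)
    return time.perf_counter() - start, future


store_a, store_b = {}, {}
ids, scores = [1, 2, 3], [0.5, 0.1, 0.9]

blocking_latency = forward_blocking(store_a, ids, scores)

with ThreadPoolExecutor(max_workers=1) as pool:
    async_latency, pending = forward_async(pool, store_b, ids, scores)
    # The caller is free to start other work here (all2all in the PR's case).
    pending.result()  # wait only at the point where the metadata is needed

print(f"blocking: {blocking_latency * 1e3:.1f} ms, async submit: {async_latency * 1e3:.1f} ms")
```

Both variants end with the same metadata in the store; the difference is only in how long the caller is held up. In a GPU setting the same idea would presumably also require keeping the update off the main CUDA stream, which this CPU-only sketch does not model.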

Differential Revision: D86013406


netlify bot commented Nov 4, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit dfeac9d
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/690a3953df5c7b000885a950
😎 Deploy Preview https://deploy-preview-5082--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Nov 4, 2025

meta-codesync bot commented Nov 4, 2025

@emlin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86013406.
