Conversation

@dabeschte commented Jan 29, 2025

The original causal attention mask generation is very slow, especially when the tensor is created on the GPU, because it issues thousands of individual tensor operations.

I tried compiling it, which works and makes it fast too, but compilation itself unfortunately takes a long time for long sequence lengths.

This implementation is ~20-70x faster depending on the sequence length. Since the mask is re-created for every SDPA call, that overhead adds up to multiple seconds per step for larger videos.
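For illustration, a minimal sketch of the idea (the exact mask layout in the repository may differ): a Python for loop that fills the mask row by row, launching one small GPU op per iteration, is replaced by a single broadcast comparison that builds the whole mask in one kernel launch. The function names below are hypothetical.

```python
import torch

def causal_mask_loop(seq_len: int, device: str = "cuda") -> torch.Tensor:
    # Slow reference: one small GPU op per row -> thousands of kernel launches.
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool, device=device)
    for i in range(seq_len):
        mask[i, : i + 1] = True
    return mask

def causal_mask_vectorized(seq_len: int, device: str = "cuda") -> torch.Tensor:
    # Fast version: a single broadcast comparison builds the whole mask at once.
    idx = torch.arange(seq_len, device=device)
    return idx[None, :] <= idx[:, None]

# Both produce identical masks; the vectorized one avoids the Python loop entirely.
assert torch.equal(causal_mask_loop(128, "cpu"), causal_mask_vectorized(128, "cpu"))
```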

@dabeschte dabeschte changed the title replaced for loop over tensor with pytorch tensor ops [causal attn mask] replaced for loop over tensor with pytorch tensor ops Jan 29, 2025