
Conversation

@DuyguA (Contributor) commented Nov 27, 2025

I made some changes to the T5 modeling file to support the new attention interface. I also rearranged things a bit so that position_bias is incorporated correctly into the attention mask.

Fixes #26350

A note though: I ran make fix-copies, but it broke several related models such as longt5 and mt5. Somehow the fix script didn't copy over the imports and couldn't grab the attention code correctly, so I skipped that part. If applicable, we can merge this PR and I can work on the related models in another PR, or I'm happy to take some hints to make the script work properly.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

@ArthurZucker @Cyrilvallez @vasqu

@vasqu (Contributor) left a comment

Sorry to be so strict about this, but T5 is not a good candidate for flash attention / SDPA. The reason is that the relative attention bias has to be modeled there, and as of now that is not possible with base flash attention (it might be possible with SDPA, but that needs proper mask preparation). tl;dr: it will only support eager attention in the end.

We can still refactor this to have the attention interface-like implementation, but only for eager in the end (i.e. _supports_sdpa/flash_attn remain False). Wdyt?
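
Concretely, that restriction boils down to the class-level capability flags; a rough sketch of the idea on the pretrained base class (names as used on other transformers models, not this PR's exact diff):

# Sketch only: keep T5 on the eager path by leaving the SDPA/FA2 capability flags off.
from transformers import PreTrainedModel, T5Config

class T5PreTrainedModel(PreTrainedModel):
    config: T5Config
    base_model_prefix = "transformer"
    supports_gradient_checkpointing = True
    # The relative attention bias cannot be expressed through the stock FA2/SDPA kernels,
    # so both flags stay False and only eager attention gets wired up.
    _supports_flash_attn_2 = False
    _supports_sdpa = False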

@DuyguA (Contributor, Author) commented Nov 27, 2025

> Sorry to be so strict about this, but T5 is not a good candidate for flash attention / SDPA. The reason is that the relative attention bias has to be modeled there, and as of now that is not possible with base flash attention (it might be possible with SDPA, but that needs proper mask preparation). tl;dr: it will only support eager attention in the end.
>
> We can still refactor this to have the attention interface-like implementation, but only for eager in the end (i.e. _supports_sdpa/flash_attn remain False). Wdyt?

Sounds reasonable to me!

@DuyguA (Contributor, Author) commented Dec 2, 2025

Hey again @vasqu, I made the changes to restrict attention to eager only. Model tests are passing; only the repo consistency checks fail, as I mentioned above. The PR is ready for merge 😊

@github-actions bot commented Dec 2, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: t5

@vasqu (Contributor) left a comment

Some initial comments. It would be nice if we could go further to include the output recorder and avoid the unnecessary code around output_xxx.

return hidden_states


def eager_attention_forward(

I would rather have the relative position bias within here; see #38301, or more specifically:

def eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: Optional[float] = None,
    dropout: float = 0.0,
    head_mask: Optional[torch.Tensor] = None,
    use_cache: Optional[bool] = None,
    **kwargs: Unpack[TransformersKwargs],
):
    if scaling is None:
        scaling = query.size(-1) ** -0.5

    # Take the dot product between "query" and "key" to get the raw attention scores.
    attn_weights = torch.matmul(query, key.transpose(2, 3))

    # Relative positional embeddings
    if module.position_embedding_type == "relative_key" or module.position_embedding_type == "relative_key_query":
        query_length, key_length = query.shape[2], key.shape[2]
        if use_cache:
            position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=query.device).view(-1, 1)
        else:
            position_ids_l = torch.arange(query_length, dtype=torch.long, device=query.device).view(-1, 1)
        position_ids_r = torch.arange(key_length, dtype=torch.long, device=query.device).view(1, -1)
        distance = position_ids_l - position_ids_r

        positional_embedding = module.distance_embedding(distance + module.max_position_embeddings - 1)
        positional_embedding = positional_embedding.to(dtype=query.dtype)  # fp16 compatibility

        if module.position_embedding_type == "relative_key":
            relative_position_scores = torch.einsum("bhld,lrd->bhlr", query, positional_embedding)
            attn_weights = attn_weights + relative_position_scores
        elif module.position_embedding_type == "relative_key_query":
            relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query, positional_embedding)
            relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key, positional_embedding)
            attn_weights = attn_weights + relative_position_scores_query + relative_position_scores_key

    # Scaling is shifted in case of embeddings being relative
    attn_weights = attn_weights * scaling

    if attention_mask is not None and attention_mask.ndim == 4:
        attention_mask = attention_mask[:, :, :, : key.shape[-2]]
        attn_weights = attn_weights + attention_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)

    if head_mask is not None:
        attn_weights = attn_weights * head_mask

    attn_output = torch.matmul(attn_weights, value)
    attn_output = attn_output.transpose(1, 2).contiguous()

    return attn_output, attn_weights
(This snippet is no longer on main, but it should give you an idea of how this should look.)
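
Adapting that to T5 would roughly mean adding position_bias to the raw scores instead of the relative_key embeddings above. A minimal sketch of the idea (my reading, not the exact code this PR needs; note that T5 applies no 1/sqrt(d) scaling inside attention and keeps the softmax in float32):

# Sketch under the assumption that position_bias already has shape (batch, n_heads, q_len, kv_len).
from typing import Optional

import torch
from torch import nn


def t5_eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    position_bias: Optional[torch.Tensor] = None,
    dropout: float = 0.0,
    **kwargs,
):
    # Raw scores; T5 keeps the scale at 1.0 (it is folded into the weight initialization).
    scores = torch.matmul(query, key.transpose(3, 2))
    # The relative attention bias goes straight onto the scores.
    if position_bias is not None:
        scores = scores + position_bias
    if attention_mask is not None and attention_mask.ndim == 4:
        scores = scores + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(scores)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights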

"when creating this class."
)

self.scaling = self.d_model**-0.5

For completeness, we should have the is_causal flag here; you can look into Bart for this, i.e. always False in the encoder, and in the decoder True for self-attention or False for cross-attention.
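
A sketch of that Bart-style convention (the constructor arguments below are illustrative, not the actual T5Attention signature):

from torch import nn


class T5Attention(nn.Module):
    def __init__(self, config, is_decoder: bool = False, is_cross_attention: bool = False):
        super().__init__()
        # Only decoder self-attention is causal; encoder and cross-attention are not.
        self.is_causal = is_decoder and not is_cross_attention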

Comment on lines +380 to +383
if self.config._attn_implementation != "eager":
    logger.warning_once(
        "T5 uses relative position bias; SDPA/FlashAttention not supported, fall back to eager."
    )

This should never happen, as we don't support anything other than eager. If anything, I would even raise an error here.
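
For example, something along these lines (a sketch, not this PR's code):

# Fail loudly instead of warning, since nothing but the eager path is wired up.
if self.config._attn_implementation != "eager":
    raise ValueError(
        "T5 relies on a relative position bias that the SDPA/FlashAttention paths cannot represent; "
        f"got attn_implementation={self.config._attn_implementation!r}, please use 'eager'."
    )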

Comment on lines +301 to +310
    hidden_states: torch.FloatTensor,
    key_value_states: Optional[torch.FloatTensor] = None,
    past_key_values: Optional[torch.FloatTensor] = None,
    attention_mask: Optional[torch.FloatTensor] = None,
    position_bias: Optional[torch.FloatTensor] = None,
    query_length: Optional[torch.LongTensor] = None,
    output_attentions: Optional[bool] = False,
    cache_position: Optional[torch.LongTensor] = None,
    **kwargs: Unpack[FlashAttentionKwargs],
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
Let's not rename here, as this would break BC. The type annotations are fine by themselves.

config: T5Config
base_model_prefix = "transformer"
supports_gradient_checkpointing = True
_supports_attention_backend = True
Not supported: kwargs are not used everywhere so far, and enc-dec will need another look.

attn_output = attn_output.view(batch_size, -1, self.inner_dim).contiguous()
attn_output = self.o(attn_output)

outputs = (attn_output, position_bias)
It would be nice if we could refactor this along the way in this PR; we have an OutputRecorder which can handle collecting the weights, so we no longer need to pass the kwargs around explicitly. You can take a look at other models like Llama or t5gemma2 which do this. In essence, you need the decorators (check_model_input, can_return_tuple) and the respective flag _can_record_outputs.
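
Roughly the pattern, sketched from how Llama-style models do it (the exact decorator and attribute names may differ depending on the transformers version):

# Sketch of the recorder pattern; T5Attention/T5Block refer to the modules in this modeling file.
from transformers.utils.generic import check_model_inputs


class T5Stack(T5PreTrainedModel):
    # Maps output fields to the submodules whose outputs should be recorded,
    # so attention weights no longer need to be threaded through return tuples by hand.
    _can_record_outputs = {
        "attentions": T5Attention,
        "hidden_states": T5Block,
    }

    @check_model_inputs
    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        ...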



Development

Successfully merging this pull request may close these issues.

Community contribution: Adding Flash Attention 2 support for more architectures
