
Conversation

@Aravind-11 (Contributor) commented Nov 10, 2025

What does this PR do?

Implements SDPA for OWL-ViT.

Fixes #28103

Before submitting

Who can review?

@vasqu @younesbelkada

@Aravind-11 (Contributor, Author) commented:

> What does this PR do?
>
> Implements SDPA for OWL-ViT. Revamp of #28818
>
> Fixes #28103
>
> […]

I ran `RUN_SLOW=1 python -m pytest tests/models/owlvit/test_modeling_owlvit.py` against the original OWL-ViT implementation, and it seemed to fail the same tests as my current implementation. I'm not sure what to infer from that.

@vasqu (Contributor) left a comment

Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. I think ViT is a good example in this case, e.g. see https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py

Until this is changed to those standards, I won't take a proper look for now.
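(For anyone following along: here is roughly what that unified pattern looks like. This is a minimal sketch assuming a recent transformers where ALL_ATTENTION_FUNCTIONS is exposed from modeling_utils; the module attributes and the eager helper are simplified stand-ins, not the final OWL-ViT code.)

```python
import torch
from torch import nn
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Plain softmax attention; the fallback path of the unified class.
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)
    # Registry functions return [batch, seq, heads, head_dim], so match that layout.
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights


class UnifiedAttention(nn.Module):
    """One attention class; the flavor (eager/sdpa/fa2/flex) is chosen at runtime."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // self.num_heads
        self.scaling = self.head_dim**-0.5
        self.dropout = config.attention_dropout
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)
        self.out_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.is_causal = False  # vision encoders attend bidirectionally

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        batch, seq_len, _ = hidden_states.shape
        query, key, value = (
            t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            for t in self.qkv(hidden_states).chunk(3, dim=-1)
        )

        attention_interface = eager_attention_forward
        if self.config._attn_implementation != "eager":
            # sdpa / flash_attention_2 / flex_attention all live behind this registry.
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self,
            query,
            key,
            value,
            attention_mask,
            dropout=0.0 if not self.training else self.dropout,
            scaling=self.scaling,
            is_causal=self.is_causal,
            **kwargs,
        )
        attn_output = attn_output.reshape(batch, seq_len, -1).contiguous()
        return self.out_proj(attn_output), attn_weights
```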

@Aravind-11 (Contributor, Author) commented:

> Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. […]

Got it. Thanks a lot!

@Aravind-11 (Contributor, Author) commented:

> Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. […]

I made similar changes as in ViT and removed the separate SDPA class. Let me know what you think!

@vasqu (Contributor) left a comment

Added some comments, but in general it would be best to have a green CI before requesting a review. Atm, things are likely not working as expected.

Comment on lines 716 to 724
```diff
-causal_attention_mask = _create_4d_causal_attention_mask(
-    input_shape, hidden_states.dtype, device=hidden_states.device
-)
+# OWL-ViT uses a bidirectional (non-causal) encoder.
+attention_mask = create_bidirectional_mask(
+    config=self.config,
+    input_embeds=hidden_states,
+    attention_mask=attention_mask,
+)
-# expand attention_mask
-if attention_mask is not None:
-    # [num_samples, seq_len] -> [num_samples, 1, tgt_seq_len, src_seq_len]
-    attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
```
@vasqu (Contributor):

This seems to suffer from the same issue as in #41750

It does not use a bidirectional mask, but a causal mask:

  • The first mask is a basic causal mask
  • The second is a padding mask
  • These are added together, creating a causal mask with padding included (see the sketch below)
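(To make the "added together" point concrete, here is a tiny self-contained illustration with made-up shapes; this is not the transformers mask utilities themselves, just the additive-combination idea.)

```python
import torch

# Hypothetical example: batch of 2, sequence length 4.
batch, seq_len, dtype = 2, 4, torch.float32
min_val = torch.finfo(dtype).min

# Causal part: (1, 1, seq_len, seq_len); zero where attending is allowed,
# a large negative value above the diagonal (future positions).
causal = torch.full((seq_len, seq_len), min_val).triu(1)[None, None]

# Padding part: (batch, 1, 1, seq_len), built from a [batch, seq_len] 0/1 mask;
# the second sample has its last two tokens padded out.
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
padding = (1.0 - attention_mask[:, None, None, :].to(dtype)) * min_val

# Adding them yields a causal mask with padding included: the mask the
# original (CLIP-style) OWL-ViT text encoder actually wants.
combined = causal + padding
print(combined.shape)  # torch.Size([2, 1, 4, 4])
```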

@vasqu (Contributor):

This also may need to adjust the is_causal argument dynamically as in the PR I linked, although I'm not sure if it's just causal in general.
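(For reference, the dynamic adjustment usually takes this shape with SDPA; a sketch under the assumption that PyTorch's scaled_dot_product_attention is the backend, not the exact final code.)

```python
import torch


def sdpa_attention(query, key, value, attention_mask, module_is_causal):
    # SDPA's fused causal path can only be used when no explicit mask is passed,
    # and a single-token query needs no causal masking at all.
    is_causal = module_is_causal and attention_mask is None and query.shape[2] > 1
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, is_causal=is_causal
    )
```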

@Aravind-11 (Contributor, Author):

Thanks! I made some changes to the code after referring to CLIP: removing output_attentions, return_dict, and causal_attention_mask. I also copied the eager attention part and the attention reshaping from CLIP, and added the flash and flex attention paths too.

I think the current CI is failing because the OWL-ViT config file conflicts with the current encoder implementation. Could you guide me here? Thanks a lot!

@Aravind-11 (Contributor, Author):

> Thanks! I made some changes to the code after referring to CLIP: removing output_attentions, return_dict, and causal_attention_mask. […]

Hi, I investigated the failing OwlViTForObjectDetectionTest::test_eager_matches_sdpa_inference_09_fp32_pad_left.

The failure is due to the test invoking OwlViTForObjectDetection.forward() without providing pixel_values.

OwlViTForObjectDetection requires pixel_values (image tensors) for its vision backbone. When the test omits them, the model raises a ValueError: 'pixel_values' is None.
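(This matches the standard guard at the top of vision-model forwards in transformers; illustrative only, the exact message in modeling_owlvit.py may differ slightly.)

```python
# Typical guard at the start of a vision model's forward pass:
if pixel_values is None:
    raise ValueError("You have to specify pixel_values")
```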

@Aravind-11 (Contributor, Author):

Also, when I run make fix-copies, it adds output_attention and create_causal_mask parameters back to the OwlViTEncoderLayer.forward() function.

@vasqu (Contributor):

Responded here #42136 (comment)

Resolving my previous comments since the state has changed quite a bit from last time

@vasqu (Contributor) commented Nov 17, 2025

@Aravind-11 it seems like ~180 tests are failing, so there might be some really serious regressions. Don't worry about the copies, functionality is more important first.

If pixel values are needed, for example, I would first check whether they were needed before. If yes, see what changed; if not, the input preparation may need to change in the test file.

@github-actions:

[For maintainers] Suggested jobs to run (before merge)

run-slow: owlvit

@Aravind-11 (Contributor, Author) commented:

> @Aravind-11 it seems like ~180 tests are failing, so there might be some really serious regressions. Don't worry about the copies, functionality is more important first.
>
> If pixel values are needed, for example, I would first check whether they were needed before. If yes, see what changed; if not, the input preparation may need to change in the test file.

Hi @vasqu, thank you for the comments! I reverted the changes in OWLv2 back to the original for easier debugging. Only pixel_values seems to be causing issues in OWL-ViT as per the tests.

The pixel_values are needed in the original OwlViTForObjectDetection.forward() implementation too.

```python
def forward(
    self,
    input_ids: torch.Tensor,
    pixel_values: torch.FloatTensor,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    interpolate_pos_encoding: bool = False,
    return_dict: Optional[bool] = None,
)
```

This is the original forward signature. Please let me know what you think. Thanks!

@vasqu (Contributor) commented Nov 17, 2025

Have you checked for a correct prepare_config_and_inputs_for_common as in ViT?

```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    (
        config,
        pixel_values,
        labels,
    ) = config_and_inputs
    inputs_dict = {"pixel_values": pixel_values}
    return config, inputs_dict
```

This might cause these mismatches, as we use that for generation-specific tests (SDPA-to-eager equivalence included). I.e., it needs to be converted to a dict for us to pass as kwargs directly.
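(Concretely, the shared test mixin unpacks that dict straight into the model call, roughly like this; model_tester, model_class, and torch_device stand in for the mixin's internals and are assumptions here.)

```python
import torch

# Roughly what the common tests do with the returned dict:
config, inputs_dict = model_tester.prepare_config_and_inputs_for_common()
model = model_class(config).to(torch_device).eval()
with torch.no_grad():
    outputs = model(**inputs_dict)  # fails if a required input (e.g. pixel_values) is missing
```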

@Aravind-11 (Contributor, Author) commented Nov 17, 2025

> Have you checked for a correct prepare_config_and_inputs_for_common as in ViT? […]
>
> This might cause these mismatches, as we use that for generation-specific tests (SDPA-to-eager equivalence included). I.e., it needs to be converted to a dict for us to pass as kwargs directly.

```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, pixel_values, input_ids, attention_mask = config_and_inputs
    inputs_dict = {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    return config, inputs_dict
```

This is the current prepare_config_and_inputs_for_common() for OwlViTForObjectDetectionTester. Is this overridden?

> Have you checked for a correct prepare_config_and_inputs_for_common as in ViT? […]

```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, input_ids, attention_mask, pixel_values = config_and_inputs
    inputs_dict = {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "return_loss": False,
    }
    return config, inputs_dict
```

Yes, the pixel_values are passed for the vision model tester.

@vasqu (Contributor) commented Nov 18, 2025

Sorry, I have a bit more to do atm and would need to debug this myself! This is weird, but something in the preparation does seem to go wrong 😢

@Aravind-11 (Contributor, Author) commented:

> Sorry, I have a bit more to do atm and would need to debug this myself! This is weird, but something in the preparation does seem to go wrong 😢

Got it! No worries. Let me know if you find something. Thanks.

@Aravind-11 (Contributor, Author) commented:

> Sorry, I have a bit more to do atm and would need to debug this myself! […]
>
> Got it! No worries. Let me know if you find something. Thanks.

I have a doubt, @vasqu: I noticed some models still have output_attentions and return_dict in their forward signatures in spite of having **kwargs (MPT, T5, PaliGemma, Pix2Struct). Why is that? And should all models have SDPA, flash, and flex attention enabled?

@vasqu (Contributor) commented Nov 25, 2025

It's connected but not necessary: the attention interface allows the usage of all attentions (FA, SDPA, flex). The signature along with output_xxx and return_dict is a different matter and is handled via decorators such as can_return_dict or check_model_inputs. The latter needs _can_record_outputs to be properly adjusted as an attribute of that model.

tl;dr: separate things, but usually you need the attention interface first and then the decorators. There are cases where only one has been applied (see the sketch below).
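(A rough sketch of how those pieces fit together; transformers internals move quickly, so the import path and the placeholder class names here are assumptions, not the OWL-ViT code.)

```python
from transformers.utils.generic import check_model_inputs


class MyVisionModel(MyPreTrainedModel):  # placeholder names, not real classes
    # Tells @check_model_inputs which submodules' outputs to record when the
    # caller requests output_attentions / output_hidden_states via kwargs,
    # replacing the old explicit arguments in the forward signature.
    _can_record_outputs = {
        "hidden_states": MyEncoderLayer,
        "attentions": MyAttention,
    }

    @check_model_inputs
    def forward(self, pixel_values, **kwargs):
        ...
```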

@Aravind-11 (Contributor, Author) commented:

> It's connected but not necessary: the attention interface allows the usage of all attentions (FA, SDPA, flex). […]
>
> tl;dr: separate things, but usually you need the attention interface first and then the decorators.

Ohh okay, got it! Makes sense. Thank you!

@Aravind-11 (Contributor, Author) commented:

Hii @vasqu! 😁 Were you able to look into this? Thank you!
