
Conversation

@Aravind-11 (Contributor) commented Nov 10, 2025

What does this PR do?

Implements SDPA for OWL-ViT.

Fixes #28103

Before submitting

Who can review?

@vasqu @younesbelkada

@Aravind-11 (Contributor, Author) commented:

> What does this PR do?
>
> Implements SDPA for OWL-ViT. Revamp of #28818
>
> Fixes #28103
>
> […]

I ran `RUN_SLOW=1 python -m pytest tests/models/owlvit/test_modeling_owlvit.py` against the original OWL-ViT implementation, and it seemed to fail the same tests as my current implementation. I'm not sure what to infer from that.

@vasqu (Contributor) left a comment

Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. I think ViT is a good example in this case, e.g. see https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py

Until this is changed to those standards, I won't take a proper look for now.
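(For anyone following along: here is roughly what that unified pattern looks like. This is a minimal sketch assuming a recent transformers where ALL_ATTENTION_FUNCTIONS is exposed from modeling_utils; the module attributes and the eager helper are simplified stand-ins, not the final OWL-ViT code.)

```python
import torch
from torch import nn
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Plain softmax attention; the fallback path of the unified class.
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)
    # Registry functions return [batch, seq, heads, head_dim], so match that layout.
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights


class UnifiedAttention(nn.Module):
    """One attention class; the flavor (eager/sdpa/fa2/flex) is chosen at runtime."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // self.num_heads
        self.scaling = self.head_dim**-0.5
        self.dropout = config.attention_dropout
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)
        self.out_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.is_causal = False  # vision encoders attend bidirectionally

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        batch, seq_len, _ = hidden_states.shape
        query, key, value = (
            t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            for t in self.qkv(hidden_states).chunk(3, dim=-1)
        )

        attention_interface = eager_attention_forward
        if self.config._attn_implementation != "eager":
            # sdpa / flash_attention_2 / flex_attention all live behind this registry.
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self,
            query,
            key,
            value,
            attention_mask,
            dropout=0.0 if not self.training else self.dropout,
            scaling=self.scaling,
            is_causal=self.is_causal,
            **kwargs,
        )
        attn_output = attn_output.reshape(batch, seq_len, -1).contiguous()
        return self.out_proj(attn_output), attn_weights
```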

@Aravind-11 (Contributor, Author) commented:

> Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. […]

Got it. Thanks a lot!

@Aravind-11 (Contributor, Author) commented:

> Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. […]

I made similar changes as in ViT and removed the separate SDPA class. Let me know what you think!

@vasqu (Contributor) left a comment

Added some comments, but in general it would be best to have a green CI before requesting a review. Atm, things are likely not working as expected.

Comment on lines 716 to 724
```diff
-causal_attention_mask = _create_4d_causal_attention_mask(
-    input_shape, hidden_states.dtype, device=hidden_states.device
-)
+# OWL-ViT uses a bidirectional (non-causal) encoder.
+attention_mask = create_bidirectional_mask(
+    config=self.config,
+    input_embeds=hidden_states,
+    attention_mask=attention_mask,
+)
-# expand attention_mask
-if attention_mask is not None:
-    # [num_samples, seq_len] -> [num_samples, 1, tgt_seq_len, src_seq_len]
-    attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
```
@vasqu (Contributor):

This seems to suffer from the same issue as in #41750

It does not use a bidirectional mask, but a causal mask:

  • The first mask is a basic causal mask
  • The second is a padding mask
  • These are added together, creating a causal mask with padding included (see the sketch below)
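(To make the "added together" point concrete, here is a tiny self-contained illustration with made-up shapes; this is not the transformers mask utilities themselves, just the additive-combination idea.)

```python
import torch

# Hypothetical example: batch of 2, sequence length 4.
batch, seq_len, dtype = 2, 4, torch.float32
min_val = torch.finfo(dtype).min

# Causal part: (1, 1, seq_len, seq_len); zero where attending is allowed,
# a large negative value above the diagonal (future positions).
causal = torch.full((seq_len, seq_len), min_val).triu(1)[None, None]

# Padding part: (batch, 1, 1, seq_len), built from a [batch, seq_len] 0/1 mask;
# the second sample has its last two tokens padded out.
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
padding = (1.0 - attention_mask[:, None, None, :].to(dtype)) * min_val

# Adding them yields a causal mask with padding included: the mask the
# original (CLIP-style) OWL-ViT text encoder actually wants.
combined = causal + padding
print(combined.shape)  # torch.Size([2, 1, 4, 4])
```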

@vasqu (Contributor):

This also may need to adjust the is_causal argument dynamically as in the PR I linked, although I'm not sure if it's just causal in general.
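(For reference, the dynamic adjustment usually takes this shape with SDPA; a sketch under the assumption that PyTorch's scaled_dot_product_attention is the backend, not the exact final code.)

```python
import torch


def sdpa_attention(query, key, value, attention_mask, module_is_causal):
    # SDPA's fused causal path can only be used when no explicit mask is passed,
    # and a single-token query needs no causal masking at all.
    is_causal = module_is_causal and attention_mask is None and query.shape[2] > 1
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, is_causal=is_causal
    )
```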

@Aravind-11 (Contributor, Author):

Thanks! I made some changes to the code after referring to CLIP: removing output_attentions, return_dict, and causal_attention_mask. I also copied the eager attention part and the attention reshaping from CLIP, and added the flash and flex attention paths too.

I think the current CI is failing because the OWL-ViT config file conflicts with the current encoder implementation. Could you guide me here? Thanks a lot!

@Aravind-11 (Contributor, Author):

> Thanks! I made some changes to the code after referring to CLIP: removing output_attentions, return_dict, and causal_attention_mask. […]

Hi, I investigated the failing OwlViTForObjectDetectionTest::test_eager_matches_sdpa_inference_09_fp32_pad_left.

The failure is due to the test invoking OwlViTForObjectDetection.forward() without providing pixel_values.

OwlViTForObjectDetection requires pixel_values (image tensors) for its vision backbone. When the test omits them, the model raises a ValueError: 'pixel_values' is None.
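(This matches the standard guard at the top of vision-model forwards in transformers; illustrative only, the exact message in modeling_owlvit.py may differ slightly.)

```python
# Typical guard at the start of a vision model's forward pass:
if pixel_values is None:
    raise ValueError("You have to specify pixel_values")
```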

@Aravind-11 (Contributor, Author):

Also, when I run make fix-copies, it adds output_attention and create_causal_mask parameters back to the OwlViTEncoderLayer.forward() function.

@vasqu (Contributor):

Responded here #42136 (comment)

Resolving my previous comments since the state has changed quite a bit from last time

@vasqu (Contributor) commented Nov 17, 2025

@Aravind-11 it seems like ~180 tests are failing, so there might be some really serious regressions. Don't worry about the copies, functionality is more important first.

If pixel values are needed, for example, I would first check whether they were needed before. If yes, see what changed; if not, the input preparation may need to change in the test file.

@github-actions:

[For maintainers] Suggested jobs to run (before merge)

run-slow: owlvit

@Aravind-11 (Contributor, Author) commented:

> @Aravind-11 it seems like ~180 tests are failing, so there might be some really serious regressions. Don't worry about the copies, functionality is more important first.
>
> If pixel values are needed, for example, I would first check whether they were needed before. If yes, see what changed; if not, the input preparation may need to change in the test file.

Hi @vasqu, thank you for the comments! I reverted the changes in OWLv2 back to the original for easier debugging. Only pixel_values seems to be causing issues in OWL-ViT as per the tests.

The pixel_values are needed in the original OwlViTForObjectDetection.forward() implementation too.

```python
def forward(
    self,
    input_ids: torch.Tensor,
    pixel_values: torch.FloatTensor,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    interpolate_pos_encoding: bool = False,
    return_dict: Optional[bool] = None,
)
```

This is the original forward signature. Please let me know what you think. Thanks!

@vasqu (Contributor) commented Nov 17, 2025

Have you checked for a correct prepare_config_and_inputs_for_common as in ViT?

```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    (
        config,
        pixel_values,
        labels,
    ) = config_and_inputs
    inputs_dict = {"pixel_values": pixel_values}
    return config, inputs_dict
```

This might cause these mismatches, as we use that for generation-specific tests (SDPA-to-eager equivalence included). I.e., it needs to be converted to a dict for us to pass as kwargs directly.
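(Concretely, the shared test mixin unpacks that dict straight into the model call, roughly like this; model_tester, model_class, and torch_device stand in for the mixin's internals and are assumptions here.)

```python
import torch

# Roughly what the common tests do with the returned dict:
config, inputs_dict = model_tester.prepare_config_and_inputs_for_common()
model = model_class(config).to(torch_device).eval()
with torch.no_grad():
    outputs = model(**inputs_dict)  # fails if a required input (e.g. pixel_values) is missing
```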

@Aravind-11 (Contributor, Author) commented Nov 17, 2025

> Have you checked for a correct prepare_config_and_inputs_for_common as in ViT? […]
>
> This might cause these mismatches, as we use that for generation-specific tests (SDPA-to-eager equivalence included). I.e., it needs to be converted to a dict for us to pass as kwargs directly.

```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, pixel_values, input_ids, attention_mask = config_and_inputs
    inputs_dict = {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    return config, inputs_dict
```

This is the current prepare_config_and_inputs_for_common() for OwlViTForObjectDetectionTester. Is this overridden?

> Have you checked for a correct prepare_config_and_inputs_for_common as in ViT? […]

```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, input_ids, attention_mask, pixel_values = config_and_inputs
    inputs_dict = {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "return_loss": False,
    }
    return config, inputs_dict
```

Yes, the pixel_values are passed for the vision model tester.

@vasqu (Contributor) commented Nov 18, 2025

Sorry, I have a bit more to do atm and would need to debug this myself! This is weird, but something in the preparation does seem to go wrong 😢

@Aravind-11 (Contributor, Author) commented:

> Sorry, I have a bit more to do atm and would need to debug this myself! This is weird, but something in the preparation does seem to go wrong 😢

Got it! No worries. Let me know if you find something. Thanks.

@Aravind-11 (Contributor, Author) commented:

> Sorry, I have a bit more to do atm and would need to debug this myself! […]
>
> Got it! No worries. Let me know if you find something. Thanks.

I have a doubt, @vasqu: I noticed some models still have output_attentions and return_dict in their forward signatures in spite of having **kwargs (MPT, T5, PaliGemma, Pix2Struct). Why is that? And should all models have SDPA, flash, and flex attention enabled?

@vasqu (Contributor) commented Nov 25, 2025

It's connected but not necessary: the attention interface allows the usage of all attentions (FA, SDPA, flex). The signature along with output_xxx and return_dict is a different matter and is handled via decorators such as can_return_dict or check_model_inputs. The latter needs _can_record_outputs to be properly adjusted as an attribute of that model.

tl;dr: separate things, but usually you need the attention interface first and then the decorators. There are cases where only one has been applied (see the sketch below).
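(A rough sketch of how those pieces fit together; transformers internals move quickly, so the import path and the placeholder class names here are assumptions, not the OWL-ViT code.)

```python
from transformers.utils.generic import check_model_inputs


class MyVisionModel(MyPreTrainedModel):  # placeholder names, not real classes
    # Tells @check_model_inputs which submodules' outputs to record when the
    # caller requests output_attentions / output_hidden_states via kwargs,
    # replacing the old explicit arguments in the forward signature.
    _can_record_outputs = {
        "hidden_states": MyEncoderLayer,
        "attentions": MyAttention,
    }

    @check_model_inputs
    def forward(self, pixel_values, **kwargs):
        ...
```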

@Aravind-11 (Contributor, Author) commented:

> It's connected but not necessary: the attention interface allows the usage of all attentions (FA, SDPA, flex). […]
>
> tl;dr: separate things, but usually you need the attention interface first and then the decorators.

Ohh okay, got it! Makes sense. Thank you!

@Aravind-11 (Contributor, Author) commented:

Hii @vasqu! 😁 Were you able to look into this? Thank you!
