
Conversation

@tomaarsen
Member

What does this PR do?

  • Add return_dict to get_text_features & get_image_features methods to allow returning 'BaseModelOutputWithPooling'

Fixes #42401

Well, the architectures that support get_image_features are all extremely different, with wildly different outputs from their get_image_features methods:

  • 2d outputs,
  • 3d outputs,
  • lists of 2d outputs (due to non-matching shapes),
  • an existing 'return_attentions' argument resulting in a 2-tuple being returned,
  • an existing 'return_dict' argument resulting in 3-tuples being returned (???),
  • high quality image embeddings,
  • low quality image embeddings,
  • deepstack image embeddings,
  • etc. etc. etc.

And I only went through like 70-80% of all architectures with get_image_features before I gave up.

Standardisation of all of these sounds like a lost cause. cc @zucchini-nlp, I'm curious about your thoughts here. When I did some preliminary research I only ran into a handful of cases, and I figured we'd be able to bring them all into a single format, but I'm not sure anymore. I added # NOTE: @Tom ... wherever I figured standardisation might run into big problems.

For get_text_features it's a lot simpler: only one architecture (blip-2) differs from all the others.

I haven't started on get_audio_features and get_video_features yet, but there's not much point if we can't get get_image_features normalized.
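For reference, here is a minimal sketch of the intended usage once the flag is in, using a CLIP-style checkpoint purely as an example (the exact kwargs a given architecture accepts may differ):

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
inputs = tokenizer(["a photo of a cat"], return_tensors="pt")

# Current behaviour: a plain tensor of pooled/projected text embeddings
text_embeds = model.get_text_features(**inputs)

# With this PR: return_dict=True yields a BaseModelOutputWithPooling instead
out = model.get_text_features(**inputs, return_dict=True)
print(out.pooler_output.shape)      # pooled/projected embeddings, as before
print(out.last_hidden_state.shape)  # token-level hidden states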

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @ArthurZucker @Cyrilvallez

  • Tom Aarsen

…ModelOutputWithPooling'

Added to all architectures except blip-2, which has a rather different structure here: it uses 'Blip2TextModelWithProjection' to produce these embeddings/features, but that class isn't as simple to use.
…eModelOutputWithPooling'

@github-actions
Contributor

github-actions bot commented Dec 2, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, align, altclip, aria, aya_vision, blip, blip_2, chameleon, chinese_clip, clap, clip, clipseg, clvp, cohere2_vision, deepseek_vl, deepseek_vl_hybrid

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@zucchini-nlp zucchini-nlp left a comment


We discussed this internally and decided to add last_hidden_states to all models, as the last state from the vision block. The pooled embeddings will keep their different shapes as they are.

For the last hidden state, the shapes are already more standardized, with a few major options. The only special cases might be qwen-like models, where each image encoding has a different sequence length and the outputs are therefore concatenated as length*dim.
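A rough sketch of that contract (illustrative pseudocode, not any specific model; the pooling/projection step stays model-specific):

from transformers.modeling_outputs import BaseModelOutputWithPooling

def get_image_features(self, pixel_values, return_dict=False, **kwargs):
    vision_outputs = self.vision_model(pixel_values, **kwargs)  # vision block, name varies per model
    image_features = self.pool_or_project(vision_outputs)  # model-specific, shapes keep differing
    if not return_dict:
        return image_features
    return BaseModelOutputWithPooling(
        last_hidden_state=vision_outputs.last_hidden_state,  # standardized: last state of the vision block
        pooler_output=image_features,  # stays in its current, model-specific shape
    )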

vision_embeddings = self.get_input_embeddings()(image_tokens)
return vision_embeddings
image_embeddings = self.get_input_embeddings()(image_tokens)
image_features = image_embeddings.mean(dim=1)
Member

Taking the mean is not what we want for VLMs. They are supposed to return image embeddings in a format that can be concatenated with the text embeddings.
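For context on why a mean-pooled (batch, dim) shape can't work here: in these VLMs the image features get merged into the text embedding sequence at the image-placeholder positions, so the number of returned vectors has to match the number of image tokens. A generic sketch of that merge pattern (names are illustrative, not taken from a specific model):

import torch

def merge_image_features(inputs_embeds, image_features, special_image_mask):
    # inputs_embeds:      (batch, seq_len, hidden_dim) text embeddings
    # image_features:     (num_image_tokens, hidden_dim) output of get_image_features
    # special_image_mask: (batch, seq_len) True at image-placeholder positions
    mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, image_features.to(inputs_embeds.dtype))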

Member Author

Ah, yes. This was just a quick test to experiment with the shapes. I didn't realise I kept it in.


if return_dict:
return BaseModelOutputWithPooling(
last_hidden_state=image_embeddings,
Member

With chameleon it is a bit vague. The vision quantizer could return hidden_states before quantizing them, which I believe is the last hidden state we want.
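As a generic illustration of what "return hidden_states before quantizing" could mean (toy vector quantizer, not chameleon's actual module):

import torch
import torch.nn as nn

class ToyVectorQuantizer(nn.Module):
    # Toy example only; chameleon's real quantizer is more involved.
    def __init__(self, num_codes=16, dim=8):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, hidden_states, return_pre_quant=False):
        # hidden_states: (batch, seq_len, dim) activations from the vision encoder
        flat = hidden_states.reshape(-1, hidden_states.shape[-1])
        indices = torch.cdist(flat, self.codebook.weight).argmin(-1)
        quantized = self.codebook(indices).reshape(hidden_states.shape)
        if return_pre_quant:
            # The pre-quantization activations are the candidate last_hidden_state.
            return quantized, hidden_states
        return quantized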

]
image_features = self.get_input_embeddings()(image_tokens)
image_features = torch.split(image_features, split_sizes)
# NOTE: @Tom Not easily converted to the standard format
Member

Yeah, same as chameleon. We would first need to start returning hidden states from the VQ module.

image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
split_sizes = (image_grid_thw.prod(-1) // self.visual.spatial_merge_size**2).tolist()
image_embeds = torch.split(image_embeds, split_sizes)
# NOTE: @Tom Not easily converted to the standard format
Member

This is the same as the qwen-vl models, with the last hidden state being of shape (bs, len*pooled_dim). The visual block returns only pooled outputs IIRC, so we might need to also change the vision block.
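For illustration, one way the concatenated qwen-style output could be coerced into a padded (num_images, max_len, dim) tensor plus a validity mask, should we ever want a batched format (dummy values stand in for the real image_embeds / split_sizes from the diff above):

import torch
from torch.nn.utils.rnn import pad_sequence

split_sizes = [4, 7, 3]  # per-image token counts, e.g. derived from image_grid_thw
image_embeds = torch.randn(sum(split_sizes), 8)  # (total_len, dim) concatenation over all images

per_image = list(torch.split(image_embeds, split_sizes))  # list of (len_i, dim) tensors
padded = pad_sequence(per_image, batch_first=True)  # (num_images, max_len, dim)
lengths = torch.tensor(split_sizes)
mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]  # True at non-padded positions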

pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`):
The tensors corresponding to the input images.
"""
# NOTE: @Tom perhaps we should just raise an error here instead?
Member

This fn should simply be removed, because the model doesn't work with images. It was a bad copy from modular at the time 🫠

return BaseModelOutputWithPooling(
last_hidden_state=vision_model_output.last_hidden_state,
pooler_output=image_embeds,
attentions=projection_attentions, # TODO: @Tom does this match expectations?
Member

I'd say not really, since these look like the attentions of the vision-pooling module. Very model-specific; most poolers I've seen aren't attention-based.

if return_dict:
return BaseModelOutputWithPooling(
last_hidden_state=image_embeds,
# pooler_output=image_features, # NOTE: @Tom no pooled embeddings here
Member

Same thing here: image_embeds are actually pooled embeddings, and the last hidden state is not returned from visual.



Development

Successfully merging this pull request may close these issues.

The get_(text|image|audio|video)_features methods have inconsistent output formats, needs aligning for Sentence Transformers

3 participants