Add LightOnOCR model implementation #41621
base: main
Conversation
zucchini-nlp
left a comment
Hey @baptiste-aubertin, thanks a lot for working on it! Nice PR, and I left comments mostly to push the model toward better standardization. Lmk when you need a second review
molbap
left a comment
Hey @baptiste-aubertin, continued the review! Also, for naming, the model would be better named LightOnOcr rather than ..OCR, but that's a casing nit. For the tests not passing, adding the Text and Vision models to the IGNORE_NON_TESTED list in utils/check_repo should be enough.
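For reference, a sketch of what that could look like in utils/check_repo.py — the exact class names are an assumption based on this PR's naming:

```python
# utils/check_repo.py -- sketch only; class names assumed from this PR's naming.
IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    # ... existing entries ...
    "LightOnOCRTextModel",    # Building part of a bigger (tested) composite model.
    "LightOnOCRVisionModel",  # Building part of a bigger (tested) composite model.
]
```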
I recommend rebasing and running the linter again, from your branch:
git checkout main
git pull
git checkout -
git merge main
pip install -e .[quality]
make fixup
and that should get rid of the code quality issues.
Force-pushed from 0b34715 to 6dd60a9
Hi @molbap @zucchini-nlp! Thanks for the detailed review and the instructions! I went through your comments and I'm still running into test issues. The problem is that modular is trying to reimplement a common attention mechanism for both the Pixtral vision model and the Qwen3 text model, and this is causing some conflicts. Have you dealt with this before? Not sure how to handle that :(
molbap
left a comment
I think you're right, some of the currently failing tests are indeed because the text decoder ends up using eager_attention_forward from the Pixtral module, which uses MHA, but the Qwen3 module needs GQA/repeated key values. Here the repeat should be 2, so you get a factor-2 issue in your matmuls 😬 To circumvent it you can rewrite the needed eager_attention_forward with repeat_kv so you'll have GQA in the decoder as expected. I outlined changes in the vision modules to silo the inheritance a bit, and similar ones might be needed in the text modules. Let me know and ping me when you've done a new pass!
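For context, a sketch of what the GQA-aware rewrite could look like, following the repeat_kv / eager_attention_forward pattern used by the library's decoder models — treat this as illustrative rather than the exact code this PR should contain:

```python
import torch
from torch import nn


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand key/value heads from (batch, num_kv_heads, seq, dim) to (batch, num_kv_heads * n_rep, seq, dim)."""
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Repeat the key/value heads so GQA (e.g. a repeat factor of 2 here) lines up with the query heads.
    key_states = repeat_kv(key, module.num_key_value_groups)
    value_states = repeat_kv(value, module.num_key_value_groups)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key_states.shape[-2]]

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights
```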
molbap
left a comment
Looking much better! Left what I think are my last comments, once addressed we'll ping core maintainer for final review & merge 🥳
Force-pushed from 4bdd305 to ee8f9fc
Hi, I think all the tests pass and I rebased the branch :). Please tell me if I need to do anything else.
run-slow: lightonocr
This comment contains run-slow, running the specified jobs: models: ['models/lightonocr']
molbap
left a comment
Left what I think are my last nit comments + a non-nit comment on testing, as we only test part of the output for now in the IntegrationTest; the rest seems copacetic :) Well done, ccing the core reviewer
@property
def vision_model(self):
    """Alias for vision_encoder to match standard composite model naming."""
    return self.vision_encoder
ah, I see. No worries, we'll be able to change it in v5 of transformers with our dynamic weights converter
# Since we use packing, if flash_attention_2 is selected we rely on position_ids
if self.config._attn_implementation == "flash_attention_2":
    kwargs["position_ids"] = kwargs["position_ids"].to(hidden_states.device, non_blocking=True)
couldn't we pass the position ids anyway?
I copied this forward from Pixtral but yes maybe. Gonna try
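If that pans out, the change could be as small as dropping the flash-attention guard — a sketch, assuming position_ids is always present in kwargs at that point:

```python
# Sketch: forward position_ids regardless of the attention implementation,
# instead of gating on flash_attention_2 (assumes kwargs carries position_ids here).
if kwargs.get("position_ids") is not None:
    kwargs["position_ids"] = kwargs["position_ids"].to(hidden_states.device, non_blocking=True)
```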
# Since we use packing, if flash_attention_2 is selected we rely on position_ids
# Check that the model generated non-empty text
# The exact output depends on the trained model, but it should contain some OCR text
self.assertIsNotNone(decoded_output)
self.assertIsInstance(decoded_output, str)
self.assertGreater(len(decoded_output.strip()), 0, "Model should generate non-empty OCR output")

# Check that the model correctly extracted the date from the receipt
self.assertIn(
    "25/12/2018", decoded_output, "Model should extract the date '25/12/2018' from the receipt image"
)
For that model, don't we have a deterministic idea of what its output is? To be more robust to regressions, just compare to the full expected output, no?
I added an output comparison against the ground truth using difflib for that :)
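A sketch of what a difflib-based check could look like inside the integration test — decoded_output and EXPECTED_OCR_TEXT are illustrative names, and the 0.95 threshold is an assumption:

```python
import difflib

# Compare the generated transcription against a full expected transcription and
# require a high similarity ratio instead of an exact string match.
similarity = difflib.SequenceMatcher(None, decoded_output, EXPECTED_OCR_TEXT).ratio()
self.assertGreaterEqual(
    similarity,
    0.95,
    f"OCR output diverged from the expected transcription (similarity={similarity:.3f})",
)
```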
There's a couple of errors:

FAILED tests/models/lightonocr/test_modeling_lightonocr.py::LightOnOCRForConditionalGenerationIntegrationTest::test_model_can_generate_without_images - AttributeError: 'LightOnOCRConfig' object has no attribute 'vocab_size'
FAILED tests/models/lightonocr/test_modeling_lightonocr.py::LightOnOCRForConditionalGenerationIntegrationTest::test_model_forward_with_images - AttributeError: 'LightOnOCRConfig' object has no attribute 'vocab_size'

so just a config key missing / a wrong access IMO. The other one, which is suspicious, is

def vision_apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.
    .......
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
>   q_embed = (q * cos) + (vision_rotate_half(q) * sin)
E   RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

which is on a multi-GPU setting only, but means that the q tensor should be created on a safer device, or at least moved within this util function to be safe.
I added a vocab_size getter and put q and k on the same device as cos and sin :)
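A sketch of the two fixes as described, assuming the class and helper names from the error messages above; the exact code in the PR may differ:

```python
import torch
from transformers import PretrainedConfig


class LightOnOCRConfig(PretrainedConfig):  # simplified; only the added property is shown
    model_type = "lightonocr"

    @property
    def vocab_size(self):
        # Delegate to the text sub-config so `config.vocab_size` resolves on the composite config.
        return self.text_config.vocab_size


def vision_rotate_half(x):
    """Standard RoPE helper: rotate half of the last dimension."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def vision_apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Apply rotary embeddings, moving q/k onto the device of cos/sin so a model
    sharded across several GPUs never mixes devices inside this helper."""
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q = q.to(cos.device)
    k = k.to(cos.device)
    q_embed = (q * cos) + (vision_rotate_half(q) * sin)
    k_embed = (k * cos) + (vision_rotate_half(k) * sin)
    return q_embed, k_embed
```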
Hi @molbap, thanks again for your help! Could you please re-trigger the slow tests to make sure everything passes before the final reviewer steps in :)? Thanks
run-slow: lightonocr
This comment contains models: ["models/lightonocr"]
CI Results: ✅ No failing test specific to this PR 🎉!
zucchini-nlp
left a comment
Hey @baptiste-aubertin, while we're waiting on the core maintainer's review, I left a few comments and questions below, mostly about things that look redundant to me. LMK what you think and whether we can delete those?
# These should be set on the tokenizer before creating the processor
self.image_token = getattr(tokenizer, "image_token", "<|image_pad|>")
self.image_break_token = getattr(tokenizer, "image_break_token", "<|vision_pad|>")
self.image_end_token = getattr(tokenizer, "image_end_token", "<|vision_end|>")
Can we update the tokenizer on the Hub instead? These were used as fallbacks for BC in old models; for new models IMO we can directly change the tokenizer.
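A sketch of the direct-access version, assuming the tokenizer on the Hub is updated to define these attributes:

```python
# Processor __init__ sketch: read the special tokens straight from the tokenizer
# (requires the Hub tokenizer to carry these attributes, per the suggestion above).
self.image_token = tokenizer.image_token
self.image_break_token = tokenizer.image_break_token
self.image_end_token = tokenizer.image_end_token
```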
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_eager_matches_fa2_generate(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_eager_matches_sdpa_generate(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_flash_attn_2_from_config(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_flash_attn_2_inference_equivalence(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_flash_attn_2_inference_equivalence_right_padding(self): | ||
| pass |
These should be skipped automatically if the model doesn't support it. Do we have supports_flash_attention = False set in the vision encoder?
I copied this from the Pixtral tests. But yes, I can try this flag instead.
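A sketch of the flag-based approach: the attribute name supports_flash_attention is taken verbatim from the comment above and the class name is assumed from this PR's naming, so the flag the common tests actually read may be spelled differently.

```python
from transformers import PreTrainedModel


class LightOnOCRVisionModel(PreTrainedModel):  # class name assumed from this PR's naming
    # Mark the vision encoder as not supporting flash attention, per the suggestion above,
    # instead of listing manual @unittest.skip overrides in the test file.
    supports_flash_attention = False
```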
Hi @zucchini-nlp, thank you! This time I will try to answer quickly 😅
Force-pushed from 5c88a1d to 4c22024
run-slow: lightonocr
This comment contains models: ["models/lightonocr"]
- Add vocab_size property to LightOnOCRConfig that delegates to text_config
- Fix test parameter name from image_token_index to image_token_id
- Add Unpack type hint to processor __call__ kwargs
- Remove unnecessary comments from modeling forward method

…d weights mapping #0
- Remove custom _init_weights methods (handled by base class)
- Update _tied_weights_keys to dict format with explicit mapping
- Update documentation date

…lacement #0
- Use config.text_config.vocab_size instead of config.vocab_size for composite config
- Remove explicit device placement from attention_mask and image_sizes tensors
- Allow device_map='auto' to handle device placement in model parallelism tests
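For the tied-weights change, a sketch of the dict-style mapping; the exact parameter paths and the mapping direction are assumptions based on how other composite VLMs in the library tie the LM head to the embedding:

```python
# Sketch only: dict-style tied-weights mapping, pointing the tied lm_head weight at the
# embedding weight it shares parameters with. Paths are illustrative, not from this PR.
_tied_weights_keys = {"lm_head.weight": "model.language_model.embed_tokens.weight"}
```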
Force-pushed from 70bbb89 to 1f91d31
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, lightonocr
Hi, I rebased this morning and all the tests are now passing 👍 😄
We have had very good results fine-tuning on LightOnOCR, and are looking forward to this being merged so we can use it in production. |
What does this PR do?
Implementation of the LightOnOCR model following the Modular Transformers architecture.
Our model is a 1B parameter OCR model using Pixtral as the vision encoder and Qwen3 as the LLM decoder.
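Not part of the original description, but a minimal usage sketch under assumptions: the checkpoint id, prompt format, chat-template availability, and auto classes below are illustrative and may not match the final release.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical checkpoint id; the actual Hub repository may differ.
checkpoint = "lightonai/LightOnOCR-1B"

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Single-image OCR request via the chat template (assumes the processor ships one).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("document.png")},
            {"type": "text", "text": "Transcribe this document."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```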
I still have an issue with auto configuration, which I'm not familiar with:
🚨 Config not found for lightonocr. You can manually add it to HARDCODED_CONFIG_FOR_MODELS in utils/auto_docstring.py
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.