
Conversation

@baptiste-aubertin commented Oct 15, 2025

What does this PR do?

Implementation of the LightOnOCR model following the Modular Transformers architecture.

Our model is a 1B-parameter OCR model that uses Pixtral as the vision encoder and Qwen3 as the LLM decoder.

I still have an issue with auto configuration, which I'm not familiar with:

🚨 Config not found for lightonocr. You can manually add it to HARDCODED_CONFIG_FOR_MODELS in utils/auto_docstring.py

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp (Member) left a comment

Hey @baptiste-aubertin, thanks a lot for working on this! Nice PR. I left comments mostly to push the model toward better standardization. Let me know when you need a second review.

@molbap self-requested a review October 24, 2025 07:52
@molbap (Contributor) left a comment

Hey @baptiste-aubertin, continued the review! Also, on naming: the model would be better named LightOnOcr rather than ...OCR, but that's a casing nit. For the tests not passing, adding the Text and Vision models to the IGNORE_NON_TESTED list in utils/check_repo should be enough.
I recommend rebasing and running the linter again, from your branch:

git checkout main
git pull
git checkout -
git merge main
pip install -e .[quality]
make fixup

and that should get rid of the code quality issues.
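For the IGNORE_NON_TESTED change mentioned above, the edit could look roughly like this (a sketch; the exact model class names are assumptions based on the PR's naming):

# utils/check_repo.py
IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    # ... existing entries ...
    # Building blocks of the composite model; covered by the LightOnOCR integration tests.
    "LightOnOCRTextModel",
    "LightOnOCRVisionModel",
]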

@baptiste-aubertin force-pushed the main branch 2 times, most recently from 0b34715 to 6dd60a9 on October 27, 2025 22:37
@baptiste-aubertin (Author) commented Oct 27, 2025

Hi @molbap @zucchini-nlp! Thanks for the detailed review and the instructions!
Sorry I didn't get back to you last week, I was a little overloaded with the release.

I went through your comments and I'm still running into test issues. The problem is that modular is trying to reimplement a common attention mechanism for both the Pixtral vision model and the Qwen3 text model, and this is causing some conflicts. Have you dealt with this before? Not sure how to handle that :(

@molbap (Contributor) left a comment
I think you're right, some of the currently failing tests are indeed because the text decoder ends up using eager_attention_forward from the Pixtral module, which uses MHA, while the Qwen3 module needs GQA / repeated key values. Here the repeat factor should be 2, so you get a factor-2 shape mismatch in your matmuls 😬 To circumvent it, you can rewrite the needed eager_attention_forward with repeat_kv so you'll have GQA in the decoder as expected. I outlined changes in the vision modules to silo the inheritance a bit, and similar ones might be needed in the text modules. Let me know and ping me when you've done a new pass!
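For reference, a GQA-style eager attention along those lines might look like this (a minimal sketch mirroring the repeat_kv pattern used in the Llama/Qwen3 modeling code in transformers; the num_key_value_groups attribute on the attention module is an assumption about the decoder's module layout):

import torch
from torch import nn

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand (batch, num_kv_heads, seq_len, head_dim) to (batch, num_kv_heads * n_rep, seq_len, head_dim).
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Repeat the K/V heads so their head count matches the query heads (GQA).
    key_states = repeat_kv(key, module.num_key_value_groups)
    value_states = repeat_kv(value, module.num_key_value_groups)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key_states.shape[-2]]

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states).transpose(1, 2).contiguous()
    return attn_output, attn_weights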

@molbap (Contributor) left a comment
Looking much better! Left what I think are my last comments; once those are addressed, we'll ping the core maintainer for final review & merge 🥳

@baptiste-aubertin (Author) commented Oct 31, 2025

Hi, I think all the tests pass now and I rebased the branch :). Please tell me if I need to do anything else.
Thanks a lot for your help!

@molbap (Contributor) commented Oct 31, 2025

run-slow: lightonocr

@github-actions bot
This comment contains run-slow, running the specified jobs:

models: ['models/lightonocr']
quantizations: [] ...

@molbap (Contributor) left a comment

Left my last nit comments + one non-nit comment on testing, since we only check part of the output in the IntegrationTest for now; the rest seems copacetic :) Well done, ccing the core reviewer.

Comment on lines +645 to +608
@property
def vision_model(self):
"""Alias for vision_encoder to match standard composite model naming."""
return self.vision_encoder
Contributor:

ah, I see. No worries, we'll be able to change it in v5 of transformers with our dynamic weights converter

Comment on lines +539 to +508
# Since we use packing, if flash_attention_2 is selected we rely on position_ids
if self.config._attn_implementation == "flash_attention_2":
kwargs["position_ids"] = kwargs["position_ids"].to(hidden_states.device, non_blocking=True)
Contributor:

Couldn't we pass the position_ids anyway?

Author:

I copied this forward from Pixtral, but yes, maybe. Gonna try.

Author:

# Since we use packing, if flash_attention_2 is selected we rely on position_ids

Comment on lines 571 to 579

# Check that the model generated non-empty text
# The exact output depends on the trained model, but it should contain some OCR text
self.assertIsNotNone(decoded_output)
self.assertIsInstance(decoded_output, str)
self.assertGreater(len(decoded_output.strip()), 0, "Model should generate non-empty OCR output")

# Check that the model correctly extracted the date from the receipt
self.assertIn(
"25/12/2018", decoded_output, "Model should extract the date '25/12/2018' from the receipt image"
)
Contributor:

For that model, don't we have a deterministic idea of what its output is? To be more robust to regressions, just compare to the full expected output, no?

Author:

I added an output comparison against the ground truth with difflib for that :)
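A comparison of that kind might look like this (a minimal sketch; the expected_text placeholder and the 0.95 threshold are illustrative assumptions, not the values used in the PR):

import difflib

# Ground-truth transcription the test compares against (placeholder here).
expected_text = "..."

# Character-level similarity ratio in [0, 1] between generated and expected OCR output.
similarity = difflib.SequenceMatcher(None, decoded_output.strip(), expected_text.strip()).ratio()

# Assumed threshold: tolerates minor decoding differences while still catching regressions.
self.assertGreaterEqual(similarity, 0.95, f"OCR output diverged from ground truth (similarity={similarity:.3f})")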

@molbap requested a review from ArthurZucker October 31, 2025 14:10
@molbap (Contributor) commented Oct 31, 2025

There's a couple of errors on the slow tests I just ran:

FAILED tests/models/lightonocr/test_modeling_lightonocr.py::LightOnOCRForConditionalGenerationIntegrationTest::test_model_can_generate_without_images - AttributeError: 'LightOnOCRConfig' object has no attribute 'vocab_size'
FAILED tests/models/lightonocr/test_modeling_lightonocr.py::LightOnOCRForConditionalGenerationIntegrationTest::test_model_forward_with_images - AttributeError: 'LightOnOCRConfig' object has no attribute 'vocab_size'

so it's just a missing config key / wrong attribute access IMO. The other one, which is suspicious, is

    def vision_apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
        """Applies Rotary Position Embedding to the query and key tensors.
        .......
        """
        cos = cos.unsqueeze(unsqueeze_dim)
        sin = sin.unsqueeze(unsqueeze_dim)
>       q_embed = (q * cos) + (vision_rotate_half(q) * sin)
E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

which only shows up in a multi-GPU setting, but it means the q tensor should be moved to a safe device, at least within this util function.

@baptiste-aubertin (Author)

I added a vocab_size getter and put q and k on the same device as cos and sin :)
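The two fixes could look roughly like this (a minimal sketch; the property lives on the composite config and the device move happens inside the rotary helper, but the exact placement in the PR may differ):

# In LightOnOCRConfig: expose vocab_size by delegating to the text sub-config.
@property
def vocab_size(self):
    return self.text_config.vocab_size

# In vision_apply_rotary_pos_emb: keep q/k on the same device as the rotary tables,
# so multi-GPU sharding cannot leave them on different devices.
q = q.to(cos.device)
k = k.to(cos.device)
q_embed = (q * cos) + (vision_rotate_half(q) * sin)
k_embed = (k * cos) + (vision_rotate_half(k) * sin)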

@baptiste-aubertin (Author) commented Nov 5, 2025

Hi @molbap, thanks again for your help! Could you please re-trigger the slow tests to make sure everything passes before the final reviewer steps in? :) Thanks

@zucchini-nlp (Member)

run-slow: lightonocr

@github-actions bot commented Nov 5, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/lightonocr"]
quantizations: []

@github-actions bot commented Nov 5, 2025

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@zucchini-nlp (Member) left a comment

Hey @baptiste-aubertin, while we're waiting on the core maintainer's review, I left a few comments and questions below, mostly about things that look redundant to me. LMK what you think and whether we can delete those?

Comment on lines +194 to +185
# These should be set on the tokenizer before creating the processor
self.image_token = getattr(tokenizer, "image_token", "<|image_pad|>")
self.image_break_token = getattr(tokenizer, "image_break_token", "<|vision_pad|>")
self.image_end_token = getattr(tokenizer, "image_end_token", "<|vision_end|>")
Member:

Can we update the tokenizer on the Hub instead? These were used as fallbacks for BC in old models; for new models, IMO we can change the tokenizer directly.

Comment on lines +415 to +429
@unittest.skip("Pixtral does not support attention interfaces.")
def test_eager_matches_fa2_generate(self):
pass

@unittest.skip("Pixtral does not support attention interfaces.")
def test_eager_matches_sdpa_generate(self):
pass

@unittest.skip("Pixtral does not support attention interfaces.")
def test_flash_attn_2_from_config(self):
pass

@unittest.skip("Pixtral does not support attention interfaces.")
def test_flash_attn_2_inference_equivalence(self):
pass

@unittest.skip("Pixtral does not support attention interfaces.")
def test_flash_attn_2_inference_equivalence_right_padding(self):
pass
Member:

These should be skipped automatically if the model doesn't support it. Do we have supports_flash_attention = False set on the vision encoder?

Author:

I copied this from the Pixtral tests, but yes, I can try this flag instead.
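If the flag route works, the hand-written skips could be replaced by a class attribute on the vision encoder's PreTrainedModel subclass (a sketch using the attribute name the reviewer mentions; the class name is illustrative and, depending on the transformers version, the flag may instead be spelled _supports_flash_attn_2):

class LightOnOCRVisionPreTrainedModel(PreTrainedModel):
    # Declares that this encoder has no flash-attention path, so the common
    # test mixins skip the FA2 equivalence tests without per-test @unittest.skip.
    supports_flash_attention = False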

@baptiste-aubertin (Author)


Hi @zucchini-nlp, thank you! This time I will try to answer quickly 😅

@baptiste-aubertin force-pushed the main branch 2 times, most recently from 5c88a1d to 4c22024 on November 5, 2025 16:43
@molbap (Contributor) commented Nov 6, 2025

run-slow: lightonocr

@github-actions bot commented Nov 6, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/lightonocr"]
quantizations: []

- Add vocab_size property to LightOnOCRConfig that delegates to text_config
- Fix test parameter name from image_token_index to image_token_id
- Add Unpack type hint to processor __call__ kwargs
- Remove unnecessary comments from modeling forward method
…d weights mapping #0

- Remove custom _init_weights methods (handled by base class)
- Update _tied_weights_keys to dict format with explicit mapping
- Update documentation date
…lacement #0

- Use config.text_config.vocab_size instead of config.vocab_size for composite config
- Remove explicit device placement from attention_mask and image_sizes tensors
- Allow device_map='auto' to handle device placement in model parallelism tests
@github-actions bot

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, lightonocr

@baptiste-aubertin (Author)

Hi, I rebased this morning and all the tests are now passing 👍 😄

@eivtho commented Dec 3, 2025

We have had very good results fine-tuning on LightOnOCR, and are looking forward to this being merged so we can use it in production.
