Add LightOnOCR model implementation #41621
base: main
Conversation
zucchini-nlp
left a comment
Hey @baptiste-aubertin, thanks a lot for working on it! Nice PR, and I left comments mostly to push the model toward better standardization. Lmk when you need a second review
molbap
left a comment
Hey @baptiste-aubertin, continued the review! Also, for naming, the model would be better named LightOnOcr rather than ..OCR, but that's a casing nit. For the tests not passing, adding the Text and Vision models to the IGNORE_NON_TESTED list in utils/check_repo should be enough.
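For reference, a sketch of what that could look like in utils/check_repo.py — the exact class names are an assumption based on this PR's naming:

```python
# utils/check_repo.py -- sketch only; class names assumed from this PR's naming.
IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    # ... existing entries ...
    "LightOnOCRTextModel",    # Building part of a bigger (tested) composite model.
    "LightOnOCRVisionModel",  # Building part of a bigger (tested) composite model.
]
```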
I recommend rebasing and running the linter again, from your branch:
git checkout main
git pull
git checkout -
git merge main
pip install -e .[quality]
make fixup
and that should get rid of the code quality issues.
Force-pushed from 0b34715 to 6dd60a9
Hi @molbap @zucchini-nlp! Thanks for the detailed review and the instructions! I went through your comments and I'm still running into test issues. The problem is that modular is trying to reimplement a common attention mechanism for both the Pixtral vision model and the Qwen3 text model, and this is causing some conflicts. Have you dealt with this before? Not sure how to handle that :(
molbap
left a comment
I think you're right, some of the currently failing tests are indeed because the text decoder ends up using eager_attention_forward from the Pixtral module, which uses MHA, but the Qwen3 module needs GQA/repeated key values. Here the repeat should be 2, so you get a factor-2 issue in your matmuls 😬 To circumvent it you can rewrite the needed eager_attention_forward with repeat_kv so you'll have GQA in the decoder as expected. I outlined changes in the vision modules to silo the inheritance a bit, and similar ones might be needed in the text modules. Let me know and ping me when you've done a new pass!
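For context, a sketch of what the GQA-aware rewrite could look like, following the repeat_kv / eager_attention_forward pattern used by the library's decoder models — treat this as illustrative rather than the exact code this PR should contain:

```python
import torch
from torch import nn


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand key/value heads from (batch, num_kv_heads, seq, dim) to (batch, num_kv_heads * n_rep, seq, dim)."""
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Repeat the key/value heads so GQA (e.g. a repeat factor of 2 here) lines up with the query heads.
    key_states = repeat_kv(key, module.num_key_value_groups)
    value_states = repeat_kv(value, module.num_key_value_groups)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key_states.shape[-2]]

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights
```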
molbap
left a comment
Looking much better! Left what I think are my last comments, once addressed we'll ping core maintainer for final review & merge 🥳
Force-pushed from 4bdd305 to ee8f9fc
Hi, I think all the tests pass and I rebased the branch :). Please tell me if I need to do anything else.
run-slow: lightonocr
This comment contains run-slow, running the specified jobs: models: ['models/lightonocr']
molbap
left a comment
Left what I think are my last nit comments + a non-nit comment on testing, as we only test part of the output for now in the IntegrationTest; the rest seems copacetic :) Well done, ccing the core reviewer
@property
def vision_model(self):
    """Alias for vision_encoder to match standard composite model naming."""
    return self.vision_encoder
ah, I see. No worries, we'll be able to change it in v5 of transformers with our dynamic weights converter
# Since we use packing, if flash_attention_2 is selected we rely on position_ids
if self.config._attn_implementation == "flash_attention_2":
    kwargs["position_ids"] = kwargs["position_ids"].to(hidden_states.device, non_blocking=True)
couldn't we pass the position ids anyway?
I copied this forward from Pixtral but yes maybe. Gonna try
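If that pans out, the change could be as small as dropping the flash-attention guard — a sketch, assuming position_ids is always present in kwargs at that point:

```python
# Sketch: forward position_ids regardless of the attention implementation,
# instead of gating on flash_attention_2 (assumes kwargs carries position_ids here).
if kwargs.get("position_ids") is not None:
    kwargs["position_ids"] = kwargs["position_ids"].to(hidden_states.device, non_blocking=True)
```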
# Since we use packing, if flash_attention_2 is selected we rely on position_ids
# Check that the model generated non-empty text
# The exact output depends on the trained model, but it should contain some OCR text
self.assertIsNotNone(decoded_output)
self.assertIsInstance(decoded_output, str)
self.assertGreater(len(decoded_output.strip()), 0, "Model should generate non-empty OCR output")

# Check that the model correctly extracted the date from the receipt
self.assertIn(
    "25/12/2018", decoded_output, "Model should extract the date '25/12/2018' from the receipt image"
)
For that model, don't we have a deterministic idea of what its output is? To be more robust to regressions, just compare to the full expected output, no?
I added an output comparison against the ground truth using difflib for that :)
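A sketch of what a difflib-based check could look like inside the integration test — decoded_output and EXPECTED_OCR_TEXT are illustrative names, and the 0.95 threshold is an assumption:

```python
import difflib

# Compare the generated transcription against a full expected transcription and
# require a high similarity ratio instead of an exact string match.
similarity = difflib.SequenceMatcher(None, decoded_output, EXPECTED_OCR_TEXT).ratio()
self.assertGreaterEqual(
    similarity,
    0.95,
    f"OCR output diverged from the expected transcription (similarity={similarity:.3f})",
)
```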
There's a couple of errors:

FAILED tests/models/lightonocr/test_modeling_lightonocr.py::LightOnOCRForConditionalGenerationIntegrationTest::test_model_can_generate_without_images - AttributeError: 'LightOnOCRConfig' object has no attribute 'vocab_size'
FAILED tests/models/lightonocr/test_modeling_lightonocr.py::LightOnOCRForConditionalGenerationIntegrationTest::test_model_forward_with_images - AttributeError: 'LightOnOCRConfig' object has no attribute 'vocab_size'

so just a config key missing / a wrong access IMO. The other one, which is suspicious, is

def vision_apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.
    .......
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
>   q_embed = (q * cos) + (vision_rotate_half(q) * sin)
E   RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

which is on a multi-GPU setting only, but means that the q tensor should be created on a safer device, or at least moved within this util function to be safe.
I added a vocab_size getter and put q and k on the same device as cos and sin :)
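A sketch of the two fixes as described, assuming the class and helper names from the error messages above; the exact code in the PR may differ:

```python
import torch
from transformers import PretrainedConfig


class LightOnOCRConfig(PretrainedConfig):  # simplified; only the added property is shown
    model_type = "lightonocr"

    @property
    def vocab_size(self):
        # Delegate to the text sub-config so `config.vocab_size` resolves on the composite config.
        return self.text_config.vocab_size


def vision_rotate_half(x):
    """Standard RoPE helper: rotate half of the last dimension."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def vision_apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Apply rotary embeddings, moving q/k onto the device of cos/sin so a model
    sharded across several GPUs never mixes devices inside this helper."""
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q = q.to(cos.device)
    k = k.to(cos.device)
    q_embed = (q * cos) + (vision_rotate_half(q) * sin)
    k_embed = (k * cos) + (vision_rotate_half(k) * sin)
    return q_embed, k_embed
```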
Hi @molbap, thanks again for your help! Could you please re-trigger the slow tests to make sure everything passes before the final reviewer steps in :)? Thanks
run-slow: lightonocr
This comment contains models: ["models/lightonocr"]
CI Results: ✅ No failing test specific to this PR 🎉!
zucchini-nlp
left a comment
Hey @baptiste-aubertin, while we're waiting on the core maintainer's review, I left a few comments and questions below, mostly about things that look redundant to me. LMK what you think and whether we can delete those?
# These should be set on the tokenizer before creating the processor
self.image_token = getattr(tokenizer, "image_token", "<|image_pad|>")
self.image_break_token = getattr(tokenizer, "image_break_token", "<|vision_pad|>")
self.image_end_token = getattr(tokenizer, "image_end_token", "<|vision_end|>")
Can we update the tokenizer on the Hub instead? These were used as fallbacks for BC in old models; for new models IMO we can directly change the tokenizer.
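A sketch of the direct-access version, assuming the tokenizer on the Hub is updated to define these attributes:

```python
# Processor __init__ sketch: read the special tokens straight from the tokenizer
# (requires the Hub tokenizer to carry these attributes, per the suggestion above).
self.image_token = tokenizer.image_token
self.image_break_token = tokenizer.image_break_token
self.image_end_token = tokenizer.image_end_token
```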
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_eager_matches_fa2_generate(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_eager_matches_sdpa_generate(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_flash_attn_2_from_config(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_flash_attn_2_inference_equivalence(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Pixtral does not support attention interfaces.") | ||
| def test_flash_attn_2_inference_equivalence_right_padding(self): | ||
| pass |
These should be skipped automatically if the model doesn't support it. Do we have supports_flash_attention = False set in the vision encoder?
I copied this from the Pixtral tests. But yes, I can try this flag instead.
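A sketch of the flag-based approach: the attribute name supports_flash_attention is taken verbatim from the comment above and the class name is assumed from this PR's naming, so the flag the common tests actually read may be spelled differently.

```python
from transformers import PreTrainedModel


class LightOnOCRVisionModel(PreTrainedModel):  # class name assumed from this PR's naming
    # Mark the vision encoder as not supporting flash attention, per the suggestion above,
    # instead of listing manual @unittest.skip overrides in the test file.
    supports_flash_attention = False
```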
Hi @zucchini-nlp, thank you! This time I will try to answer quickly 😅
Force-pushed from 5c88a1d to 4c22024
run-slow: lightonocr
This comment contains models: ["models/lightonocr"]
- Add vocab_size property to LightOnOCRConfig that delegates to text_config
- Fix test parameter name from image_token_index to image_token_id
- Add Unpack type hint to processor __call__ kwargs
- Remove unnecessary comments from modeling forward method

…d weights mapping #0
- Remove custom _init_weights methods (handled by base class)
- Update _tied_weights_keys to dict format with explicit mapping
- Update documentation date

…lacement #0
- Use config.text_config.vocab_size instead of config.vocab_size for composite config
- Remove explicit device placement from attention_mask and image_sizes tensors
- Allow device_map='auto' to handle device placement in model parallelism tests
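For the tied-weights change, a sketch of the dict-style mapping; the exact parameter paths and the mapping direction are assumptions based on how other composite VLMs in the library tie the LM head to the embedding:

```python
# Sketch only: dict-style tied-weights mapping, pointing the tied lm_head weight at the
# embedding weight it shares parameters with. Paths are illustrative, not from this PR.
_tied_weights_keys = {"lm_head.weight": "model.language_model.embed_tokens.weight"}
```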
Force-pushed from 70bbb89 to 1f91d31
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, lightonocr
Hi, I rebased this morning and all the tests are now passing 👍 😄
We have had very good results fine-tuning on LightOnOCR, and are looking forward to this being merged so we can use it in production. |
What does this PR do?
Implementation of the LightOnOCR model following the Modular Transformers architecture.
Our model is a 1B parameter OCR model using Pixtral as the vision encoder and Qwen3 as the LLM decoder.
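Not part of the original description, but a minimal usage sketch under assumptions: the checkpoint id, prompt format, chat-template availability, and auto classes below are illustrative and may not match the final release.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical checkpoint id; the actual Hub repository may differ.
checkpoint = "lightonai/LightOnOCR-1B"

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Single-image OCR request via the chat template (assumes the processor ships one).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("document.png")},
            {"type": "text", "text": "Transcribe this document."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```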
I still have an issue with auto configuration, which I'm not familiar with:
🚨 Config not found for lightonocr. You can manually add it to HARDCODED_CONFIG_FOR_MODELS in utils/auto_docstring.py
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.