UTF8 validate and replace invalid in GGUF output decoding #3062
base: master
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -575,7 +575,10 @@ create_tokenizer_from_config(const std::shared_ptr<void>& shared_object_ov_token` |
| 575 | 575 | `  ov::OutputVector inputs_for_fused_ragged(detokenizer_outputs.begin(), detokenizer_outputs.end() - 1);` |
| 576 | 576 | `  auto outputs_fused_ragged = create_func("FuzeRagged", inputs_for_fused_ragged, {});` |
| 577 | 577 | `  outputs_fused_ragged.insert(outputs_fused_ragged.end(), detokenizer_outputs.end() - 1, detokenizer_outputs.end());` |
| 578 | | `- auto packed_output = create_func("StringTensorPack", outputs_fused_ragged, {});` |
| | 578 | `+ ov::OutputVector inputs_for_utf8_validate(outputs_fused_ragged.begin(), outputs_fused_ragged.end());` |
| | 579 | `+ auto outputs_utf8_validate = create_func("UTF8Validate", inputs_for_utf8_validate, {{"replace_mode", true}});` |
| | 580 | `+ outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());` |

| Lines changed | File |
|---|---|
| +2 −1 | .gitignore |
| +9 −69 | README.md |
| +3 −0 | benchmark/.gitignore |
| +1 −0 | python/openvino_tokenizers/hf_parser.py |
| +2 −0 | src/tokenizers_factory.cpp |
| +1 −1 | tests/pass_rates.json |
| +269 −3,907 | tests/stats.json |
| +1 −12 | tests/tokenizers_test.py |

This line appears to append the last element of `outputs_fused_ragged` to `outputs_utf8_validate`. However, `outputs_utf8_validate` is produced from `inputs_for_utf8_validate`, which already contains every element of `outputs_fused_ragged` (copied on line 578), so this pattern looks like it duplicates the last element. If the intent is to append additional outputs (as the original line 577 does), verify that `outputs_fused_ragged` contains the correct elements to insert; otherwise, confirm that the duplication is intentional for the `UTF8Validate` operation.

It seems that the insert is not really necessary, indeed.
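
The duplication discussed in this thread can be reproduced with a small stand-alone sketch. It is not the project's code: `FakeOutputs` stands in for `ov::OutputVector`, and `fake_validate` is a hypothetical placeholder for `create_func("UTF8Validate", ...)` that is assumed to return one output per input (the real operation's output count is not shown in the diff). The labels `"begins"`, `"ends"`, and `"chars"` are also illustrative assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for ov::OutputVector: each string just labels one output.
using FakeOutputs = std::vector<std::string>;

// Hypothetical placeholder for create_func("UTF8Validate", ...).
// ASSUMPTION: one output per input; the real arity is not confirmed here.
FakeOutputs fake_validate(const FakeOutputs& inputs) {
    return inputs;
}

// Mirrors diff lines 578-580. `extra_insert` toggles the questioned line 580.
FakeOutputs build_pack_inputs(const FakeOutputs& fused, bool extra_insert) {
    assert(!fused.empty());  // fused.end() - 1 is only valid for a non-empty vector

    // Line 578: copy every element, including the last one.
    FakeOutputs validate_inputs(fused.begin(), fused.end());

    // Line 579: run the (placeholder) validation step.
    FakeOutputs validated = fake_validate(validate_inputs);

    // Line 580: append the last element of `fused` a second time.
    if (extra_insert) {
        validated.insert(validated.end(), fused.end() - 1, fused.end());
    }
    return validated;
}
```

Under these assumptions, calling `build_pack_inputs` with three inputs and `extra_insert = true` yields four entries, with the last input appearing twice, while `extra_insert = false` yields the original three. That is consistent with the conclusion above that the insert is not needed.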