UTF8 validate and replace invalid in GGUF output decoding #3062

mzegla · 2025-11-24T12:02:09Z

Description

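Add a UTF8Validate step with replace_mode = true to the GGUF detokenizer conversion, so that invalid UTF-8 sequences in decoded output are replaced rather than passed through.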

Checklist:

Tests have been updated or added to cover the new code.
This patch fully addresses the ticket.
I have made corresponding changes to the documentation.

Copilot

Pull request overview

This PR adds UTF-8 validation with replacement mode to the GGUF detokenizer conversion pipeline to handle invalid UTF-8 sequences gracefully during token-to-string decoding.

Key Changes:

Introduces a UTF8Validate operation between FuzeRagged and StringTensorPack operations in the detokenizer model construction
Configures the UTF8Validate operation with replace_mode=true to replace invalid UTF-8 sequences rather than failing
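To illustrate what "replace rather than fail" means for decoded bytes, here is a minimal standalone sketch. It is not the UTF8Validate operation itself: the function names `valid_sequence_length` and `replace_invalid_utf8` are hypothetical, and the choice to emit one U+FFFD per offending byte is an assumption made for simplicity. The real operation may group invalid bytes differently.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not valid UTF-8.
static std::size_t valid_sequence_length(const std::string& s, std::size_t i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;  // ASCII

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
    else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // rejects overlong forms
    else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
    else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // rejects UTF-16 surrogates
    else if (b0 >= 0xEE && b0 <= 0xEF) { len = 3; }
    else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // rejects overlong forms
    else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
    else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // rejects > U+10FFFF
    else return 0;  // stray continuation byte or invalid lead byte

    if (i + len > s.size()) return 0;  // sequence is cut off at end of input
    if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
    }
    return len;
}

// Hypothetical sanitizer: copies valid sequences unchanged and substitutes
// U+FFFD (encoded as EF BF BD) for every byte that cannot start a valid one.
std::string replace_invalid_utf8(const std::string& input) {
    static const std::string replacement = "\xEF\xBF\xBD";
    std::string out;
    out.reserve(input.size());
    std::size_t i = 0;
    while (i < input.size()) {
        const std::size_t len = valid_sequence_length(input, i);
        if (len == 0) {
            out += replacement;  // invalid byte: replace and resynchronize
            ++i;
        } else {
            out.append(input, i, len);
            i += len;
        }
    }
    return out;
}
```

This matters for token-by-token decoding because a multi-byte character can be split across tokens, so a partial decode may end in a truncated sequence. With replacement, the caller still receives a valid UTF-8 string.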


Copilot · 2025-11-24T12:02:31Z

src/cpp/src/gguf_utils/gguf_tokenizer.cpp

-    auto packed_output = create_func("StringTensorPack", outputs_fused_ragged, {});
+    ov::OutputVector inputs_for_utf8_validate(outputs_fused_ragged.begin(), outputs_fused_ragged.end());
+    auto outputs_utf8_validate = create_func("UTF8Validate", inputs_for_utf8_validate, {{"replace_mode", true}});
+    outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());


This line appears to insert the last element of outputs_fused_ragged into outputs_utf8_validate, but outputs_utf8_validate already contains all elements from outputs_fused_ragged (copied on line 578). This pattern duplicates the last element. If the intent is to append additional outputs (like the original line 577), verify that outputs_fused_ragged contains the correct elements to insert, or if this duplication is intentional for the UTF8Validate operation.

Suggested change

-    outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());

apaniukov · 2025-11-25

It seems that the insert is not really necessary, indeed.
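For readers unfamiliar with the iterator pattern under discussion, the sketch below shows what `insert(end(), src.end() - 1, src.end())` does on a plain `std::vector`. The function `append_last` and the string values are hypothetical stand-ins. The sketch says nothing about what `create_func("UTF8Validate", ...)` returns, which this excerpt does not show, so whether the appended element is a duplicate in the real code depends on that return value.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the pattern in the diff: appends a copy of the
// last element of `source` to `target` via insert(end, last - 1, last).
std::vector<std::string> append_last(std::vector<std::string> target,
                                     const std::vector<std::string>& source) {
    if (!source.empty()) {  // source.end() - 1 is only valid when non-empty
        target.insert(target.end(), source.end() - 1, source.end());
    }
    return target;
}
```

If `target` already ends with the same element as `source`, the result carries that element twice, which is the situation the review comment warns about.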

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


Copilot · 2025-11-24T15:51:18Z

src/cpp/src/gguf_utils/gguf_tokenizer.cpp

-    auto packed_output = create_func("StringTensorPack", outputs_fused_ragged, {});
+    ov::OutputVector inputs_for_utf8_validate(outputs_fused_ragged.begin(), outputs_fused_ragged.end());
+    auto outputs_utf8_validate = create_func("UTF8Validate", inputs_for_utf8_validate, {{"replace_mode", true}});
+    outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());


This line appends the last element of outputs_fused_ragged to outputs_utf8_validate, but outputs_utf8_validate already contains all elements from outputs_fused_ragged (copied in line 578). This results in duplicating the last element. Consider removing this line or clarifying the intended behavior if partial duplication is needed.

Suggested change

-    outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());

utf8 validate and replace invalid in gguf output decoding

356d545

Copilot AI review requested due to automatic review settings November 24, 2025 12:02

github-actions bot added the category: GGUF GGUF file reader label Nov 24, 2025

Copilot AI reviewed Nov 24, 2025


mzegla changed the title ~~utf8 validate and replace invalid in gguf output decoding~~ UTF8 validate and replace invalid in GGUF output decoding Nov 24, 2025

update tokenizers

82994e2

github-actions bot added the category: tokenizers Tokenizer class or submodule update label Nov 24, 2025

mzegla marked this pull request as ready for review November 24, 2025 15:50

Copilot AI review requested due to automatic review settings November 24, 2025 15:50

Copilot AI reviewed Nov 24, 2025


apaniukov approved these changes Nov 24, 2025


apaniukov enabled auto-merge November 24, 2025 16:08

Merge branch 'master' into gguf_utf8_validate

9f5e7f2


apaniukov replied to the review comment Nov 25, 2025
