Skip to content

Conversation

@mzegla
Copy link
Collaborator

@mzegla mzegla commented Nov 24, 2025

Description

Add UTF8 validate with replace mode = true in GGUF detokenizer conversion

CVS-176839

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings November 24, 2025 12:02
@github-actions github-actions bot added the category: GGUF GGUF file reader label Nov 24, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds UTF-8 validation with replacement mode to the GGUF detokenizer conversion pipeline to handle invalid UTF-8 sequences gracefully during token-to-string decoding.

Key Changes:

  • Introduces a UTF8Validate operation between FuzeRagged and StringTensorPack operations in the detokenizer model construction
  • Configures the UTF8Validate operation with replace_mode=true to replace invalid UTF-8 sequences rather than failing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

auto packed_output = create_func("StringTensorPack", outputs_fused_ragged, {});
ov::OutputVector inputs_for_utf8_validate(outputs_fused_ragged.begin(), outputs_fused_ragged.end());
auto outputs_utf8_validate = create_func("UTF8Validate", inputs_for_utf8_validate, {{"replace_mode", true}});
outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line appears to insert the last element of outputs_fused_ragged into outputs_utf8_validate, but outputs_utf8_validate already contains all elements from outputs_fused_ragged (copied on line 578). This pattern duplicates the last element. If the intent is to append additional outputs (like the original line 577), verify that outputs_fused_ragged contains the correct elements to insert, or if this duplication is intentional for the UTF8Validate operation.

Suggested change
outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the insert is not really necessary, indeed.

@mzegla mzegla changed the title utf8 validate and replace invalid in gguf output decoding UTF8 validate and replace invalid in GGUF output decoding Nov 24, 2025
@github-actions github-actions bot added the category: tokenizers Tokenizer class or submodule update label Nov 24, 2025
@mzegla mzegla marked this pull request as ready for review November 24, 2025 15:50
Copilot AI review requested due to automatic review settings November 24, 2025 15:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

auto packed_output = create_func("StringTensorPack", outputs_fused_ragged, {});
ov::OutputVector inputs_for_utf8_validate(outputs_fused_ragged.begin(), outputs_fused_ragged.end());
auto outputs_utf8_validate = create_func("UTF8Validate", inputs_for_utf8_validate, {{"replace_mode", true}});
outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line appends the last element of outputs_fused_ragged to outputs_utf8_validate, but outputs_utf8_validate already contains all elements from outputs_fused_ragged (copied in line 578). This results in duplicating the last element. Consider removing this line or clarifying the intended behavior if partial duplication is needed.

Suggested change
outputs_utf8_validate.insert(outputs_utf8_validate.end(), outputs_fused_ragged.end() - 1, outputs_fused_ragged.end());

Copilot uses AI. Check for mistakes.
@apaniukov apaniukov enabled auto-merge November 24, 2025 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

category: GGUF GGUF file reader category: tokenizers Tokenizer class or submodule update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants