fmt : handle invalid UTF-8 input by replacing malformed sequences #9329

mattsu2020 · 2025-11-19T08:47:55Z

Description

This PR enhances the fmt utility to robustly handle input containing invalid UTF-8 sequences.

Previously, fmt relied on BufRead::lines(), which returns an error for lines containing invalid UTF-8, causing those lines to be silently dropped or iteration to stop prematurely.

This change implements manual line reading using read_until and converts the buffer to a string using String::from_utf8_lossy. This ensures that malformed sequences are replaced with the Unicode replacement character (U+FFFD) and the line is processed instead of being discarded.
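A minimal sketch of the difference described above, not the code from this PR: `read_lines_lossy` and `count_line_errors` are hypothetical helper names, and only a trailing `\n` is stripped. It relies solely on the standard library calls `BufRead::lines`, `BufRead::read_until`, and `String::from_utf8_lossy`.

```rust
use std::io::{self, BufRead};

/// Reads `reader` line by line with `read_until`, strips the trailing
/// newline, and converts each line lossily, so malformed sequences become
/// U+FFFD instead of producing an error.
/// Hypothetical helper for illustration; not the code from this PR.
fn read_lines_lossy<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // `read_until` works on raw bytes and never validates UTF-8.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        lines.push(String::from_utf8_lossy(&buf).into_owned());
    }
    Ok(lines)
}

/// Counts how many items `BufRead::lines()` reports as errors.
/// Intended for in-memory input, where an invalid line is the only
/// possible source of an error.
fn count_line_errors<R: BufRead>(reader: R) -> usize {
    reader.lines().filter(|line| line.is_err()).count()
}
```

Calling both helpers on the bytes `b"ok\n=\xA0=\nlast\n"` shows the contrast: `lines()` reports an error for the middle line, while the `read_until` loop returns all three lines with the stray byte replaced.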

mattsu2020 · 2025-11-19T08:57:10Z

Fixes #9127 (fmt non-space).

github-actions · 2025-11-19T09:07:11Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

cakebaker · 2025-11-19T10:49:41Z

src/uu/fmt/src/parasplit.rs

+                buf.pop();
+            }
+        }
+        let n = String::from_utf8_lossy(&buf).into_owned();


I think using from_utf8_lossy is incorrect.

If you look at the output of GNU fmt, you will see that they don't do a lossy conversion:

$ printf "=\xA0=" | fmt -s -w1 | hexdump -X
0000000 3d a0 3d 0a
0000004

And our output:

$ printf "=\xA0=" | cargo run -q fmt -s -w1 | hexdump -X
0000000 3d ef bf bd 3d 0a
0000006
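The byte difference in the two dumps can be reproduced with the standard library alone. This is an illustrative sketch; `hex` and `lossy_round_trip` are hypothetical helper names, not functions from fmt.

```rust
/// Formats bytes as space-separated lowercase hex, similar to the byte
/// columns in the dumps above.
fn hex(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Round-trips the input through `String::from_utf8_lossy`, which replaces
/// each malformed sequence with U+FFFD (encoded as EF BF BD).
fn lossy_round_trip(input: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(input).into_owned().into_bytes()
}
```

For the input `b"=\xA0="`, `hex` on the raw bytes gives `3d a0 3d`, while `hex` on the lossy round trip gives `3d ef bf bd 3d`, which is why passing the bytes through untouched is needed to match GNU fmt.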

- Changed the `indent_str` field in `BreakArgs` to `indent: &[u8]` to avoid repeated UTF-8 conversions.
- Updated `write_all` calls to pass `&s` instead of `s.as_bytes()` in fmt.rs, with similar string/byte slicing changes in linebreak.rs.
- Modified method signatures in parasplit.rs to accept `&[u8]` instead of `&str` for prefix matching, ensuring consistent byte-level operations without assuming valid UTF-8.
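The byte-level approach in this commit message can be sketched as follows. The function names are hypothetical and are not taken from parasplit.rs or linebreak.rs; the point is only that prefix matching and output both work on `&[u8]` without any UTF-8 validation.

```rust
use std::io::{self, Write};

/// Returns the rest of `line` if it starts with `prefix`, comparing raw bytes.
/// Hypothetical helper; not a signature from parasplit.rs.
fn strip_byte_prefix<'a>(line: &'a [u8], prefix: &[u8]) -> Option<&'a [u8]> {
    line.strip_prefix(prefix)
}

/// Writes `indent`, then `line`, then a newline, with no UTF-8 conversion,
/// so invalid bytes reach the output unchanged.
fn write_indented<W: Write>(out: &mut W, indent: &[u8], line: &[u8]) -> io::Result<()> {
    out.write_all(indent)?;
    out.write_all(line)?;
    out.write_all(b"\n")
}
```

For example, stripping the prefix `b"> "` from `b"> =\xA0="` and writing the remainder with a two-space indent into a `Vec<u8>` leaves the `\xA0` byte intact in the output.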

- Updated indentation calculation in FileLines to use is_some_and for tab and character checks, avoiding unnecessary computations and improving code flow.
- Changed punctuation checks in the WordSplit iterator to use is_some_and for cleaner, more idiomatic Rust code.
- This refactor enhances readability and leverages short-circuiting behavior.
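As a rough illustration of the `Option::is_some_and` pattern this commit message refers to (stable since Rust 1.70), the two functions below are hypothetical stand-ins and do not reproduce the actual checks in FileLines or WordSplit.

```rust
/// Reports whether the first character of `word` is whitespace.
/// An empty string yields `None`, so `is_some_and` returns false without
/// calling the predicate.
fn starts_with_whitespace(word: &str) -> bool {
    word.chars().next().is_some_and(char::is_whitespace)
}

/// Reports whether `word` ends with '.', '!' or '?'.
/// The punctuation set is an assumption made for this sketch.
fn ends_sentence(word: &str) -> bool {
    word.chars()
        .next_back()
        .is_some_and(|c| matches!(c, '.' | '!' | '?'))
}
```

Compared with `map(...).unwrap_or(false)` or a `match`, this keeps the check on a single expression and skips the predicate entirely when the option is `None`.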

codspeed-hq · 2025-11-19T11:47:33Z

CodSpeed Performance Report

Merging #9329 will not alter performance

Comparing mattsu2020:fmt_compatibility (6a313a4) with main (b0f41e7)

Summary

✅ 126 untouched
⏩ 6 skipped¹

¹ 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

github-actions · 2025-11-19T11:53:19Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

…hrough

Updated test_fmt_invalid_utf8 to expect raw byte (\xA0) passthrough instead of the replacement character (\u{FFFD}) for invalid UTF-8 input, ensuring GNU-compatible behavior in fmt. This fixes the test expectation to match the actual output and avoids lossy conversion.

github-actions · 2025-11-19T12:15:24Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

mattsu2020 added 3 commits November 19, 2025 16:02

test: add word joiner and cyrillic kha character tests for fmt

073f7fc

feat: Enhance fmt to handle invalid UTF-8 input by replacing malfor…

36a01a1

…med sequences instead of dropping lines.

chore: add FFFD to spell-checker ignore list in fmt test.

2c617d4

cakebaker reviewed Nov 19, 2025

mattsu2020 added 3 commits November 19, 2025 20:16

style(fmt): compact whitespace check in WordSplit iterator to single …

c59f1bc

…line Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.
