-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
fmt : handle invalid UTF-8 input by replacing malformed sequences #9329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
fix |
|
GNU testsuite comparison: |
src/uu/fmt/src/parasplit.rs
Outdated
| buf.pop(); | ||
| } | ||
| } | ||
| let n = String::from_utf8_lossy(&buf).into_owned(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using from_utf8_lossy is incorrect.
If you look at the output of GNU fmt, you will see that they don't do a lossy conversion:
$ printf "=\xA0=" | fmt -s -w1 | hexdump -X
0000000 3d a0 3d 0a
0000004
And our output:
printf "=\xA0=" | cargo run -q fmt -s -w1 | hexdump -X
0000000 3d ef bf bd 3d 0a
0000006
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix it
- Changed `indent_str` field in `BreakArgs` to `indent: &[u8]` to avoid repeated UTF-8 conversions. - Updated `write_all` calls to pass `&s` instead of `s.as_bytes()` in fmt.rs and similar string/byteslicing in linebreak.rs. - Modified method signatures in parasplit.rs to accept `&[u8]` instead of `&str` for prefix matching, ensuring consistent byte-level operations without assuming valid UTF-8.
- Updated indentation calculation in FileLines to use is_some_and for tab and character checks, avoiding unnecessary computations and improving code flow. - Changed punctuation checks in WordSplit iterator to use is_some_and for cleaner, more idiomatic Rust code. - This refactor enhances readability and leverages short-circuiting behavior.
…line Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.
CodSpeed Performance ReportMerging #9329 will not alter performanceComparing Summary
Footnotes
|
|
GNU testsuite comparison: |
…hrough
Updated test_fmt_invalid_utf8 to expect raw byte (\xA0) passthrough instead of replacement character (\u{FFFD}) for invalid UTF-8 input, ensuring GNU-compatible behavior in fmt. This fixes the test expectation to match actual output, avoiding lossy conversion.
|
GNU testsuite comparison: |
Description
This PR enhances the fmt utility to robustly handle input containing invalid UTF-8 sequences.
Previously, fmt relied on BufRead::lines(), which returns an error for lines containing invalid UTF-8, causing those lines to be silently dropped or iteration to stop prematurely.
This change implements manual line reading using read_until and converts the buffer to a string using String::from_utf8_lossy. This ensures that malformed sequences are replaced with the Unicode replacement character (U+FFFD) and the line is processed instead of being discarded.