Skip to content

Conversation

@mattsu2020
Copy link
Contributor

Description

This PR enhances the fmt utility to robustly handle input containing invalid UTF-8 sequences.

Previously, fmt relied on BufRead::lines(), which returns an error for lines containing invalid UTF-8, causing those lines to be silently dropped or iteration to stop prematurely.

This change implements manual line reading using read_until and converts the buffer to a string using String::from_utf8_lossy. This ensures that malformed sequences are replaced with the Unicode replacement character (U+FFFD) and the line is processed instead of being discarded.

@mattsu2020
Copy link
Contributor Author

fix
fmt non-space
#9127

@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

buf.pop();
}
}
let n = String::from_utf8_lossy(&buf).into_owned();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using from_utf8_lossy is incorrect.

If you look at the output of GNU fmt, you will see that they don't do a lossy conversion:

$ printf "=\xA0=" | fmt -s -w1 | hexdump -X
0000000  3d  a0  3d  0a                                                
0000004

And our output:

 printf "=\xA0=" | cargo run -q fmt -s -w1 | hexdump -X
0000000  3d  ef  bf  bd  3d  0a                                        
0000006

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix it

- Changed `indent_str` field in `BreakArgs` to `indent: &[u8]` to avoid repeated UTF-8 conversions.
- Updated `write_all` calls to pass `&s` instead of `s.as_bytes()` in fmt.rs and similar string/byteslicing in linebreak.rs.
- Modified method signatures in parasplit.rs to accept `&[u8]` instead of `&str` for prefix matching, ensuring consistent byte-level operations without assuming valid UTF-8.
- Updated indentation calculation in FileLines to use is_some_and for tab and character checks,
  avoiding unnecessary computations and improving code flow.
- Changed punctuation checks in WordSplit iterator to use is_some_and for cleaner,
  more idiomatic Rust code.
- This refactor enhances readability and leverages short-circuiting behavior.
…line

Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.
@codspeed-hq
Copy link

codspeed-hq bot commented Nov 19, 2025

CodSpeed Performance Report

Merging #9329 will not alter performance

Comparing mattsu2020:fmt_compatibility (6a313a4) with main (b0f41e7)

Summary

✅ 126 untouched
⏩ 6 skipped1

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

…hrough

Updated test_fmt_invalid_utf8 to expect raw byte (\xA0) passthrough instead of replacement character (\u{FFFD}) for invalid UTF-8 input, ensuring GNU-compatible behavior in fmt. This fixes the test expectation to match actual output, avoiding lossy conversion.
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants