shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585

blyxxyz · 2025-03-26T18:37:22Z

Add a --random-seed option that allows using a string as a seed for deterministic output. For example: shuf --random-seed=🦀 -i1-10.

This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever --random-seed is passed. Details and motivation:

coreutils/src/uu/shuf/src/random_seed.rs

Lines 12 to 61 in e89aaf8

    
           /// Reproducible seeded random number generation. 
        
           /// 
        
           /// The behavior should stay the same between releases, so don't change it without 
        
           /// a very good reason. 
        
           /// 
        
           /// # How it works 
        
           /// 
        
           /// - Take a Unicode string as the seed. 
        
           /// 
        
           /// - Encode this seed as UTF-8. 
        
           /// 
        
           /// - Take the SHA3-256 hash of the encoded seed. 
        
           /// 
        
           /// - Use that hash as the input for a [`rand_chacha`] ChaCha12 RNG. 
        
           ///   (We don't touch the nonce, so that's probably zero.) 
        
           /// 
        
           /// - Take 64-bit samples from the RNG. 
        
           /// 
        
           /// - Use Lemire's method to generate uniformly distributed integers and: 
        
           /// 
        
           ///   - With --repeat, use these to pick elements from ranges. 
        
           /// 
        
           ///   - Without --repeat, use these to do left-to-right modern Fisher-Yates. 
        
           /// 
        
           /// # Why it works like this 
        
           /// 
        
           /// - Unicode string: Greatest common denominator between platforms. Windows doesn't 
        
           ///   let you pass raw bytes as a CLI argument and that would be bad practice anyway. 
        
           ///   A decimal or hex number would work but this is much more flexible without being 
        
           ///   unmanageable. 
        
           /// 
        
           ///   (Footgun: if the user passes a filename we won't read from the file but the 
        
           ///   command will run anyway.) 
        
           /// 
        
           /// - UTF-8: That's what Rust likes and it's the least unreasonable Unicode encoding. 
        
           /// 
        
           /// - SHA3-256: We want to make good use of the entire user input and SHA-3 is 
        
           ///   state of the art. ChaCha12 takes a 256-bit seed. 
        
           /// 
        
           /// - ChaCha12: [`rand`]'s default rng as of writing. Seems state of the art. 
        
           /// 
        
           /// - 64-bit samples: We could often get away with 32-bit samples but let's keep things 
        
           ///   simple and only use one width. (There doesn't seem to be much of a performance hit.) 
        
           /// 
        
           /// - Lemire, Fisher-Yates: These are very easy to implement and maintain ourselves. 
        
           ///   `rand` provides fancier implementations but only promises reproducibility within 
        
           ///   patch releases: https://rust-random.github.io/book/crate-reprod.html 
        
           /// 
        
           ///   Strictly speaking even `ChaCha12` is subject to breakage. But since it's a very 
        
           ///   specific algorithm I assume it's safe in practice.

Resolves Implementation of shuf --random-source #2528.

I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:

--random-seed <STRING> seed with STRING for reproducible output

Make the behavior of --random-source mostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a #[ignore] test.)
Use u64 instead of usize for --input-range and --head-count. It's now possible to generate integers above u32::MAX on 32-bit systems.
Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to seq | shuf. (The performance is practically the same.)
Handle --random-source I/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.
Flush the BufWriter at the end. If we don't do this then write errors get ignored.
Use a larger write buffer (8K → 64K).
Use the itoa crate to format integers. With --repeat and --input-range this was a bottleneck.

I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work. shuf -r -n10000000 -i0-10 now runs twice as fast though.

This is important since the output is buffered and errors may end up ignored otherwise. `shuf -e a b c > /dev/full` now errors while it didn't before.

…rrors

This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.

When the --random-source option is used uutils shuf now gives identical output to GNU shuf in many (but not all) cases. This is helpful to users who use it to get deterministic output, e.g. by combining it with `openssl` as suggested in the GNU info pages. I reverse engineered the algorithm from GNU shuf's output. There may be bugs. Other modes of shuffling still use `rand`'s `ThreadRng`, though they now sample a uniform distribution directly without going through the slice helper trait. Additionally, switch from `usize` to `u64` for `--input-range` and `--head-count`. This way the same range of numbers can be generated on 32-bit platforms as on 64-bit platforms.

This adds a new option to get reproducible output from a seed. This was already possible with --random-source, but doing that properly was tricky and had poor performance. Adding this option implies a commitment to keep using the exact same algorithms in the future. For that reason we only use third-party libraries for well-known algorithms and implement our own distributions on top of that. ----- As a teenager on King's Day I once used `shuf` for divination. People paid €0.50 to enter a cramped tent and sat down next to me behind an old netbook. I would ask their name and their sun sign and pipe this information into `shuf --random-source=/dev/stdin`, which selected pseudo-random dictionary words and `tee`d them into `espeak`. If someone's name was too short `shuf` crashed with an end of file error. --random-seed would have worked better.

github-actions · 2025-03-26T19:22:39Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)

We used to use a clever homegrown way to sample integers. But GNU shuf with --random-source observably uses Fisher-Yates, and the output of the old version depended on a heuristic (making it dangerous for --random-seed). So now we do Fisher-Yates here, just like we do for other inputs. In deterministic modes the output for --input-range is identical that for piping `seq` into `shuf`. We imitate the old algorithm's method for keeping the resource use in check. The performance of the new version is very close to that of the old version: I haven't found any cases where it's much faster or much slower.

github-actions · 2025-03-26T20:00:54Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

sylvestre · 2025-03-27T11:38:52Z

Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test)
thanks

blyxxyz · 2025-03-27T19:34:26Z

Is there such a test? I can only find a test for passing --random-source multiple times. tests/shuf/shuf-reservoir and tests/shuf/shuf pass on my machine.

sylvestre · 2025-03-30T09:22:17Z

sorry, i was confusing with shred :( sorry

Not sure why I added this. The test can stay though.

RenjiSann · 2025-11-14T23:15:14Z

This MR has conflicts. @blyxxyz are you available to update the patch ?

github-actions · 2025-11-15T08:17:04Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

blyxxyz · 2025-11-15T08:29:27Z

Done, thanks for the heads up! I've also turned the new strings into translation keys.

I think the CI failure is spurious/unrelated?

RenjiSann

Very pleasant review to make !
Your code is well-documented and easy to read.

The only nit is my question regarding the itoa dependency.

Thank you for your contribution !

RenjiSann · 2025-11-18T01:08:40Z

src/uu/shuf/src/shuf.rs

+impl Writable for u64 {
    fn write_all_to(&self, output: &mut impl OsWrite) -> Result<(), Error> {
-        write!(output, "{self}")
+        let mut buf = itoa::Buffer::new();


Could you explain briefly why you're using an external crate here to fulfill what seems to be a simple u64::to_string() ?

I guess there is some kind of edge case that's being handled, but I'm wondering whether it's worth adding a dependency or if we could implement it ourselves.

The only difference is performance but that difference is significant, at the time it sped up shuf -r -n1000000 -i1-1024 by 1.8×.

We could probably get much of that speedup in a custom implementation just by bypassing string formatting but itoa is very well-optimized with uninitialized memory and other tricks. (Apparently it's the same code as the standard library, just without the std::fmt abstractions.)

As microdependencies go this one is very low-risk, it's one of the most downloaded crates on crates.io and we already have it as a transitive dependency through time. So it felt worth adding. I could remove/replace it if you prefer, for now I've added a comment to the code.

github-actions · 2025-11-18T07:41:56Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

To test: uutils#7585

codspeed-hq · 2025-11-18T08:38:49Z

CodSpeed Performance Report

Merging #7585 will degrade performances by 15.07%

_{Comparing blyxxyz:clean-shuf-2 (ed5ae0d) with main (67c7612)}

Summary

⚡ 1 improvement
❌ 2 regressions
✅ 123 untouched
⏩ 6 skipped¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
⚡	`shuf_input_range[1000000]`	229.1 ms	124.8 ms	+83.65%
❌	`shuf_lines[100000]`	26.9 ms	31.7 ms	-15.07%
❌	`shuf_repeat_sampling[50000]`	4.6 ms	5.1 ms	-9.94%

6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

sylvestre · 2025-11-18T08:43:14Z

cargo-deny: #9322

github-actions · 2025-11-18T08:43:45Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

blyxxyz added 7 commits March 26, 2025 09:22

shuf: Move NonrepeatingIterator to own module

bbf65ba

shuf: correctness: Flush output after writing

fa386da

This is important since the output is buffered and errors may end up ignored otherwise. `shuf -e a b c > /dev/full` now errors while it didn't before.

shuf: perf: Bump output buffer to 64KB

bab5343

shuf: correctness: Do not use panics to report --random-source read e…

f6df6a6

…rrors

shuf: perf: Use itoa for integer formatting

b6d1b80

This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.

blyxxyz force-pushed the clean-shuf-2 branch from 535b0c0 to f34dda2 Compare March 26, 2025 19:26

blyxxyz mentioned this pull request Mar 31, 2025

Flush BufWriters, clean up error reporting #7622

Merged

blyxxyz added 3 commits April 30, 2025 13:55

Merge remote-tracking branch 'origin/main' into clean-shuf-2

a8fdcc1

shuf: Remove pointless heuristic

10bc746

Not sure why I added this. The test can stay though.

shuf: Add tests

2f06a48

blyxxyz added 3 commits November 15, 2025 08:45

Merge remote-tracking branch 'origin/main' into clean-shuf-2

4bb0aea

shuf: Resolve clippy warnings

68b1a51

shuf: Translate new strings

3601fff

RenjiSann reviewed Nov 18, 2025

View reviewed changes

shuf: Explain itoa use

0731e9d

sylvestre added a commit to sylvestre/coreutils that referenced this pull request Nov 18, 2025

shuf: add benchmarks

055b1e2

To test: uutils#7585

sylvestre mentioned this pull request Nov 18, 2025

shuf: add benchmarks #9320

Merged

Merge branch 'main' into clean-shuf-2

ed5ae0d

	/// Reproducible seeded random number generation.
	///
	/// The behavior should stay the same between releases, so don't change it without
	/// a very good reason.
	///
	/// # How it works
	///
	/// - Take a Unicode string as the seed.
	///
	/// - Encode this seed as UTF-8.
	///
	/// - Take the SHA3-256 hash of the encoded seed.
	///
	/// - Use that hash as the input for a [`rand_chacha`] ChaCha12 RNG.
	/// (We don't touch the nonce, so that's probably zero.)
	///
	/// - Take 64-bit samples from the RNG.
	///
	/// - Use Lemire's method to generate uniformly distributed integers and:
	///
	/// - With --repeat, use these to pick elements from ranges.
	///
	/// - Without --repeat, use these to do left-to-right modern Fisher-Yates.
	///
	/// # Why it works like this
	///
	/// - Unicode string: Greatest common denominator between platforms. Windows doesn't
	/// let you pass raw bytes as a CLI argument and that would be bad practice anyway.
	/// A decimal or hex number would work but this is much more flexible without being
	/// unmanageable.
	///
	/// (Footgun: if the user passes a filename we won't read from the file but the
	/// command will run anyway.)
	///
	/// - UTF-8: That's what Rust likes and it's the least unreasonable Unicode encoding.
	///
	/// - SHA3-256: We want to make good use of the entire user input and SHA-3 is
	/// state of the art. ChaCha12 takes a 256-bit seed.
	///
	/// - ChaCha12: [`rand`]'s default rng as of writing. Seems state of the art.
	///
	/// - 64-bit samples: We could often get away with 32-bit samples but let's keep things
	/// simple and only use one width. (There doesn't seem to be much of a performance hit.)
	///
	/// - Lemire, Fisher-Yates: These are very easy to implement and maintain ourselves.
	/// `rand` provides fancier implementations but only promises reproducibility within
	/// patch releases: https://rust-random.github.io/book/crate-reprod.html
	///
	/// Strictly speaking even `ChaCha12` is subject to breakage. But since it's a very
	/// specific algorithm I assume it's safe in practice.

Uh oh!

shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize #7585

Are you sure you want to change the base?

shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize #7585

Uh oh!

Conversation

blyxxyz commented Mar 26, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

sylvestre commented Mar 27, 2025

Uh oh!

blyxxyz commented Mar 27, 2025

Uh oh!

sylvestre commented Mar 30, 2025

Uh oh!

RenjiSann commented Nov 14, 2025

Uh oh!

github-actions bot commented Nov 15, 2025

Uh oh!

blyxxyz commented Nov 15, 2025

Uh oh!

RenjiSann left a comment

Choose a reason for hiding this comment

Uh oh!

RenjiSann Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

blyxxyz Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 18, 2025

Uh oh!

codspeed-hq bot commented Nov 18, 2025

CodSpeed Performance Report

Merging #7585 will degrade performances by 15.07%

Summary

Benchmarks breakdown

Footnotes

Uh oh!

sylvestre commented Nov 18, 2025

Uh oh!

github-actions bot commented Nov 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585