Skip to content

Conversation

@blyxxyz
Copy link
Contributor

@blyxxyz blyxxyz commented Mar 26, 2025

  • Add a --random-seed option that allows using a string as a seed for deterministic output. For example: shuf --random-seed=🦀 -i1-10.

    This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever --random-seed is passed. Details and motivation:

    /// Reproducible seeded random number generation.
    ///
    /// The behavior should stay the same between releases, so don't change it without
    /// a very good reason.
    ///
    /// # How it works
    ///
    /// - Take a Unicode string as the seed.
    ///
    /// - Encode this seed as UTF-8.
    ///
    /// - Take the SHA3-256 hash of the encoded seed.
    ///
    /// - Use that hash as the input for a [`rand_chacha`] ChaCha12 RNG.
    /// (We don't touch the nonce, so that's probably zero.)
    ///
    /// - Take 64-bit samples from the RNG.
    ///
    /// - Use Lemire's method to generate uniformly distributed integers and:
    ///
    /// - With --repeat, use these to pick elements from ranges.
    ///
    /// - Without --repeat, use these to do left-to-right modern Fisher-Yates.
    ///
    /// # Why it works like this
    ///
    /// - Unicode string: Greatest common denominator between platforms. Windows doesn't
    /// let you pass raw bytes as a CLI argument and that would be bad practice anyway.
    /// A decimal or hex number would work but this is much more flexible without being
    /// unmanageable.
    ///
    /// (Footgun: if the user passes a filename we won't read from the file but the
    /// command will run anyway.)
    ///
    /// - UTF-8: That's what Rust likes and it's the least unreasonable Unicode encoding.
    ///
    /// - SHA3-256: We want to make good use of the entire user input and SHA-3 is
    /// state of the art. ChaCha12 takes a 256-bit seed.
    ///
    /// - ChaCha12: [`rand`]'s default rng as of writing. Seems state of the art.
    ///
    /// - 64-bit samples: We could often get away with 32-bit samples but let's keep things
    /// simple and only use one width. (There doesn't seem to be much of a performance hit.)
    ///
    /// - Lemire, Fisher-Yates: These are very easy to implement and maintain ourselves.
    /// `rand` provides fancier implementations but only promises reproducibility within
    /// patch releases: https://rust-random.github.io/book/crate-reprod.html
    ///
    /// Strictly speaking even `ChaCha12` is subject to breakage. But since it's a very
    /// specific algorithm I assume it's safe in practice.

    Resolves Implementation of shuf --random-source #2528.

    I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:

    --random-seed <STRING> seed with STRING for reproducible output

  • Make the behavior of --random-source mostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a #[ignore] test.)

  • Use u64 instead of usize for --input-range and --head-count. It's now possible to generate integers above u32::MAX on 32-bit systems.

  • Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to seq | shuf. (The performance is practically the same.)

  • Handle --random-source I/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.

  • Flush the BufWriter at the end. If we don't do this then write errors get ignored.

  • Use a larger write buffer (8K → 64K).

  • Use the itoa crate to format integers. With --repeat and --input-range this was a bottleneck.

I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work. shuf -r -n10000000 -i0-10 now runs twice as fast though.

blyxxyz added 7 commits March 26, 2025 09:22
This is important since the output is buffered and errors may end up
ignored otherwise.

`shuf -e a b c > /dev/full` now errors while it didn't before.
This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.
When the --random-source option is used uutils shuf now gives
identical output to GNU shuf in many (but not all) cases. This is
helpful to users who use it to get deterministic output, e.g. by
combining it with `openssl` as suggested in the GNU info pages.

I reverse engineered the algorithm from GNU shuf's output. There may
be bugs.

Other modes of shuffling still use `rand`'s `ThreadRng`, though they
now sample a uniform distribution directly without going through the
slice helper trait.

Additionally, switch from `usize` to `u64` for `--input-range` and
`--head-count`. This way the same range of numbers can be generated on
32-bit platforms as on 64-bit platforms.
This adds a new option to get reproducible output from a seed. This
was already possible with --random-source, but doing that properly was
tricky and had poor performance.

Adding this option implies a commitment to keep using the exact same
algorithms in the future. For that reason we only use third-party
libraries for well-known algorithms and implement our own
distributions on top of that.

-----

As a teenager on King's Day I once used `shuf` for divination. People
paid €0.50 to enter a cramped tent and sat down next to me behind an
old netbook. I would ask their name and their sun sign and pipe this
information into `shuf --random-source=/dev/stdin`, which selected
pseudo-random dictionary words and `tee`d them into `espeak`.

If someone's name was too short `shuf` crashed with an end of file
error. --random-seed would have worked better.
@github-actions
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)

We used to use a clever homegrown way to sample integers. But GNU shuf
with --random-source observably uses Fisher-Yates, and the output of
the old version depended on a heuristic (making it dangerous for
--random-seed).

So now we do Fisher-Yates here, just like we do for other inputs. In
deterministic modes the output for --input-range is identical that for
piping `seq` into `shuf`.

We imitate the old algorithm's method for keeping the resource use in
check. The performance of the new version is very close to that of the
old version: I haven't found any cases where it's much faster or much
slower.
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@sylvestre
Copy link
Contributor

Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test)
thanks

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Mar 27, 2025

Is there such a test? I can only find a test for passing --random-source multiple times. tests/shuf/shuf-reservoir and tests/shuf/shuf pass on my machine.

@sylvestre
Copy link
Contributor

sorry, i was confusing with shred :( sorry

@RenjiSann
Copy link
Collaborator

This MR has conflicts. @blyxxyz are you available to update the patch ?

@github-actions
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Nov 15, 2025

Done, thanks for the heads up! I've also turned the new strings into translation keys.

I think the CI failure is spurious/unrelated?

Copy link
Collaborator

@RenjiSann RenjiSann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very pleasant review to make !
Your code is well-documented and easy to read.

The only nit is my question regarding the itoa dependency.

Thank you for your contribution !

impl Writable for u64 {
fn write_all_to(&self, output: &mut impl OsWrite) -> Result<(), Error> {
write!(output, "{self}")
let mut buf = itoa::Buffer::new();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain briefly why you're using an external crate here to fulfill what seems to be a simple u64::to_string() ?

I guess there is some kind of edge case that's being handled, but I'm wondering whether it's worth adding a dependency or if we could implement it ourselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference is performance but that difference is significant, at the time it sped up shuf -r -n1000000 -i1-1024 by 1.8×.

We could probably get much of that speedup in a custom implementation just by bypassing string formatting but itoa is very well-optimized with uninitialized memory and other tricks. (Apparently it's the same code as the standard library, just without the std::fmt abstractions.)

As microdependencies go this one is very low-risk, it's one of the most downloaded crates on crates.io and we already have it as a transitive dependency through time. So it felt worth adding. I could remove/replace it if you prefer, for now I've added a comment to the code.

@github-actions
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

sylvestre added a commit to sylvestre/coreutils that referenced this pull request Nov 18, 2025
@sylvestre sylvestre mentioned this pull request Nov 18, 2025
@codspeed-hq
Copy link

codspeed-hq bot commented Nov 18, 2025

CodSpeed Performance Report

Merging #7585 will degrade performances by 15.07%

Comparing blyxxyz:clean-shuf-2 (ed5ae0d) with main (67c7612)

Summary

⚡ 1 improvement
❌ 2 regressions
✅ 123 untouched
⏩ 6 skipped1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
shuf_input_range[1000000] 229.1 ms 124.8 ms +83.65%
shuf_lines[100000] 26.9 ms 31.7 ms -15.07%
shuf_repeat_sampling[50000] 4.6 ms 5.1 ms -9.94%

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@sylvestre
Copy link
Contributor

cargo-deny: #9322

@github-actions
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implementation of shuf --random-source

3 participants