-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize
#7585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is important since the output is buffered and errors may end up ignored otherwise. `shuf -e a b c > /dev/full` now errors while it didn't before.
This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.
When the --random-source option is used uutils shuf now gives identical output to GNU shuf in many (but not all) cases. This is helpful to users who use it to get deterministic output, e.g. by combining it with `openssl` as suggested in the GNU info pages. I reverse engineered the algorithm from GNU shuf's output. There may be bugs. Other modes of shuffling still use `rand`'s `ThreadRng`, though they now sample a uniform distribution directly without going through the slice helper trait. Additionally, switch from `usize` to `u64` for `--input-range` and `--head-count`. This way the same range of numbers can be generated on 32-bit platforms as on 64-bit platforms.
This adds a new option to get reproducible output from a seed. This was already possible with --random-source, but doing that properly was tricky and had poor performance. Adding this option implies a commitment to keep using the exact same algorithms in the future. For that reason we only use third-party libraries for well-known algorithms and implement our own distributions on top of that. ----- As a teenager on King's Day I once used `shuf` for divination. People paid €0.50 to enter a cramped tent and sat down next to me behind an old netbook. I would ask their name and their sun sign and pipe this information into `shuf --random-source=/dev/stdin`, which selected pseudo-random dictionary words and `tee`d them into `espeak`. If someone's name was too short `shuf` crashed with an end of file error. --random-seed would have worked better.
|
GNU testsuite comparison: |
We used to use a clever homegrown way to sample integers. But GNU shuf with --random-source observably uses Fisher-Yates, and the output of the old version depended on a heuristic (making it dangerous for --random-seed). So now we do Fisher-Yates here, just like we do for other inputs. In deterministic modes the output for --input-range is identical that for piping `seq` into `shuf`. We imitate the old algorithm's method for keeping the resource use in check. The performance of the new version is very close to that of the old version: I haven't found any cases where it's much faster or much slower.
|
GNU testsuite comparison: |
|
Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test) |
|
Is there such a test? I can only find a test for passing |
|
sorry, i was confusing with shred :( sorry |
Not sure why I added this. The test can stay though.
|
This MR has conflicts. @blyxxyz are you available to update the patch ? |
|
GNU testsuite comparison: |
|
Done, thanks for the heads up! I've also turned the new strings into translation keys. I think the CI failure is spurious/unrelated? |
RenjiSann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very pleasant review to make !
Your code is well-documented and easy to read.
The only nit is my question regarding the itoa dependency.
Thank you for your contribution !
| impl Writable for u64 { | ||
| fn write_all_to(&self, output: &mut impl OsWrite) -> Result<(), Error> { | ||
| write!(output, "{self}") | ||
| let mut buf = itoa::Buffer::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain briefly why you're using an external crate here to fulfill what seems to be a simple u64::to_string() ?
I guess there is some kind of edge case that's being handled, but I'm wondering whether it's worth adding a dependency or if we could implement it ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference is performance but that difference is significant, at the time it sped up shuf -r -n1000000 -i1-1024 by 1.8×.
We could probably get much of that speedup in a custom implementation just by bypassing string formatting but itoa is very well-optimized with uninitialized memory and other tricks. (Apparently it's the same code as the standard library, just without the std::fmt abstractions.)
As microdependencies go this one is very low-risk, it's one of the most downloaded crates on crates.io and we already have it as a transitive dependency through time. So it felt worth adding. I could remove/replace it if you prefer, for now I've added a comment to the code.
|
GNU testsuite comparison: |
To test: uutils#7585
CodSpeed Performance ReportMerging #7585 will degrade performances by 15.07%Comparing Summary
Benchmarks breakdown
Footnotes
|
|
cargo-deny: #9322 |
|
GNU testsuite comparison: |
Add a
--random-seedoption that allows using a string as a seed for deterministic output. For example:shuf --random-seed=🦀 -i1-10.This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever
--random-seedis passed. Details and motivation:coreutils/src/uu/shuf/src/random_seed.rs
Lines 12 to 61 in e89aaf8
Resolves Implementation of
shuf --random-source#2528.I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:
--random-seed <STRING> seed with STRING for reproducible outputMake the behavior of
--random-sourcemostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a#[ignore]test.)Use
u64instead ofusizefor--input-rangeand--head-count. It's now possible to generate integers aboveu32::MAXon 32-bit systems.Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to
seq | shuf. (The performance is practically the same.)Handle
--random-sourceI/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.Flush the
BufWriterat the end. If we don't do this then write errors get ignored.Use a larger write buffer (8K → 64K).
Use the
itoacrate to format integers. With--repeatand--input-rangethis was a bottleneck.I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work.
shuf -r -n10000000 -i0-10now runs twice as fast though.