A. First I want to start on a positive note.
Your RNG implementation is about 2.5 times faster than Base, even for a single value (probably just because the underlying RNG is faster, despite being vectorized).
julia> x = zeros(UInt64, 1); @btime rand!(local_rng(), $x);
18.234 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 1); @btime rand!($x);
46.249 ns (0 allocations: 0 bytes)
You also beat Base on Float64 (by a smaller factor, about 2.4 times). It's interesting, though, that your integer generation is still slower than your Float64 generation. I assume you do not generate a Float64 first, which is what I think Base does, so the gap would be understandable for theirs.
julia> x = zeros(Float64, 1); @btime rand!(local_rng(), $x);
12.197 ns (0 allocations: 0 bytes)
julia> x = zeros(Float64, 1); @btime rand!($x);
29.039 ns (0 allocations: 0 bytes)
While I remember: integer generation doesn't work with PCG (I don't care), so should your tagline be changed? Maybe you just started with PCG, and maybe I influenced you to add the other RNG. It currently reads: "Vectorized PCG uniform, normal, and exponential random samplers."
B.
What you really want to show is, e.g., the possible 3.4 times gain that you can see in this microbenchmark, but I think it may be harmful:
julia> x = zeros(UInt64, 2048); @btime rand!(local_rng(), $x);
1.023 μs (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 2048); @btime rand!($x);
3.498 μs (0 allocations: 0 bytes)
Now, as you know, I've pointed to and used your package on Discourse, and we could discuss this there in the open, but I want to make sure I'm not misunderstanding before I give a warning about my program, which is 3x faster mostly thanks to your package.
Someone else used this trick of precalculating the random numbers first, and I just improved that program while keeping that idea. I've seen it before, maybe first on Debian's Benchmarks Game.
And if others do it, you are kind of forced to do so too in a microbenchmark to match other languages, since it works.
I just think this may be a very deceptive idea, and I think your vectorized package depends on it. At some point, for me around 2048 generated random numbers (16 KiB of UInt64 values), it stops getting faster per value and gets slower with larger arrays. My guess is that this happens once the generated numbers approach the size of the L1 cache. But that's just the best case.
In all real-world programs you're going to do something with those values, and using the L1 cache for something else is valuable, so in those programs the right size might be much lower. So, is there a way to have your cake and eat it too?
I'm not sure what a safe number of generated random numbers is. With just a few numbers generated you get almost all of the gain (maybe keep the next four generated values in registers only?). But in programs that use this idea, I'm not sure you can do away with a small buffer in memory.
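To make the "small buffer in memory" idea concrete, here is a minimal sketch using only the `Random` standard library. `BufferedRNG` and `next!` are hypothetical names invented for this illustration, not part of this package or of Base, and the buffer size of 16 is an arbitrary assumption rather than a recommendation.

```julia
using Random

# Hypothetical wrapper, not part of any package: it hands out one UInt64 at a
# time from a small buffer that is refilled in bulk with rand!.
mutable struct BufferedRNG{R<:AbstractRNG}
    rng::R
    buf::Vector{UInt64}
    pos::Int   # index of the next unread value; length(buf) + 1 means "empty"
end

# Start with an empty buffer so the first draw triggers a refill.
BufferedRNG(rng::AbstractRNG, n::Integer) =
    BufferedRNG(rng, Vector{UInt64}(undef, n), Int(n) + 1)

function next!(b::BufferedRNG)
    if b.pos > length(b.buf)
        rand!(b.rng, b.buf)   # one bulk call instead of length(buf) scalar calls
        b.pos = 1
    end
    v = b.buf[b.pos]
    b.pos += 1
    return v
end

# 16 values occupy 128 bytes, so the buffer itself takes very little cache.
b = BufferedRNG(MersenneTwister(1), 16)
vals = [next!(b) for _ in 1:40]
```

Passing a different `AbstractRNG` as the first argument should work the same way, as long as it supports `rand!` on a `Vector{UInt64}`. Whether such a wrapper keeps most of the bulk-generation gain would have to be measured.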
julia> x = zeros(UInt64, 1); @btime rand!(local_rng(), $x);
18.276 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 2); @btime rand!(local_rng(), $x);
18.259 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 4); @btime rand!(local_rng(), $x);
18.209 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 8); @btime rand!(local_rng(), $x);
21.283 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 16); @btime rand!(local_rng(), $x);
25.688 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 32); @btime rand!(local_rng(), $x);
32.816 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 64); @btime rand!(local_rng(), $x);
50.996 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 128); @btime rand!(local_rng(), $x);
85.410 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 256); @btime rand!(local_rng(), $x);
144.156 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 512); @btime rand!(local_rng(), $x);
266.631 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 1024); @btime rand!(local_rng(), $x);
517.667 ns (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 2048); @btime rand!(local_rng(), $x);
1.023 μs (0 allocations: 0 bytes)
# the problem starts somewhere around here, or much sooner in real-world programs
julia> x = zeros(UInt64, 4096); @btime rand!(local_rng(), $x);
2.224 μs (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 8192); @btime rand!(local_rng(), $x);
6.119 μs (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 1000000); @btime rand!(local_rng(), $x);
847.457 μs (0 allocations: 0 bytes)
julia> x = zeros(UInt64, 2048000); @btime rand!(local_rng(), $x);
2.147 ms (0 allocations: 0 bytes)
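The trend is easier to see as time per generated value. The sketch below only redoes the arithmetic on a subset of the timings above; `ns_per_value` and `buffer_kib` are helper names made up for this illustration.

```julia
# (number of UInt64 values, measured time in ns), copied from the timings above.
timings = [(1, 18.276), (4, 18.209), (64, 50.996), (2048, 1023.0),
           (4096, 2224.0), (8192, 6119.0), (1_000_000, 847_457.0),
           (2_048_000, 2_147_000.0)]

# Hypothetical helpers: cost per generated value and buffer size in KiB.
ns_per_value(n, t_ns) = t_ns / n
buffer_kib(n) = n * sizeof(UInt64) / 1024

for (n, t) in timings
    println(n, " values (", buffer_kib(n), " KiB): ",
            round(ns_per_value(n, t); digits = 3), " ns per value")
end
```

On these numbers the cost per value bottoms out at about 0.5 ns for 2048 values (16 KiB), rises to about 0.75 ns at 8192 values (64 KiB), and reaches about 1.05 ns at 2,048,000 values.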