Sub optimal parallelism on a server with a high number of CPU cores #976

@nsarlin-zama

Description

Hello,
I would like to report a performance issue when using arkworks on a machine with 192 cores. I get a reproducible speedup simply by limiting the number of rayon threads available to the program. For example, with the benchmarks of the groth16 crate:

$ cargo bench
wall-clock proving time for Bls12_381: 6.7046785 s

$ RAYON_NUM_THREADS=50 cargo bench
wall-clock proving time for Bls12_381: 4.314917805 s

The machine running the bench is an AWS hpc7 instance with an AMD EPYC 9R14 CPU.

I did most of my experiments on the verification time for the pke_v2 scheme of this crate: https://github.com/zama-ai/tfhe-rs/tree/main/tfhe-zk-pok.

Starting from the baseline:

  • If I limit rayon to 50 threads, I get a 20% improvement
  • If I simply remove the parallel feature for ark-poly, I get a 10-15% improvement depending on the exact scheme
  • If I multiply the minimum chunk size in algebra/poly/src/domain/radix2/fft.rs by 4, I get an 8-12% improvement depending on the exact scheme:
const CHUNK_SIZE_MUL: usize = 4;

/// The minimum size of a chunk at which parallelization of `butterfly`s is
/// beneficial. This value was chosen empirically.
const MIN_GAP_SIZE_FOR_PARALLELIZATION: usize = (1 << 10) * CHUNK_SIZE_MUL;

/// The minimum size of the input at which parallelization of `butterfly`s is
/// beneficial. This value was chosen empirically.
const MIN_INPUT_SIZE_FOR_PARALLELIZATION: usize = (1 << 10) * CHUNK_SIZE_MUL;

// minimum size at which to parallelize.
#[cfg(feature = "parallel")]
const LOG_ROOTS_OF_UNITY_PARALLEL_SIZE: u32 = 7 * (CHUNK_SIZE_MUL as u32);

(these values were chosen quickly, just to get a rough idea)

I understand that the exact values depend on the target CPU, but this is still a non-negligible gain. It looks like the chunk size for the FFT butterflies could be increased?
I tried the same approach in other parts of the library, but the FFT gave the most significant boost. However, since the boost from the FFT is smaller than the boost from RAYON_NUM_THREADS=50, I think there are other places where larger chunks would also be beneficial.
Intuitively, the gains from RAYON_NUM_THREADS should be a lower bound on what is achievable, since the thread limit affects all computations indiscriminately, so targeted chunk-size tuning might do even better.

How were the chunk values chosen in the first place? Is there a specific arkworks benchmark I could run to check whether there are new optimal values for these constants?

Thanks a lot for your help!
