Sub optimal parallelism on a server with a high number of CPU cores #976

@nsarlin-zama

Description

Hello,
I would like to report a performance issue when using arkworks on a machine with 192 cores. I get a reproducible speedup simply by limiting the number of rayon threads available to the program. For example, with the benchmarks of the groth16 crate:

$ cargo bench
wall-clock proving time for Bls12_381: 6.7046785 s

$ RAYON_NUM_THREADS=50 cargo bench
wall-clock proving time for Bls12_381: 4.314917805 s

The machine running the bench is an AWS hpc7 instance with an AMD EPYC 9R14 CPU.

I did most of my experiments on the verification time for the pke_v2 scheme of this crate: https://github.com/zama-ai/tfhe-rs/tree/main/tfhe-zk-pok.

Starting from the baseline:

  • If I limit rayon to 50 threads, I get a 20% improvement
  • If I simply remove the parallel feature for ark-poly, I get a 10-15% improvement depending on the exact scheme
  • If I multiply the minimum chunk size in algebra/poly/src/domain/radix2/fft.rs by 4, I get an 8-12% improvement depending on the exact scheme:
const CHUNK_SIZE_MUL: usize = 4;

/// The minimum size of a chunk at which parallelization of `butterfly`s is
/// beneficial. This value was chosen empirically.
const MIN_GAP_SIZE_FOR_PARALLELIZATION: usize = (1 << 10) * CHUNK_SIZE_MUL;

/// The minimum size of the input at which parallelization of `butterfly`s is
/// beneficial. This value was chosen empirically.
const MIN_INPUT_SIZE_FOR_PARALLELIZATION: usize = (1 << 10) * CHUNK_SIZE_MUL;

// minimum size at which to parallelize.
#[cfg(feature = "parallel")]
const LOG_ROOTS_OF_UNITY_PARALLEL_SIZE: u32 = 7 * (CHUNK_SIZE_MUL as u32);

(these values were chosen quickly, just to get a rough idea)

I understand that the exact values depend on the target CPU, but this is still a non-negligible gain. It looks like the chunk size for the FFT butterflies could be increased?
I tried the same approach in other parts of the library, but the FFT gave the most significant boost. However, since the boost from the FFT is smaller than the boost from RAYON_NUM_THREADS=50, I think there are other places where larger chunks would also be beneficial.
Intuitively, the gains from RAYON_NUM_THREADS should be a lower bound on what is achievable, since the thread limit affects all computations indiscriminately, so targeted chunk-size tuning might do even better.

How were the chunk values chosen in the first place? Is there a specific arkworks benchmark I could run to check whether there are new optimal values for these constants?

Thanks a lot for your help!
