Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark framework designed to assess AI workloads on novel accelerators. It has been developed and tested extensively on systems at the Jülich Supercomputing Centre (JSC).
CARAML leverages JUBE, a scripting-based framework for creating benchmark sets, running them across different systems, and evaluating results. Additionally, it includes power/energy measurements through the jpwr tool.
CARAML has been tested on the JURECA-DC EVALUATION PLATFORM, JURECA-DC, JEDI, WEST-AI Nodes, NHR-FAU and Mila. These include the accelerators:
| System | Configuration | Tag |
|---------------------------------------------------|---------------------------------------------------|-----------|
| NVIDIA Ampere node (SXM) | 4 × A100 (40GB HBM2e) GPUs | `A100` |
| NVIDIA Ampere node (SXM) | 4 × A100 (80GB HBM2e) GPUs | `MILA` |
| NVIDIA Hopper node (PCIe) | 4 × H100 (80GB HBM2e) GPUs | `H100` |
| NVIDIA Hopper node (NVLink) | 4 × H100 (94GB HBM2e) GPUs | `WAIH100` |
| NVIDIA Grace-Hopper chip | 1 × GH200 (480GB LPDDR5X, 96GB HBM3) GPU | `GH200` |
| NVIDIA Grace-Hopper node | 4 × GH200 (120GB LPDDR5X, 96GB HBM3) GPUs | `JUPITER` |
| AMD MI300X node | 8 × MI300X (192GB HBM3) GPUs | `MI300X` |
| AMD MI300A node | 4 × MI300A (128GB HBM3) APUs | `MI300A` |
| AMD MI200 node | 4 × MI250 (128GB HBM2e) GPUs | `MI250` |
| Graphcore IPU-POD4 M2000 | 4 × GC200 (512GB DDR4-3200) IPUs | `GC200` |

Note: The `MILA` tag is supported only for the FNO benchmark.
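The tag table and the `MILA` restriction can be captured in a small lookup, which is handy when scripting runs over several systems. The following is a minimal sketch, not part of CARAML: the dictionary layout, the `check_tag` helper and the benchmark labels (`fno`, `llm_training`, `image_classification`) are hypothetical names chosen for illustration; only the tags and configurations come from the table.

```python
# Tags and configurations copied from the table; the data layout is illustrative.
ACCELERATOR_TAGS = {
    "A100": "4 × A100 (40GB HBM2e) GPUs",
    "MILA": "4 × A100 (80GB HBM2e) GPUs",
    "H100": "4 × H100 (80GB HBM2e) GPUs",
    "WAIH100": "4 × H100 (94GB HBM2e) GPUs",
    "GH200": "1 × GH200 (480GB LPDDR5X, 96GB HBM3) GPU",
    "JUPITER": "4 × GH200 (120GB LPDDR5X, 96GB HBM3) GPUs",
    "MI300X": "8 × MI300X (192GB HBM3) GPUs",
    "MI300A": "4 × MI300A (128GB HBM3) APUs",
    "MI250": "4 × MI250 (128GB HBM2e) GPUs",
    "GC200": "4 × GC200 (512GB DDR4-3200) IPUs",
}

# Tags limited to specific benchmarks (hypothetical labels); MILA is FNO-only.
RESTRICTED_TAGS = {"MILA": {"fno"}}


def check_tag(tag: str, benchmark: str) -> str:
    """Return the configuration for `tag`, or raise ValueError if it is unusable."""
    if tag not in ACCELERATOR_TAGS:
        raise ValueError(f"unknown accelerator tag: {tag!r}")
    allowed = RESTRICTED_TAGS.get(tag)
    if allowed is not None and benchmark not in allowed:
        raise ValueError(f"tag {tag!r} is not supported for benchmark {benchmark!r}")
    return ACCELERATOR_TAGS[tag]
```

Calling `check_tag("MILA", "llm_training")` raises a `ValueError`, while `check_tag("MILA", "fno")` returns the A100 80GB configuration.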
CARAML currently provides benchmarks implemented in Python:
The image_classification model training benchmark is implemented in PyTorch. It is designed to test image classification models such as ResNet50 on various accelerators. For IPUs, graphcore/examples is used.
Performance is measured in images/s and energy is measured in Wh.
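To make the two units concrete, the sketch below derives images/s from a timed run and integrates evenly spaced power samples into Wh. It is an illustration under stated assumptions, not the way CARAML or jpwr compute these values: the function names, the fixed sampling interval and the trapezoidal integration are all assumptions made for this example.

```python
from typing import Sequence

JOULES_PER_WATT_HOUR = 3600.0  # 1 Wh = 3600 J


def images_per_second(num_images: int, elapsed_s: float) -> float:
    """Throughput of a run that processed `num_images` in `elapsed_s` seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_images / elapsed_s


def energy_wh(power_samples_w: Sequence[float], interval_s: float) -> float:
    """Integrate evenly spaced power samples (watts) into watt-hours.

    Uses the trapezoidal rule between consecutive samples (an assumption
    of this sketch, not a description of jpwr).
    """
    if len(power_samples_w) < 2:
        return 0.0
    joules = sum(
        (a + b) / 2.0 * interval_s
        for a, b in zip(power_samples_w, power_samples_w[1:])
    )
    return joules / JOULES_PER_WATT_HOUR
```

Dividing the number of processed images by the energy in Wh then gives an efficiency figure (images/Wh) that can be compared across accelerators.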
Note: Support for the image classification benchmark in TensorFlow has been discontinued.
The LLM-training benchmark is implemented in PyTorch with:
- Megatron-LM with commit `f7727433293427bef04858f67b2889fe9b177d88` and patch applied for NVIDIA
- Megatron-LM-ROCm with commit `21045b59127cd2d5509f1ca27d81fae7b485bd22` and patch applied for AMD
- graphcore/examples (forked version) for Graphcore
Performance is measured in tokens/s and energy is recorded in Wh.
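As a rough illustration of the tokens/s metric, throughput can be derived from the per-iteration time of a training run. This sketch assumes that every sequence in the global batch has the same fixed length, as is common in GPT-style training; the function and parameter names are hypothetical and are not taken from the benchmark code.

```python
def tokens_per_second(
    global_batch_size: int, sequence_length: int, iteration_time_s: float
) -> float:
    """Tokens processed per second for one training iteration.

    Assumes each of the `global_batch_size` sequences contains exactly
    `sequence_length` tokens.
    """
    if iteration_time_s <= 0:
        raise ValueError("iteration_time_s must be positive")
    return global_batch_size * sequence_length / iteration_time_s
```

For example, a global batch of 32 sequences with 2048 tokens each, processed in 4 s, corresponds to 16384 tokens/s.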
The operator-benchmark is implemented in PyTorch with operator_learning for NVIDIA systems.
It enables comprehensive analysis of mixed-precision training and the performance impact of torch.compile during both training and inference of FNO models.
The benchmark includes experiments on two representative problems:
- Rayleigh–Bénard convection (RBC) in 2D and 3D
- Plasma simulation using Particle-in-Cell (PIC) methods in 1D and 2D.
Performance is measured in timesteps/s.
To run the benchmarks, install JUBE by following the setup instructions in the JUBE Installation Documentation. The benchmarks are deployed using Apptainer containers and executed using SLURM on the tested accelerators.
- Image Classification: Synthetic data is generated on the host machine for benchmarking. The IPU tag `synthetic` additionally allows for the generation of synthetic data directly on the IPU.
- LLM Training: A subset of the OSCAR dataset (790 samples, ~10 MB) is pre-processed using GPT-2 tokenizers. This data is provided in the `llm_data` directory.
- FNO Benchmark: Data from numerical solvers for RBC and PIC is cloned from a Hugging Face repository during the setup phase of the benchmark using `setup_fno_env.sh` and kept at `operator_benchmark/fno_data`.
- Clone the repository and navigate into it:
  git clone https://github.com/FZJ-JSC/CARAML.git
  cd CARAML
- Modify the `system` and `model` parameters in the respective JUBE configuration file.
- To pull the required container, use the `container` tag as follows:
  jube run {JUBEConfig}.{xml,yaml} --tag container H100
  Replace `H100` with one of the following as needed:
  - `MILA` (for A100 80GB, only for FNO benchmark)
  - `GH200` (for Arm CPU + H100)
  - `MI250` or `MI300X` or `MI300A` (for AMD)
  - `GC200` (for Graphcore)

Note: The `container` tag should ideally be used only once at the beginning to pull and set up the container and environment.
- To run the benchmark with the defined configurations, do:
  jube run image_classification/image_classification_torch_benchmark.xml --tag H100
  `H100` can be replaced with any tag mentioned in the tested accelerators section.
- After the benchmark has been executed, use `jube continue` to postprocess the results:
  jube continue image_classification/image_classification_torch_benchmark_run -i last
- To generate the result, do:
  jube result image_classification/image_classification_torch_benchmark_run -i last
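The run, continue and result steps follow the same pattern for every benchmark, so they can be generated from the configuration path and the tags. The helper below is a hypothetical sketch (the name `jube_workflow` is not part of CARAML or JUBE). It assumes the run directory is the configuration path without its extension plus `_run`, which matches the commands shown here but is ultimately determined by the JUBE configuration file itself.

```python
import os
import shlex
from typing import List


def jube_workflow(config: str, tags: List[str]) -> List[List[str]]:
    """Build the `jube run`, `jube continue` and `jube result` commands."""
    # Assumption: the benchmark writes to "<config without extension>_run".
    run_dir = os.path.splitext(config)[0] + "_run"
    return [
        ["jube", "run", config, "--tag", *tags],
        ["jube", "continue", run_dir, "-i", "last"],
        ["jube", "result", run_dir, "-i", "last"],
    ]


# Print the commands instead of executing them, so they can be reviewed first.
for command in jube_workflow(
    "image_classification/image_classification_torch_benchmark.xml", ["H100"]
):
    print(shlex.join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` once the printed commands look right. Note that `jube continue` is only meaningful after the submitted jobs have finished.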
- To run the benchmark with the defined configurations for the `800M` GPT model with OSCAR data, do:
  jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
  `A100` can be replaced with any tag mentioned in the tested accelerators section, and `800M` can be replaced with `13B` or `175B` for systems with more node resources.
- To run the benchmark with the defined configurations for the `117M` GPT model on Graphcore with synthetic data, do:
  jube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
  If the tag `synthetic` is not given, the benchmark will use OSCAR data.
- After the benchmark has been executed, use `jube continue` to postprocess the results:
  jube continue llm_training/llm_benchmark_{nvidia_amd,ipu}_run -i last
- To generate the result, do:
  jube result llm_training/llm_benchmark_{nvidia_amd,ipu}_run -i last
To run all problems (PIC1D, PIC2D, RBC2D, RBC3D), use the `all` tag; otherwise, use the respective problem tag.
- Distributed Training: Add the `ddp` tag to enable distributed data parallel (DDP) training.
- Torch.compile: `torch.compile` can be run with different modes of execution for training and inference. Use the `eval` tag for inference.
Example to run training in mixed precision with different torch.compile modes on H100:
jube run operator_benchmark/fno_benchmark.yaml --tag H100 all
`H100` can be replaced with any tag mentioned in the tested accelerators section.
To use the PyTorch `torchrun` API on JSC systems, the `fixed_torch_run.py` fix is required. The fix solves the issue defined here.
Additionally, an `i` is appended to the hostname to allow communication over InfiniBand, as described here.
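The hostname adjustment can be pictured with a small helper. This is a hypothetical sketch, not the code used in `fixed_torch_run.py`: the function name is invented, and keeping a domain suffix intact after the short host name is an assumption of this example.

```python
import socket


def infiniband_hostname(hostname: str) -> str:
    """Append "i" to the short host name, keeping any domain suffix.

    Example (hypothetical node names): "node01" -> "node01i",
    "node01.cluster" -> "node01i.cluster".
    """
    short_name, dot, domain = hostname.partition(".")
    return f"{short_name}i{dot}{domain}"


# Derive the InfiniBand-facing name of the current node.
print(infiniband_hostname(socket.gethostname()))
```

The resulting name could then be used as the rendezvous address for distributed training, so that traffic between nodes goes over the InfiniBand interface.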
@INPROCEEDINGS{10820809,
author={John, Chelsea Maria and Nassyr, Stepan and Penke, Carolin and Herten, Andreas},
booktitle={SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
title={Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML},
year={2024},
pages={1164-1176},
doi={10.1109/SCW63240.2024.00158}
}


