CARAML

Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark framework designed to assess AI workloads on novel accelerators. It has been developed and tested extensively on systems at the Jülich Supercomputing Centre (JSC).

CARAML leverages JUBE, a scripting-based framework for creating benchmark sets, running them across different systems, and evaluating results. Additionally, it includes power/energy measurements through the jpwr tool.

Paper: Arxiv, IEEE

Tested Accelerators

CARAML has been tested on the JURECA-DC EVALUATION PLATFORM, JURECA-DC, JEDI, WEST-AI Nodes, NHR-FAU and Mila. These include the accelerators:

| System                                            | Configuration                                     | Tag       |
|---------------------------------------------------|---------------------------------------------------|-----------|
| NVIDIA Ampere node (SXM)                          | 4 × A100 (40GB HBM2e) GPUs                        | `A100`    |
| NVIDIA Ampere node (SXM)                          | 4 × A100 (80GB HBM2e) GPUs                        | `MILA`    |
| NVIDIA Hopper node (PCIe)                         | 4 × H100 (80GB HBM2e) GPUs                        | `H100`    |
| NVIDIA Hopper node (NVLink)                       | 4 × H100 (94GB HBM2e) GPUs                        | `WAIH100` |
| NVIDIA Grace-Hopper chip                          | 1 × GH200 (480GB LPDDR5X, 96GB HBM3) GPU          | `GH200`   |
| NVIDIA Grace-Hopper node                          | 4 × GH200 (120GB LPDDR5X, 96GB HBM3) GPUs         | `JUPITER` |
| AMD MI300X node                                   | 8 × MI300X (192GB HBM3) GPUs                      | `MI300X`  |
| AMD MI300A node                                   | 4 × MI300A (128GB HBM3) APUs                      | `MI300A`  |
| AMD MI200 node                                    | 4 × MI250 (128GB HBM2e) GPUs                      | `MI250`   |
| Graphcore IPU-POD4 M2000                          | 4 × GC200 (512GB DDR4-3200) IPUs                  | `GC200`   |

Note: The MILA tag is supported only for the FNO benchmark.

Benchmark

CARAML currently provides benchmarks implemented in Python:

1. Computer Vision: Image Classification (Training)

The image_classification model training benchmark is implemented in PyTorch. It is designed to test image classification models such as ResNet50 on various accelerators. For IPUs, graphcore/examples is used.

Performance is measured in images/s and energy is measured in Wh.
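
As a rough illustration of how the two reported quantities relate to raw measurements, the sketch below derives images/s from an image count and wall time, and integrates sampled power readings into Wh with the trapezoidal rule. The function names and the sample format are assumptions made for this example; it does not claim to reproduce how jpwr or the benchmark scripts compute their numbers.

```python
def images_per_second(num_images: int, elapsed_s: float) -> float:
    """Throughput as processed images divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_images / elapsed_s


def energy_wh(samples: list[tuple[float, float]]) -> float:
    """Integrate (time_s, power_W) samples with the trapezoidal rule.

    Returns the energy in watt-hours (1 Wh = 3600 J).
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0


if __name__ == "__main__":
    # Hypothetical readings: a constant 300 W sampled over one hour.
    samples = [(0.0, 300.0), (1800.0, 300.0), (3600.0, 300.0)]
    print(images_per_second(12800, 10.0))  # 1280.0
    print(energy_wh(samples))              # 300.0
```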

Note: Support for the image classification benchmark in TensorFlow has been discontinued.

2. Natural Language Processing: GPT Language Model (Training)

The LLM-training benchmark is implemented in PyTorch with:

- Megatron-LM with commit: f7727433293427bef04858f67b2889fe9b177d88 and patch applied for NVIDIA
- Megatron-LM-ROCm with commit: 21045b59127cd2d5509f1ca27d81fae7b485bd22 and patch applied for AMD
- graphcore/examples (forked version) for Graphcore

Performance is measured in tokens/s and energy is recorded in Wh.
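
Token throughput in Megatron-style training is commonly derived from the global batch size, the sequence length, and the time per iteration. The helper below is a sketch of that arithmetic; the parameter names and example values are hypothetical and not taken from the benchmark code.

```python
def tokens_per_second(global_batch_size: int, seq_length: int,
                      iteration_time_s: float) -> float:
    """Tokens processed per second during one training iteration."""
    if iteration_time_s <= 0:
        raise ValueError("iteration_time_s must be positive")
    return global_batch_size * seq_length / iteration_time_s


if __name__ == "__main__":
    # Hypothetical values: 32 sequences of 2048 tokens every 0.5 s.
    print(tokens_per_second(32, 2048, 0.5))  # 131072.0
```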

3. Neural Operator: Fourier Neural Operator (FNO)

The operator-benchmark is implemented in PyTorch with operator_learning for NVIDIA systems.

It enables comprehensive analysis of mixed-precision training and the performance impact of torch.compile during both training and inference of FNO models.

The benchmark includes experiments on two representative problems:

- Rayleigh–Bénard convection (RBC) in 2D and 3D
- Plasma simulation using Particle-in-Cell (PIC) methods in 1D and 2D

Performance is measured in timesteps/s.

Requirements

To run the benchmarks, install JUBE by following the setup instructions in the JUBE Installation Documentation. The benchmarks are deployed using Apptainer containers and executed using SLURM on the tested accelerators.

Dataset

- Image Classification: Synthetic data is generated on the host machine for benchmarking. The IPU tag synthetic additionally allows for the generation of synthetic data directly on the IPU.
- LLM Training: A subset of the OSCAR dataset (790 samples, ~10 MB) is pre-processed using GPT-2 tokenizers. This data is provided in the llm_data directory.
- FNO Benchmark: Data from numerical solvers for RBC and PIC is cloned from a Hugging Face repository during the setup phase of the benchmark using setup_fno_env.sh and is kept in operator_benchmark/fno_data.

Execution

Clone the repository and navigate into it:

```
git clone https://github.com/FZJ-JSC/CARAML.git
cd CARAML
```

Modify the system and model parameters in the respective JUBE configuration file.
To pull the required container use the container tag as follows:
```
jube run {JUBEConfig}.{xml,yaml} --tag container H100
```
Replace H100 with one of the following as needed:
- MILA (for A100 80GB, only for FNO benchmark)
- GH200 (for Arm CPU + H100)
- MI250 or MI300X or MI300A (for AMD)
- GC200 (for Graphcore)

Note: The container tag should ideally be used only once at the beginning to pull and set up the container and environment.
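
When many system and tag combinations need to be launched, the command line can be assembled programmatically. The following is a minimal sketch; `jube_run_command` is a hypothetical helper written for this example and only builds the same `jube run ... --tag ...` invocation shown above.

```python
import shlex


def jube_run_command(config: str, tags: list[str]) -> list[str]:
    """Build the argument list for `jube run <config> --tag <tags...>`."""
    if not tags:
        return ["jube", "run", config]
    return ["jube", "run", config, "--tag", *tags]


if __name__ == "__main__":
    # One-time container setup for the H100 system tag.
    setup = jube_run_command(
        "image_classification/image_classification_torch_benchmark.xml",
        ["container", "H100"],
    )
    print(shlex.join(setup))
    # To execute instead of printing (requires JUBE on PATH):
    #   import subprocess; subprocess.run(setup, check=True)
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.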

Image Classification Training Benchmark

To run the benchmark with the defined configurations, run:
```
jube run image_classification/image_classification_torch_benchmark.xml --tag H100
```
H100 can be replaced with any tag listed in the Tested Accelerators section.

After the benchmark has been executed, use jube continue to postprocess the results:
```
jube continue image_classification/image_classification_torch_benchmark_run -i last
```
To generate the results, run:
```
jube result image_classification/image_classification_torch_benchmark_run -i last
```
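
The run, continue, and result steps follow the same pattern for each benchmark. The sketch below lists the three commands for a given configuration file. It assumes the run directory is the configuration path with its extension replaced by `_run`, which matches the examples in this README but is ultimately set inside each JUBE configuration; the helper names are hypothetical.

```python
from pathlib import PurePosixPath


def run_directory(config: str) -> str:
    """Derive `<config without extension>_run` (pattern seen in the examples)."""
    path = PurePosixPath(config)
    return str(path.with_name(path.stem + "_run"))


def workflow_commands(config: str, tags: list[str]) -> list[list[str]]:
    """Return the run, continue, and result commands in execution order."""
    run_dir = run_directory(config)
    return [
        ["jube", "run", config, "--tag", *tags],
        ["jube", "continue", run_dir, "-i", "last"],
        ["jube", "result", run_dir, "-i", "last"],
    ]


if __name__ == "__main__":
    config = "image_classification/image_classification_torch_benchmark.xml"
    for command in workflow_commands(config, ["H100"]):
        print(" ".join(command))
```

The continue and result steps should only be issued once the submitted jobs have finished, so this sketch prints the commands rather than chaining them blindly.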

LLM Pre-Training Benchmark

To run the benchmark with the defined configurations for the 800M GPT model with OSCAR data, run:
```
jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
```
A100 can be replaced with any tag listed in the Tested Accelerators section, and 800M can be replaced with 13B or 175B on systems with more node resources.
To run the benchmark with the defined configurations for the 117M GPT model on Graphcore with synthetic data, run:
```
jube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
```
If the synthetic tag is not given, the benchmark uses OSCAR data.
After the benchmark has been executed, use jube continue to postprocess the results:
```
jube continue llm_training/llm_benchmark_{nvidia_amd,ipu}_run -i last
```

To generate the results, run:
```
jube result llm_training/llm_benchmark_{nvidia_amd,ipu}_run -i last
```

FNO Training & Inference Benchmark

To run all problems (PIC1D, PIC2D, RBC2D, RBC3D), use the all tag; otherwise, use the respective problem tag.

- Distributed Training: Add the ddp tag to enable distributed data parallel (DDP) training.
- Torch.compile: torch.compile can be run with different modes of execution for training and inference. Use the eval tag for inference.

Example to run training in mixed precision with different torch.compile modes on H100:
```
jube run operator_benchmark/fno_benchmark.yaml --tag H100 all
```
H100 can be replaced with any tag listed in the Tested Accelerators section.
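
The tags described in this section can be combined on a single command line. The sketch below composes them for the FNO benchmark; `fno_tags` is a hypothetical helper, and the exact spelling of the problem tags is assumed from the names used in this section rather than checked against the JUBE configuration.

```python
# Problem tags as named in this section (spelling assumed, see lead-in).
PROBLEM_TAGS = {"PIC1D", "PIC2D", "RBC2D", "RBC3D", "all"}


def fno_tags(system: str, problem: str = "all", ddp: bool = False,
             inference: bool = False) -> list[str]:
    """Compose the tag list: system, problem, optional ddp and eval."""
    if problem not in PROBLEM_TAGS:
        raise ValueError(f"unknown problem tag: {problem!r}")
    tags = [system, problem]
    if ddp:
        tags.append("ddp")   # distributed data parallel training
    if inference:
        tags.append("eval")  # inference instead of training
    return tags


if __name__ == "__main__":
    tags = fno_tags("H100", "RBC2D", ddp=True)
    print("jube run operator_benchmark/fno_benchmark.yaml --tag " + " ".join(tags))
```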

Results

JSC Specific Fixes

To use the PyTorch torchrun API on JSC systems, the fixed_torch_run.py fix is required. The fix solves the issue defined here.

Additionally, an i is appended to the hostname to allow communication over InfiniBand, as described here.
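
The hostname adjustment amounts to a small string operation. The sketch below illustrates it with a hypothetical helper and made-up hostnames; appending the i to the first label of a fully qualified name is an assumption of this example, not something the benchmark scripts are confirmed to do.

```python
def infiniband_hostname(hostname: str) -> str:
    """Append "i" to the first label of a hostname."""
    short, dot, domain = hostname.partition(".")
    return f"{short}i{dot}{domain}"


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    print(infiniband_hostname("node0001"))          # node0001i
    print(infiniband_hostname("node0001.cluster"))  # node0001i.cluster
```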

Citation

@INPROCEEDINGS{10820809,
  author={John, Chelsea Maria and Nassyr, Stepan and Penke, Carolin and Herten, Andreas},
  booktitle={SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis}, 
  title={Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML}, 
  year={2024},
  pages={1164-1176},
  doi={10.1109/SCW63240.2024.00158}
  }
