Commit bfd5be6: Merge branch 'main' into main
2 parents 508db3d + b0ef970

7 files changed: +214, -4 lines
Lines changed: 159 additions & 0 deletions

---
title: "Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL"
subtitle: "GPU-to-GPU Communication in PyLops-MPI for Large-scale Inverse Problems with Nvidia's NCCL"
summary:
authors:
- tharitt
tags: ["osre25", "gsoc25", "lbl", "PyLops-MPI"]
categories: ["High-performance Computing", "GPU", "NCCL", "Parallel Programming"]
date: 2025-09-05
lastmod: 2025-09-05
featured: true
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: ""
  focal_point: "Center"
  preview_only: false
---

# Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2

Hello all! 👋 This is Tharit again. I want to share Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at [Part 1](https://ucsc-ospo.github.io/report/osre25/lbl/pylops-mpi/20250723-tharit/) as well. Without further introduction, the following features have been added since last time.

- ### Complex Number Support [PR #148](https://github.com/PyLops/pylops-mpi/pull/148)
_Between this PR and the previous one, there was a lot of debugging and testing to make sure that all existing `MPILinearOperator`s work under NCCL as they do with `mpi4py`: PRs [#141](https://github.com/PyLops/pylops-mpi/pull/141), [#142](https://github.com/PyLops/pylops-mpi/pull/142), and [#145](https://github.com/PyLops/pylops-mpi/pull/145)._

Most PyLops-MPI users are scientists and engineers working on scientific problems, and many of those problems involve complex numbers (the Fourier transform touches many things). _NCCL does not support complex numbers out of the box_.

It turned out that adding complex-number support was not a big issue. A complex array is simply a contiguous array of, say, `float64` values: one `complex128` element is represented by two `float64` values. Things get more complicated once complex-number arithmetic is involved. Luckily, the [NCCL semantics](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclRedOp_t) only support the _element-wise_ `ncclSum`, `ncclProd`, `ncclMin`, `ncclMax`, and `ncclAvg`, so wrapping element-wise operations for complex numbers is straightforward.
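
To make the representation concrete, here is a small illustration (a hypothetical snippet, not PyLops-MPI code) of viewing a complex CuPy array as a real array with twice the elements - exactly the size adjustment the buffer-size helper shown next accounts for.

```python
import cupy as cp

# Illustration only: a complex array reinterpreted (no copy) as a contiguous
# real array with twice the number of elements.
z = cp.array([1 + 2j, 3 - 4j], dtype=cp.complex128)
r = z.view(cp.float64)        # [1., 2., 3., -4.]
assert r.size == 2 * z.size   # one complex128 element == two float64 values
# An element-wise ncclSum over the float64 view is equivalent to summing the
# complex array itself, since real and imaginary parts add independently.
```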

The change to PyLops-MPI's `_nccl.py` itself is minimal. We simply added the function below, which hides the complexity of buffer-size management from users.
```python
def _nccl_buf_size(buf, count=None):
    """Get an appropriate buffer size according to the dtype of buf"""
    if buf.dtype in ['complex64', 'complex128']:
        return 2 * count if count else 2 * buf.size
    else:
        return count if count else buf.size
```
The concept is quite simple. But mechanically, getting it right in the general case required some extensive bug fixing, particularly in the call to `_allgather`, as noted earlier in the "Core Change" section. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result of PyLops-MPI's NCCL allgather matches that of the PyLops-MPI MPI allgather; PyLops-MPI must be able to switch between `mpi4py` and NCCL seamlessly from the user's perspective. To make this concrete, here is how we do `_allgather()` with NCCL:
```python
def _allgather(self, send_buf, recv_buf=None):
    """Allgather operation
    """
    if deps.nccl_enabled and self.base_comm_nccl:
        if isinstance(send_buf, (tuple, list, int)):
            return nccl_allgather(self.base_comm_nccl, send_buf, recv_buf)
        else:
            send_shapes = self.base_comm.allgather(send_buf.shape)
            (padded_send, padded_recv) = _prepare_nccl_allgather_inputs(send_buf, send_shapes)
            raw_recv = nccl_allgather(self.base_comm_nccl, padded_send, recv_buf if recv_buf else padded_recv)
            return _unroll_nccl_allgather_recv(raw_recv, padded_send.shape, send_shapes)
    # < snip - MPI allgather >
```
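
For intuition, the padding and unrolling steps exist because NCCL's allgather expects every rank to contribute a buffer of the same size, while the local arrays may have different shapes. Below is a simplified, NumPy-only sketch of what helpers in the spirit of `_prepare_nccl_allgather_inputs` and `_unroll_nccl_allgather_recv` have to do (hypothetical names `pad_to_largest` and `unroll_recv`; this is a conceptual illustration, not the actual PyLops-MPI implementation, which operates on CuPy arrays).

```python
import numpy as np

def pad_to_largest(send_buf, all_shapes):
    """Pad the local array so every rank sends the same number of elements."""
    max_size = max(int(np.prod(s)) for s in all_shapes)
    padded_send = np.zeros(max_size, dtype=send_buf.dtype)
    padded_send[:send_buf.size] = send_buf.ravel()
    padded_recv = np.zeros(max_size * len(all_shapes), dtype=send_buf.dtype)
    return padded_send, padded_recv

def unroll_recv(raw_recv, padded_shape, all_shapes):
    """Trim the padding out of each rank's chunk after the allgather."""
    chunk = int(np.prod(padded_shape))
    out = []
    for rank, shape in enumerate(all_shapes):
        n = int(np.prod(shape))
        out.append(raw_recv[rank * chunk: rank * chunk + n].reshape(shape))
    return out
```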

**After this feature was added, PyLops-MPI with NCCL caught up with the original MPI implementation, i.e., the test coverage is now the same: 306 tests passed!**

- ### Benchmark Instrumentation [PR #157](https://github.com/PyLops/pylops-mpi/pull/157)

Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a _lightweight benchmark instrumentation_ framework to PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.

The core of the implementation is a `@benchmark` decorator. Inside a decorated function, developers can call `mark(label)` to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.

But because we are in Python, to handle nested function calls we collect the timing information as a stack (a bottom-up call graph) and parse the result at the end of the decorated function. Here is an illustration:
```python
@benchmark
def outer_func_with_mark(par):
    mark("Outer func start")
    inner_func_with_mark(par)  # <- this does `dot` and is also decorated
    dist_arr = DistributedArray(global_shape=par['global_shape'],
                                partition=par['partition'],
                                dtype=par['dtype'], axis=par['axis'])
    dist_arr + dist_arr
    mark("Outer func ends")
```
The text output is:
```
[decorator]outer_func_with_mark: total runtime: 0.001206 s
[decorator]inner_func_with_mark: total runtime: 0.000351 s
Begin array constructor-->Begin dot: 0.000026 s
Begin dot-->Finish dot: 0.000322 s
Outer func start-->Outer func ends: 0.001202 s
```

Benchmarking is controlled via the environment variable `BENCH_PYLOPS_MPI`. It defaults to `1` (enabled) but can be set to `0` to skip benchmarking for clean output. **This means users can leave the decorated code unchanged and disable the benchmark through the environment variable.** This is inspired by C++ debug flags set at compile time. Moreover, careful attention had to be paid to the concurrency issues of benchmarking, because the time is recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; [PR #163](https://github.com/PyLops/pylops-mpi/pull/163) is an example of this.
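
To give a feel for the mechanism, here is a stripped-down sketch of such a decorator and `mark()` helper (a simplified, hypothetical version, not the actual PyLops-MPI implementation, which also handles nested calls and CUDA-stream synchronization):

```python
import os
import time
from functools import wraps

_marks = []

def mark(label):
    """Record a labelled timestamp inside a decorated function."""
    _marks.append((label, time.perf_counter()))

def benchmark(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Respect the environment switch: run the function untouched when disabled.
        if os.environ.get("BENCH_PYLOPS_MPI", "1") == "0":
            return func(*args, **kwargs)
        _marks.clear()
        start = time.perf_counter()
        out = func(*args, **kwargs)
        total = time.perf_counter() - start
        print(f"[decorator]{func.__name__}: total runtime: {total:.6f} s")
        # Report the elapsed time between consecutive marks.
        for (label0, t0), (label1, t1) in zip(_marks, _marks[1:]):
            print(f"{label0}-->{label1}: {t1 - t0:.6f} s")
        return out
    return wrapper
```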

- ### Benchmark Results

This was the moment of truth: our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectations were:
- If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using `CuPy + NCCL` should still be faster than `NumPy + MPI` (and possibly `CuPy + MPI`) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.

The result below is from the NCSA (UIUC) Delta system with [4-way NVIDIA A40 GPUs](https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html) (no NVLink), using the `allreduce` operation.

<p align="center">
<img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b139e63d-11ed-47f4-95f8-5e86bed26312" />
</p>

That meets our expectation. One thing to note here: the `CuPy + MPI` communication is actually slower than `NumPy + MPI`. This is because the current implementation of PyLops-MPI uses the non-buffered calls of `mpi4py` - see the details [here](https://mpi4py.readthedocs.io/en/stable/tutorial.html). That choice was made for its simplicity, as it allows sending and receiving generic Python objects wrapped in a `list` and thus enabled a fast development process. However, these calls require copying memory from GPU to CPU, doing the communication, and copying memory back from CPU to GPU (the pickle protocol) - see our discussion with the `mpi4py` community [here](https://github.com/mpi4py/mpi4py/discussions/657). This leads us to the "Things left to do" section (later).

- If the system has NVLink for GPU-GPU communication, we should see a significant performance gain for PyLops-MPI with NCCL.

The result below is also from the NCSA (UIUC) Delta system, this time with [8-way NVIDIA H200 GPUs](https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html) (with NVLink), but we only use 4 GPUs to compare with the previous result. This is also with the `allreduce` operation.

<p align="center">
<img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b3d83547-b9af-4b1c-87c0-ace2302eb140" />
</p>

Here we unleash the true power of NCCL and its infrastructure: **the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation!** It may not make much sense to compare this number with `NumPy + MPI`, because a drastic hardware infrastructure upgrade is involved.

To top things off, we also ran an experiment trying to saturate the communication, with the total array size going up to 32 GB. We see linear scaling, i.e., time grows linearly with data size.
<p align="center">
<img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/e5a95fdc-8db7-4caf-925f-256f504603bc" />
</p>

Finally, we ran an experiment with the application of [Least-squares Migration](https://wiki.seg.org/wiki/Least-squares_migration), which is an iterative inversion scheme:
- Each iteration applies a forward `A` and an adjoint `A.T` operation to form residuals and gradients.
- Gradient accumulation requires a global reduction across processes with `allreduce`.

Note that the computation is non-trivial, so the total runtimes of CPU and GPU are not fairly comparable (notice that on H200, `CuPy + MPI` is no longer the slowest). But we want to give an idea of how things fit together in a real application.
<div align="center">
<img width="400" height="300" alt="kirchA40" src="https://gist.github.com/user-attachments/assets/46c3a76a-20a3-40c3-981e-6e1c4acecb49" />
<img width="400" height="300" alt="kirchhoff_h200" src="https://gist.github.com/user-attachments/assets/1439304a-8f78-4640-a78b-ba37238b26e6" />
</div>

### The impact of this GSoC project is clear:

With our NCCL-enabled PyLops-MPI,
- if you don't have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still deliver 10x the communication bandwidth (the A40 case);
- if you do, it lets you get the most out of the system (the H200 case).

And the best thing is that using NCCL with PyLops-MPI requires minimal code changes, as shown in this [LSM Tutorial](https://github.com/PyLops/pylops-mpi/blob/main/tutorials_nccl/lsm_nccl.py) and illustrated below. Only two changes are required compared to the code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the `DistributedArray`. And that's it!

```python
nccl_comm = pylops_mpi.utils._nccl.initialize_nccl_comm()

# <snip - same set-up as running with MPI>

lsm = LSM(
    # <snip>
    cp.asarray(wav.astype(np.float32)),  # Copy to GPU
    # <snip>
    engine="cuda",
    dtype=np.float32
)
lsm.Demop.trav_srcs = cp.asarray(lsm.Demop.trav_srcs.astype(np.float32))  # Copy to GPU
lsm.Demop.trav_recs = cp.asarray(lsm.Demop.trav_recs.astype(np.float32))  # Copy to GPU

x0 = pylops_mpi.DistributedArray(VStack.shape[1],
                                 partition=pylops_mpi.Partition.BROADCAST,
                                 base_comm_nccl=nccl_comm,  # Explicitly pass the NCCL communicator
                                 engine="cupy")  # Must use CuPy
# <snip - the rest is the same>
```

### Things left to do
- CUDA-aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses the non-buffered calls of `mpi4py` and thus introduces memory copies from GPU to CPU. We aim to optimize this by introducing buffered calls. However, this is not a trivial task, because some of the MPI-related code was developed against the semantics that communication returns a `list` object, while a buffered call returns an array instead.
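
For context, here is what the switch looks like in plain `mpi4py` terms (a minimal NumPy sketch, not PyLops-MPI code; with a CUDA-aware MPI build, the uppercase buffered calls can take CuPy arrays directly and avoid the host round-trip):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.arange(4, dtype=np.float64)

# Non-buffered (lowercase): pickles the object and returns a list of arrays.
gathered_list = comm.allgather(local)

# Buffered (uppercase): communicates the raw buffer into a preallocated array.
gathered_arr = np.empty(comm.Get_size() * local.size, dtype=local.dtype)
comm.Allgather(local, gathered_arr)
```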
Lines changed: 51 additions & 0 deletions

---
title: "Final Report: A Systematic Investigation into the Reproducibility of RAG Systems"
subtitle:
summary: "Final project report: This project delivered ReproRAG, a comprehensive framework to benchmark reproducibility in Retrieval-Augmented Generation systems. Our large-scale empirical study reveals that core retrieval algorithms like FAISS are perfectly deterministic under controlled conditions, falsifying a common assumption. We identified the true, dominant sources of uncertainty as the choice of embedding model and dynamic data changes, establishing a clear hierarchy of reproducibility challenges."
authors:
- wbq-321
tags: ["osre25", "reproducibility", "rag", "llm", "ai-for-science", "final-report"]
categories: ["Project Update"]
date: 2025-09-05
lastmod: 2025-09-05
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: ""
  focal_point: "Smart"
  preview_only: false
---

I'm Baiqiang, and this is the final report for the [Enhancing Reproducibility in RAG Frameworks for Scientific Workflows](https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/) project, mentored by Luanzheng "Lenny" Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.

### The Challenge: The Need for Systematic Measurement

Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this issue was a known concern, a fundamental challenge was the lack of standardized tools and methodologies to systematically measure and quantify the sources of this inconsistency. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests and establish the true root causes, hindering the development of truly trustworthy AI systems for science.

### Our Contribution: The ReproRAG Framework

To address this gap, the central contribution of this project is **ReproRAG**, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:
* **Isolating Variables:** It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.
* **Quantifying Uncertainty:** It employs a suite of metrics—including Exact Match Rate, Jaccard Similarity, and Kendall's Tau—to precisely measure the impact of each variable on the final retrieved results.
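
To make these metrics concrete, here is a minimal sketch of how such run-to-run comparisons can be computed over two retrieved top-k lists (hypothetical helper names, not the actual ReproRAG API; `kendalltau` comes from SciPy):

```python
from scipy.stats import kendalltau

def exact_match(run_a, run_b):
    """1.0 only if both runs return identical ranked lists of document IDs."""
    return float(run_a == run_b)

def jaccard(run_a, run_b):
    """Set overlap of retrieved document IDs, ignoring order."""
    a, b = set(run_a), set(run_b)
    return len(a & b) / len(a | b)

def rank_agreement(run_a, run_b):
    """Kendall's tau over the documents retrieved by both runs."""
    common = [d for d in run_a if d in run_b]
    tau, _ = kendalltau([run_a.index(d) for d in common],
                        [run_b.index(d) for d in common])
    return tau

run1 = ["doc3", "doc7", "doc1", "doc9"]  # top-k IDs from one run
run2 = ["doc3", "doc7", "doc9", "doc4"]  # top-k IDs from a repeated run
print(exact_match(run1, run2), jaccard(run1, run2), rank_agreement(run1, run2))
```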

### Key Findings: A New Hierarchy of Uncertainty

Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.

1. **Core Algorithms Are Not the Problem:** Our most surprising finding is that modern retrieval libraries like FAISS are perfectly reproducible out-of-the-box. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), we achieved perfect run-to-run reproducibility (1.000 scores on all metrics) when environmental factors like random seeds were controlled. This falsifies the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness.

2. **Embedding Model Choice is a Dominant Source of Variation:** We found that the choice of the embedding model is a dominant factor driving result variation. When comparing outputs from different state-of-the-art models (BGE, E5, Qwen) for the same query, the agreement was very low (e.g., Overlap Coefficient of ~0.43-0.54). This means a scientific conclusion drawn with one model may not be reproducible with another, as they are fundamentally "seeing" different evidence.

3. **Environmental Factors Introduce Measurable "Drift":**
    * **Numerical Precision:** Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable "embedding drift" rather than chaotic changes.
    * **Data Insertion:** Incrementally adding new data to an index caused a predictable "displacement" of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall's Tau of 1.000).

4. **Common Determinism Flags Can Be Ineffective:** Our tests showed that popular software-level controls, like `cudnn.deterministic` flags in PyTorch, had no observable effect on the output of modern transformer-based embedding models. This underscores the necessity of empirical validation over assuming that framework settings work as advertised.
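
As a rough illustration of the kind of run-to-run check behind findings 1 and 4, here is a minimal sketch that fixes seeds and determinism flags, builds the same approximate FAISS index twice, and compares the retrieved IDs (illustrative parameters only; this is not the ReproRAG experiment code):

```python
import numpy as np
import faiss
import torch

def retrieve(seed, d=64, n=10_000, k=5):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True   # one of the flags under test
    xb = np.random.rand(n, d).astype("float32") # stand-in for document embeddings
    xq = np.random.rand(10, d).astype("float32")
    index = faiss.IndexHNSWFlat(d, 32)          # approximate (HNSW) index
    index.add(xb)
    _, ids = index.search(xq, k)
    return ids

# Perfect run-to-run reproducibility corresponds to identical ID lists.
print(np.array_equal(retrieve(0), retrieve(0)))
```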

### Conclusion

This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly "random" algorithms, but to rigorously control the entire experimental environment. We delivered **ReproRAG**, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.

content/report/osre25/uchicago/MPI/20250614 - Rohan Babbar/index.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ summary:
 authors:
 - rohanbabbar04
 tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
-categories: ["osre25", "reproducibility", "HPC", "MPI"]
+categories: ["osre25", "reproducibility", "SOR", "HPC", "MPI"]
 date: 2025-06-14
 lastmod: 2025-06-14
 featured: false

content/report/osre25/uchicago/MPI/20250803 - Rohan Babbar/index.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ summary:
 authors:
 - rohanbabbar04
 tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
-categories: ["osre25", "reproducibility", "HPC", "MPI"]
+categories: ["osre25", "reproducibility", "SOR", "HPC", "MPI"]
 date: 2025-08-03
 lastmod: 2025-08-03
 featured: false

content/report/osre25/uchicago/MPI/20250831 - Rohan Babbar/index.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ summary:
 authors:
 - rohanbabbar04
 tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
-categories: ["osre25", "reproducibility", "HPC", "MPI"]
+categories: ["osre25", "reproducibility", "SOR", "HPC", "MPI"]
 date: 2025-08-31
 lastmod: 2025-08-31
 featured: false

content/report/osre25/uchicago/MPI/20250901 - Rohan Babbar/index.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ summary:
 authors:
 - rohanbabbar04
 tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
-categories: ["osre25", "reproducibility", "HPC", "MPI"]
+categories: ["osre25", "reproducibility", "SOR", "HPC", "MPI"]
 date: 2025-09-01
 lastmod: 2025-09-01
 featured: false
