Usability of non-default CUDA streams, per-stream synchronization, memory safety guards #7348
Type
Motivation and Context
In my experience, working with non-default CUDA streams is very dangerous in the current state of O3D.
A simple example that produces very unexpected behavior:
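Something along these lines (a minimal sketch, not the exact original snippet; it assumes the `CUDAScopedStream` helper from `open3d::core` with its `CreateNewStream` tag):

```cpp
#include <iostream>

#include "open3d/core/CUDAUtils.h"
#include "open3d/core/Tensor.h"

int main() {
    using namespace open3d::core;

    // Sketch: run all subsequent CUDA work on a fresh non-default stream.
    CUDAScopedStream stream(CUDAScopedStream::CreateNewStream);

    Tensor gpu = Tensor::Ones({1000}, Float32, Device("CUDA:0"));

    // The device->host memcpy is enqueued asynchronously on the
    // non-default stream, and nothing synchronizes the stream before
    // the host reads the result.
    Tensor cpu = gpu.To(Device("CPU:0"));

    // Frequently prints garbage: the host buffer may not be filled yet.
    std::cout << cpu.ToString() << std::endl;
    return 0;
}
```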
While a very similar-looking piece of code doesn't have this issue:
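Essentially the same code, minus the non-default stream (again a sketch):

```cpp
#include <iostream>

#include "open3d/core/Tensor.h"

int main() {
    using namespace open3d::core;

    // Same tensor code, but on the (legacy) default stream, whose
    // implicit synchronization semantics hide the missing per-stream sync.
    Tensor gpu = Tensor::Ones({1000}, Float32, Device("CUDA:0"));
    Tensor cpu = gpu.To(Device("CPU:0"));

    // Prints the expected values.
    std::cout << cpu.ToString() << std::endl;
    return 0;
}
```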
The problem boils down to how the CUDA memory manager is implemented.
The issues are masked by the synchronization behavior of the CUDA default stream, which non-default streams do not share: https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html
The CUDA memory manager's Memcpy operation is implemented to perform async memcpy operations regardless of the copy direction (device->device, device->host, host->device). This was a huge red flag when I first saw it, but the library worked, so I just assumed my imagined world was a lie and all was okay somehow.
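Concretely, the pattern reduces to something like this (a hypothetical sketch, not the actual Open3D code; `CopyDeviceToHost` is a made-up name):

```cpp
#include <cuda_runtime.h>

// Hypothetical reduction of the memory manager's Memcpy pattern.
void CopyDeviceToHost(void* dst_host, const void* src_dev, size_t num_bytes,
                      cudaStream_t stream) {
    // Enqueued asynchronously even though the destination is host memory...
    cudaMemcpyAsync(dst_host, src_dev, num_bytes, cudaMemcpyDeviceToHost,
                    stream);
    // ...so without a synchronization point the caller can read dst_host
    // before the copy has landed. A per-stream sync closes the hole
    // without stalling unrelated streams:
    cudaStreamSynchronize(stream);
}
```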
My initial thinking was: if device->host is an async memcpy, wouldn't the library break pretty much all over the place, at every point where any operations are done on tensors of the sort `gpu_tensor.To("CPU:0").AnyOperationHere()`? That seems logical, as in such a case host-native operations would be done on top of memory which was yet to be filled. But in practice this was not the case, so I went on with my life.
Until eventually, in my multi-threaded workflows, I was annoyed that the threads were implicitly being synchronized: because they shared the same CUDA stream, they waited on each other's work every time the stream needed to be synchronized. I then tried to use the scoped stream to create a new CUDA stream on each thread, so that they would not have to interfere with each other. To my horror, that is when the exact behavior I had initially pictured when looking at the memory manager manifested itself.
All memory accesses on tensors moved to the host device were done on uninitialized memory, and the figurative explosions were quite spectacular.
This issue is present in other cases too (e.g. a host->device async copy immediately followed by a host-side free).
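For instance (a hypothetical sketch; `dev_ptr` and `stream` stand in for a device allocation and a non-default stream):

```cpp
#include <cuda_runtime.h>
#include <vector>

void UploadAndDiscard(float* dev_ptr, cudaStream_t stream) {
    std::vector<float> staging(1 << 20, 1.0f);
    cudaMemcpyAsync(dev_ptr, staging.data(),
                    staging.size() * sizeof(float), cudaMemcpyHostToDevice,
                    stream);
    // `staging` is freed when the function returns; if the async copy
    // has not completed by then, the copy engine reads freed memory.
}
```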
Besides this, I later noticed the library performs device-wide synchronization in many cases, instead of per-stream synchronization. This makes performance-critical applications (where proper stream management is key) suffer greatly.
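The distinction, in CUDA runtime terms:

```cpp
#include <cuda_runtime.h>

// Device-wide synchronization: blocks until every stream on the device
// is idle, serializing otherwise independent threads.
void SyncWholeDevice() { cudaDeviceSynchronize(); }

// Per-stream synchronization: waits only for work enqueued on the given
// stream, leaving other streams free to make progress.
void SyncOneStream(cudaStream_t stream) { cudaStreamSynchronize(stream); }
```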
I decided to fix my problems by tweaking the memory manager and adding the necessary stream management functionality to achieve my goals. I must stress that this was done in an effort to solve my problems, and not in the proper "contribute to the library" way. There are several reasons for this, the primary of course being time, but the secondary being that my knowledge of O3D is nowhere near the level I'd want (yet) to be able to contribute in the way I'd like.
I'm providing this PR as a blueprint of what I believe needs to be done, and as an outline of the issue at hand.
I'm using the backing branch of the PR as a way to test solutions, and would welcome any input from O3D devs on how best to solve the limitations.
I understand the viewpoint that, in the end, you could say this is a user-facing issue, and the user should make sure any memory copies go through by taking care of object lifetimes themselves. However, O3D has many cases where it just can't work in multi-CUDA-stream mode, as these async memcpy operations break many other things.
It is also my opinion that there should be no way for users to introduce undefined behavior into their applications that warps the intuition that host data is always synchronous and safe to work with (it is not, in the outlined use cases). By this I mean: there should be no way to change the CUDA stream at all, as the library gives absolutely no guarantees that it will be able to live with that (the vast majority of workflows do not survive this change).
Checklist:
- I have run `python util/check_style.py --apply` to apply Open3D code style to my code.
- Documentation is updated accordingly.
- I have added or updated unit tests OR included test results (e.g. screenshots or numbers) here.
Description