TwentyPast4 (Contributor) commented Oct 29, 2025

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality). Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #

Motivation and Context

In my experience, working with non-default CUDA streams is very dangerous in the current state of O3D.
A simple example that produces very unexpected behavior:

ScopedCUDAStream scope(ScopedCUDAStream::CreateNew);
Tensor test = Tensor::Init<float>({0.f}, "CUDA:0");
std::cout << test.ToString() << std::endl; // Uninitialized memory access!!

While a very similar-looking piece of code doesn't have this issue:

ScopedCUDAStream scope(cuda::GetDefaultStream());
Tensor test = Tensor::Init<float>({0.f}, "CUDA:0");
std::cout << test.ToString() << std::endl; // OK!

The problem boils down to how the CUDA memory manager is implemented.
Issues are masked by the synchronization behavior of the CUDA default stream, which differs from non-default streams: https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html
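For reference, here is a minimal standalone CUDA sketch (not Open3D code) of the behavior the linked document describes: operations issued on the legacy default stream synchronize with all blocking streams on the device, which is exactly the kind of implicit ordering that can hide a missing synchronization when everything happens to run on the default stream.

#include <cuda_runtime.h>

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);  // a blocking (default-flags) non-default stream

    float* d = nullptr;
    cudaMalloc(&d, sizeof(float));

    cudaMemsetAsync(d, 0, sizeof(float), s);  // async work queued on s

    // This operation on the legacy default stream (stream 0) will not start
    // until the work queued on s has finished, and later work on s will wait
    // for it in turn. A stream created with cudaStreamNonBlocking does not
    // get this implicit ordering.
    cudaMemsetAsync(d, 0, sizeof(float), 0);

    cudaDeviceSynchronize();
    cudaFree(d);
    cudaStreamDestroy(s);
    return 0;
}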

The CUDA memory manager's Memcpy is implemented to perform async memcpy operations regardless of the direction of the copy (device->device, device->host, host->device). This was a huge red flag when I first saw it, but the library worked, so I assumed it was somehow fine and that my mental model was wrong.

My initial thinking was: if device->host copies are async, wouldn't the library break pretty much all over the place, at every point where operations are done on tensors of the form gpu_tensor.To("CPU:0").AnyOperationHere()? In such a case, host-native operations would run on memory that had not yet been filled. In practice this was not happening, so I went on with my life. Eventually, in my multi-threaded workflows, I got annoyed that the threads were being implicitly synchronized: they shared the same CUDA stream and therefore waited on each other's work every time the stream needed to be synchronized. I then tried to use the scoped stream to create a new CUDA stream on each thread, so that the threads would not interfere with each other.

To my horror, that is when exactly the behavior I had pictured when first reading the memory manager manifested itself: all memory accesses on tensors moved to the host device hit uninitialized memory, and the figurative explosions are quite spectacular. The issue is present in other cases too (e.g. a host->device async copy immediately followed by a host-side free).
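To make the hazard concrete, here is a minimal standalone CUDA sketch (not Open3D code) of the device->host case. It assumes pinned host memory so the copy is truly asynchronous; the cudaStreamSynchronize call is the kind of guard that is missing when a non-default stream is current.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);  // non-default stream, as in the ScopedCUDAStream example above

    float* d = nullptr;
    float* h = nullptr;
    cudaMalloc(&d, sizeof(float));
    cudaMallocHost(&h, sizeof(float));  // pinned host memory, so the copy below is truly async
    cudaMemsetAsync(d, 0, sizeof(float), stream);

    // The device -> host copy is only queued on the stream; it has not finished yet.
    cudaMemcpyAsync(h, d, sizeof(float), cudaMemcpyDeviceToHost, stream);

    // Without this guard, *h may still hold stale data when the host reads it below.
    cudaStreamSynchronize(stream);

    std::printf("%f\n", *h);

    cudaFreeHost(h);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return 0;
}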

Besides this, I later noticed that the library performs device-wide synchronization in many places instead of per-stream synchronization. This makes performance-critical applications, where proper stream management is key, suffer greatly.
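For illustration, a standalone sketch of the difference in scope (not Open3D code; wait_for_my_work is a hypothetical helper):

#include <cuda_runtime.h>

// Hypothetical helper: wait only for the work this thread queued on its own stream.
void wait_for_my_work(cudaStream_t stream) {
    // Device-wide: blocks until every stream on the device is idle, so threads
    // that use their own streams end up stalling on each other's work.
    // cudaDeviceSynchronize();

    // Per-stream: blocks only until the work previously queued on `stream` has
    // finished; other streams keep running.
    cudaStreamSynchronize(stream);
}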

I decided to fix my problems by tweaking the memory manager and adding the stream management functionality needed to achieve my goals. I must stress that this was done to solve my own problems, not in the proper "contribute to the library" way. There are several reasons for this, the primary one of course being time, but also that my knowledge of O3D is nowhere near the level I'd want (yet) to be able to contribute in the way I'd like.

I'm providing this PR as a blueprint of what I believe needs to be done, and as an outline of the issue at hand.
I'm using the backing branch of this PR as a way to test solutions, and would welcome any input from O3D devs on how best to address these limitations.

I understand the viewpoint that, in the end, this could be called a user-facing issue, and that users should make sure any memory copies go through by managing object lifetimes themselves. However, O3D has many places where it simply cannot work in multi-CUDA-stream mode, because these async memcpy operations break many other things.

It is also my opinion that there should be no way for users to introduce such heavily undefined behavior into their applications, behavior that warps the intuition that host data is always synchronous and safe to work with (it is not in the use cases outlined above). In other words, as long as the library gives absolutely no guarantees that it can cope with a different stream, there should be no way to change the CUDA stream at all (the vast majority of workflows do not survive this change).

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

  • Added a memory synchronization guard for non-default CUDA streams.
  • Swapped out all device-wide CUDA synchronization for per-stream synchronization.
  • Changed cudaMemset to cudaMemsetAsync (see the sketch below).
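As an illustration of the last two points, a minimal standalone sketch (not the actual Open3D change; zero_on_stream is a hypothetical helper):

#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: zero a device buffer as part of a given stream's work.
void zero_on_stream(void* device_ptr, std::size_t bytes, cudaStream_t stream) {
    // cudaMemset runs on the default stream rather than on `stream`:
    // cudaMemset(device_ptr, 0, bytes);

    // cudaMemsetAsync is enqueued on `stream`, so the fill is ordered with the
    // rest of that stream's work and can be waited on per stream:
    cudaMemsetAsync(device_ptr, 0, bytes, stream);
    // cudaStreamSynchronize(stream);  // only when the result must be visible to the host
}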


update-docs bot commented Oct 29, 2025

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

TwentyPast4 marked this pull request as draft on October 30, 2025 at 12:48