Conversation

@derekwin
Collaborator

@derekwin derekwin commented Sep 19, 2025

Description

Attempt to add seamless MR/IPC registration by introducing a tensor API.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@derekwin
Collaborator Author

Although both the compilation container and the host machine have the HIP driver, the following error occurs when using torch::from_blob to create a tensor.

Traceback (most recent call last):
  File "/xxxx/p2p/tests/test_tensor.py", line 22, in <module>
    tensor, mr_id, ipc_id = p2p.create_tensor(0, 32, 4)
                            ~~~~~~~~~~~~~~~~~^^^^^^^^^^
RuntimeError: Cannot initialize HIP without ATen_hip library.

@derekwin
Collaborator Author

The PyTorch version installed via pip does not include the ATen_hip library, which makes it difficult to link against ATen_hip from a Makefile. (This library is also required when running the test, so users' hosts would need it installed.) While building PyTorch from source could resolve this issue, it introduces significant complexity for users.

The project kvcached uses PyTorch's CUDAExtension in its setup.py to compile its C++ source code. Notably, CUDAExtension automatically detects ROCm and sets the IS_HIP_EXTENSION flag, which could facilitate HIP compatibility. However, adopting it would require refactoring the UCCL build system.
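
For reference, a minimal sketch of what a CUDAExtension-based setup.py might look like (the extension name and source list below are hypothetical; under ROCm, BuildExtension hipifies the sources and defines IS_HIP_EXTENSION):

# Minimal sketch; "uccl_p2p" and the source list are hypothetical.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="uccl_p2p",
    ext_modules=[
        CUDAExtension(
            name="uccl_p2p",
            sources=["engine.cc", "pybind_engine.cc"],  # hypothetical sources
        )
    ],
    # BuildExtension handles nvcc/hipcc flags and hipification on ROCm.
    cmdclass={"build_ext": BuildExtension},
)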

I'd like to hear your opinion. @YangZhou1997

@derekwin
Collaborator Author

Maybe we should keep tensor creation on the Python side and register them on the C++ side.
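
Something along these lines is what I'm imagining: Python owns allocation, and C++ only registers the memory (register_tensor / deregister_tensor below are hypothetical bindings, not an existing API):

import torch
from uccl import p2p  # assumed import path for the bindings

# Sketch: Python allocates the tensor, C++ registers its memory.
tensor = torch.empty((2, 4, 2), dtype=torch.float32, device="cuda:0")
nbytes = tensor.numel() * tensor.element_size()

# Hypothetical bindings: register/deregister the MR and IPC handle
# for an existing device buffer instead of allocating in C++.
mr_id, ipc_id = p2p.register_tensor(tensor.data_ptr(), nbytes)
p2p.deregister_tensor(mr_id, ipc_id)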

@derekwin
Collaborator Author

derekwin commented Sep 20, 2025

Tensor creation and destruction now automatically register and deregister the associated MR and IPC resources:

tensor, tensor_id = utils.create_tensor((2, 4, 2), torch.float32, f"cuda:{use_gpu_id}")
t_id = utils.get_tensor_id_by_tensor(tensor)
utils.free_tensor(tensor)

@derekwin
Collaborator Author

The endpoint and collective context appear ready to be refactored to use the new IPC and MR management.

@YangZhou1997
Member

Wow, that looks great. So I guess "keep tensor creation on the Python side and register them on the C++ side" indeed solves the problems? I will take a look at the code sometime today! Thank you for the great work!

Member

@YangZhou1997 YangZhou1997 left a comment

LGTM! Just one question: Has this been integrated into the collective.py?

@derekwin
Collaborator Author

LGTM! Just one question: Has this been integrated into the collective.py?

Not yet — it's currently a standalone module. We need to refactor the C++ CollectiveContext and Endpoint components for integration, which may involve removing all current registration logic. This will be a substantial refactor.

@derekwin
Collaborator Author

I think we should wait for PR #351 to be merged first. Otherwise, we might run into merge conflicts.

@derekwin
Collaborator Author

derekwin commented Oct 9, 2025

weakref.finalize() lets us automatically run our deregistration logic when a tensor is garbage collected.
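
A minimal sketch of the pattern (_register and _deregister are stand-ins for the actual MR/IPC bindings):

import weakref
import torch

def create_tensor(shape, dtype, device):
    tensor = torch.empty(shape, dtype=dtype, device=device)
    tensor_id = _register(tensor)  # hypothetical MR/IPC registration hook
    # The finalizer runs once the tensor is garbage collected. It must only
    # capture tensor_id, not the tensor itself, or the strong reference
    # would keep the tensor alive forever.
    weakref.finalize(tensor, _deregister, tensor_id)
    return tensor, tensor_id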

@derekwin derekwin requested a review from YangZhou1997 October 11, 2025 07:10
@YangZhou1997
Member

YangZhou1997 commented Oct 11, 2025

This is great work! I wonder how you would close any exported IPC handles in other processes. We were discussing that it would require cross-process SHM for coordination? It seems ipc_handler is never closed now?

For example, is this https://github.com/derekwin/uccl-dev/blob/88651a4013a352ad07282a692ff0758bdea413d9/p2p/engine.cc#L1337-L1395 compatible with the current APIs?

Another note is that open_ipc_handle never seems to be used?

Member

@YangZhou1997 YangZhou1997 left a comment

The code looks great. One quick comment: can we also keep the manual reg and dereg interfaces? The vLLM/NIXL side has integrated with UCCL p2p, so we want to keep compatibility for now (we will ultimately evolve toward the auto strategy).

@derekwin
Collaborator Author

This is great work! I wonder how you would close any exported IPC handles in other processes. We were discussing that it would require cross-process SHM for coordination? It seems ipc_handler is never closed now?

For example, is this https://github.com/derekwin/uccl-dev/blob/88651a4013a352ad07282a692ff0758bdea413d9/p2p/engine.cc#L1337-L1395 compatible with the current APIs?

Another note is that open_ipc_handle never seems to be used?

OK, I'll consider integrating this with the IPC cache later. Currently, the IPC handle logic is still coordinated inside the send/recv transmission interfaces, so these APIs aren't being used yet.

Regarding sharing the IPC handle, I think an opened IPC handle is private to the process that opened it. So, is it meaningful to share it?
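
For the cache idea, I'm thinking of something like this per-process sketch (close_ipc_handle is a hypothetical counterpart to open_ipc_handle; each process can only close mappings it opened itself):

# Sketch of a per-process cache of opened IPC handles; open_ipc_handle is
# the existing binding, close_ipc_handle a hypothetical counterpart.
_ipc_cache = {}  # handle bytes -> locally mapped device pointer

def get_mapped_ptr(handle_bytes):
    ptr = _ipc_cache.get(handle_bytes)
    if ptr is None:
        ptr = open_ipc_handle(handle_bytes)  # maps the remote allocation locally
        _ipc_cache[handle_bytes] = ptr
    return ptr

def close_all():
    # An opened handle is private to this process, so only this process
    # can (and must) close its own mappings.
    for ptr in _ipc_cache.values():
        close_ipc_handle(ptr)
    _ipc_cache.clear()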

@derekwin
Collaborator Author

The code looks great. One quick comment: can we also keep the manual reg and dereg interfaces? The vLLM/NIXL side has integrated with UCCL p2p, so we want to keep compatibility for now (we will ultimately evolve toward the auto strategy).

ok.
