Conversation

@derekwin
Collaborator

@derekwin derekwin commented Sep 19, 2025

Description

Attempt to add seamless MR/IPC registration by introducing a tensor API.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@derekwin
Collaborator Author

Although both the compilation container and the host machine have the HIP driver, the following error occurs when using torch::from_blob to create a tensor.

Traceback (most recent call last):
  File "/xxxx/p2p/tests/test_tensor.py", line 22, in <module>
    tensor, mr_id, ipc_id = p2p.create_tensor(0, 32, 4)
                            ~~~~~~~~~~~~~~~~~^^^^^^^^^^
RuntimeError: Cannot initialize HIP without ATen_hip library.

@derekwin
Collaborator Author

The PyTorch version installed via pip does not include the ATen_hip library, which makes it difficult to link against ATen_hip from a Makefile. (This library is also required when running the test, so users' hosts would need it installed.) While building PyTorch from source could resolve this issue, it introduces significant complexity for users.

The project kvcached uses PyTorch's CUDAExtension in its setup.py to compile its C++ source code. Notably, CUDAExtension automatically detects ROCm and sets the IS_HIP_EXTENSION flag, which could facilitate HIP compatibility. However, adopting it would require refactoring the UCCL build system.
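
For reference, a minimal sketch of what a CUDAExtension-based setup.py might look like (the extension name and source list below are hypothetical; under ROCm, BuildExtension hipifies the sources and defines IS_HIP_EXTENSION):

# Minimal sketch; "uccl_p2p" and the source list are hypothetical.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="uccl_p2p",
    ext_modules=[
        CUDAExtension(
            name="uccl_p2p",
            sources=["engine.cc", "pybind_engine.cc"],  # hypothetical sources
        )
    ],
    # BuildExtension handles nvcc/hipcc flags and hipification on ROCm.
    cmdclass={"build_ext": BuildExtension},
)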

I'd like to hear your opinion. @YangZhou1997

@derekwin
Collaborator Author

Maybe we should keep tensor creation on the Python side and register them on the C++ side.
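
Something along these lines is what I'm imagining: Python owns allocation, and C++ only registers the memory (register_tensor / deregister_tensor below are hypothetical bindings, not an existing API):

import torch
from uccl import p2p  # assumed import path for the bindings

# Sketch: Python allocates the tensor, C++ registers its memory.
tensor = torch.empty((2, 4, 2), dtype=torch.float32, device="cuda:0")
nbytes = tensor.numel() * tensor.element_size()

# Hypothetical bindings: register/deregister the MR and IPC handle
# for an existing device buffer instead of allocating in C++.
mr_id, ipc_id = p2p.register_tensor(tensor.data_ptr(), nbytes)
p2p.deregister_tensor(mr_id, ipc_id)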

@derekwin
Collaborator Author

derekwin commented Sep 20, 2025

Tensor creation and destruction now automatically register and deregister the associated MR and IPC resources:

tensor, tensor_id = utils.create_tensor((2, 4, 2), torch.float32, f"cuda:{use_gpu_id}")
t_id = utils.get_tensor_id_by_tensor(tensor)
utils.free_tensor(tensor)

@derekwin
Collaborator Author

The endpoint and collective context appear ready to be refactored to use the new IPC and MR management.

@YangZhou1997
Member

Wow, that looks great. So I guess "keep tensor creation on the Python side and register them on the C++ side" indeed solves the problems? I will take a look at the code sometime today! Thank you for the great work!

Member

@YangZhou1997 YangZhou1997 left a comment

LGTM! Just one question: Has this been integrated into the collective.py?

@derekwin
Collaborator Author

LGTM! Just one question: Has this been integrated into the collective.py?

Not yet — it's currently a standalone module. We need to refactor the C++ CollectiveContext and Endpoint components for integration, which may involve removing all current registration logic. This will be a substantial refactor.

@derekwin
Collaborator Author

I think we should wait for PR #351 to be merged first. Otherwise, we might run into merge conflicts.

@derekwin
Collaborator Author

derekwin commented Oct 9, 2025

weakref.finalize() lets us automatically run our deregistration logic when a tensor is garbage collected.
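
A minimal sketch of the pattern (_register and _deregister are stand-ins for the actual MR/IPC bindings):

import weakref
import torch

def create_tensor(shape, dtype, device):
    tensor = torch.empty(shape, dtype=dtype, device=device)
    tensor_id = _register(tensor)  # hypothetical MR/IPC registration hook
    # The finalizer runs once the tensor is garbage collected. It must only
    # capture tensor_id, not the tensor itself, or the strong reference
    # would keep the tensor alive forever.
    weakref.finalize(tensor, _deregister, tensor_id)
    return tensor, tensor_id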

@derekwin derekwin requested a review from YangZhou1997 October 11, 2025 07:10
@YangZhou1997
Member

YangZhou1997 commented Oct 11, 2025

This is great work! I wonder how you would close any exported IPC handles in other processes. We were discussing that it would require cross-process SHM for coordination? It seems ipc_handler is never closed now?

For example, is this https://github.com/derekwin/uccl-dev/blob/88651a4013a352ad07282a692ff0758bdea413d9/p2p/engine.cc#L1337-L1395 compatible with the current APIs?

Another note is that open_ipc_handle never seems to be used?

Member

@YangZhou1997 YangZhou1997 left a comment

The code looks great. One quick comment: can we also keep the manual reg and dereg interfaces? The vLLM/NIXL side has integrated with UCCL p2p, so we want to keep compatibility for now (we will ultimately evolve toward the auto strategy).

@derekwin
Collaborator Author

This is great work! I wonder how you would close any exported IPC handles in other processes. We were discussing that it would require cross-process SHM for coordination? It seems ipc_handler is never closed now?

For example, is this https://github.com/derekwin/uccl-dev/blob/88651a4013a352ad07282a692ff0758bdea413d9/p2p/engine.cc#L1337-L1395 compatible with the current APIs?

Another note is that open_ipc_handle never seems to be used?

OK, I'll consider integrating this with the IPC cache later. Currently, the IPC handle logic is still coordinated inside the send/recv transmission interfaces, so these APIs aren't being used yet.

Regarding sharing the IPC handle, I think an opened IPC handle is private to the process that opened it. So, is it meaningful to share it?
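
For the cache idea, I'm thinking of something like this per-process sketch (close_ipc_handle is a hypothetical counterpart to open_ipc_handle; each process can only close mappings it opened itself):

# Sketch of a per-process cache of opened IPC handles; open_ipc_handle is
# the existing binding, close_ipc_handle a hypothetical counterpart.
_ipc_cache = {}  # handle bytes -> locally mapped device pointer

def get_mapped_ptr(handle_bytes):
    ptr = _ipc_cache.get(handle_bytes)
    if ptr is None:
        ptr = open_ipc_handle(handle_bytes)  # maps the remote allocation locally
        _ipc_cache[handle_bytes] = ptr
    return ptr

def close_all():
    # An opened handle is private to this process, so only this process
    # can (and must) close its own mappings.
    for ptr in _ipc_cache.values():
        close_ipc_handle(ptr)
    _ipc_cache.clear()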

@derekwin
Collaborator Author

The code looks great. One quick comment: can we also keep the manual reg and dereg interfaces? The vLLM/NIXL side has integrated with UCCL p2p, so we want to keep compatibility for now (we will ultimately evolve toward the auto strategy).

ok.
