-
Notifications
You must be signed in to change notification settings - Fork 103
Add seamless MR/IPC registration for P2P #394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Although both the compilation container and the host machine have the HIP driver, the following error occurs when using |
|
The PyTorch version installed via The project kvcached uses PyTorch's I'd like to hear your opinion. @YangZhou1997 |
|
Maybe we should keep tensor creation on the Python side and register them on the C++ side. |
|
Tensor creation and destruction now automatically register and deregister the associated MR and IPC resources.: |
|
The endpoint and collective context appear ready to be refactored to use the new IPC and MR management. |
|
Wow, that looks great. So I guess "keep tensor creation on the Python side and register them on the C++ side" indeed solves the problems? I will take a look at the code sometime today! Thank you for the great work! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just one question: Has this been integrated into the collective.py?
Not yet — it's currently a standalone module. We need to refactor the C++ CollectiveContext and Endpoint components for integration, which may involve removing all current registration logic. This will be a substantial refactor. |
|
I think should wait for this PR (#351) to be merged first. Otherwise, there might run into code conflicts. |
|
weakref.finalize() help us automatically frees tensor via our deregistration logic when the tensor is garbage collected. |
|
This is a great work! I wonder how would you close any exported IPC handlers in other processes. We were discussing that it would require cross-process SHM for coordination? It seems now ipc_handler never closed? For example, is this https://github.com/derekwin/uccl-dev/blob/88651a4013a352ad07282a692ff0758bdea413d9/p2p/engine.cc#L1337-L1395 compatible with the current APIs? Another note is that open_ipc_handle seems never used? |
YangZhou1997
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks great. One quick comment is: can we also keep the manual reg and dereg interface, as vLLM/NIXL side has integrated with UCCL p2p, so we want to keep compatibility a bit for now (we will ultimately evolve into the auto strategy)
OK, I'll consider integrating this with the IPC cache later. Currently, the ipc handle logic still operates within coordination at the send/recv transmission interfaces, so these interfaces aren't being utilized yet. Regarding sharing the IPC handle, I think the opened IPC handle is private to the process. So, is it meaningful to share them? |
ok. |
Description
Attempt to add seamless MR/IPC registration by introducing a tensor API.
Fixes # (issue)
Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
format.sh.build_and_install.shto verify compilation.