Hi, the following code causes a GPU out-of-memory (OOM) error on Hopper with NVLS enabled. I am using the latest main branch.
from mscclpp import Transport, TcpBootstrap, Communicator
from mscclpp._mscclpp import Context, RawGpuBuffer
import cupy as cp
cp.cuda.Device(0).use()
bootstrap = TcpBootstrap.create(0, 1)
bootstrap.initialize(bootstrap.create_unique_id(), 60)
comm = Communicator(bootstrap)
for i in range(100):
    if i % 10 == 0:
        print(f"{i=}", flush=True)
    mem = RawGpuBuffer(2 ** 30)
    reg = comm.register_memory(mem.data(), mem.bytes(), Transport.CudaIpc)
    del reg, mem

Output:
i=0
i=10
i=20
i=30
i=40
i=50
i=60
i=70
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
mscclpp._mscclpp.CuError: (2, 'Call to result failed./.../mscclpp/src/gpu_utils.cc:128 (Cu failure: out of memory)')

The code is fine if memory is not registered. Could you please check if it can be reproduced on your side?
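
In case it helps with narrowing this down, here is a minimal sketch of how free device memory could be logged after each iteration, to compare the run with registration against the run without it. The helper names `log_free_memory` and `leaked_per_iteration` are hypothetical and not part of mscclpp; the memory query is passed in as a callable so the helper itself has no GPU dependency.

```python
from typing import Callable, List, Tuple


def log_free_memory(
    step: Callable[[int], None],
    mem_info: Callable[[], Tuple[int, int]],
    iterations: int,
) -> List[int]:
    """Run step(i) repeatedly and record free device bytes after each call.

    mem_info is assumed to return a (free_bytes, total_bytes) tuple.
    """
    free_after = []
    for i in range(iterations):
        step(i)
        free, _total = mem_info()
        free_after.append(free)
    return free_after


def leaked_per_iteration(free_after: List[int]) -> float:
    """Average drop in free bytes between consecutive recorded iterations."""
    if len(free_after) < 2:
        return 0.0
    return (free_after[0] - free_after[-1]) / (len(free_after) - 1)
```

On a GPU machine, `mem_info` could be `cp.cuda.runtime.memGetInfo`, which returns `(free, total)` in bytes, and `step` could wrap the loop body from the reproduction (allocate, register, `del`). If the registered variant shows free memory shrinking by about the buffer size each iteration while the unregistered variant stays flat, that would suggest the allocation is not released while a registration still refers to it.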