Commit 1725f1a
Increase timeouts for daemon registration with Kineto (#1158)
Summary:
Pull Request resolved: #1158
We've been seeing GPU timeout failures that affect the last three GPUs in our 8-worker setups. The root cause is a "Thundering Herd + Timeout" problem in Dynolog's IPCMonitor, and this change addresses it by raising the IPC timeout.
Previously, when all eight processes simultaneously sent IPC requests to Dynolog, the single-threaded IPCMonitor would process these requests serially. Each request took approximately 10ms, causing later processes to exceed the original 50ms timeout. For instance, the Dynolog logs showed:
```
20:24:45.391549 - Process 2202 registered
20:24:45.401608 - Process 2201 registered (+10ms)
...
20:24:45.441941 - Process 2204 registered (+10ms)
20:24:45.452018 - Process 2206 registered (+10ms)
20:24:45.462101 - Process 2205 registered (+10ms)
```
This serial processing meant that the 6th, 7th, and 8th processes (2204, 2206, and 2205 respectively) were significantly delayed. As a result, they failed with errors like:
```
ERROR:2025-10-13 20:24:45 2204:2265 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2206:2266 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2205:2267 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
```
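As a rough check of the timing above, the following sketch parses the registration timestamps shown in the Dynolog log excerpt and flags which processes registered later than a given timeout. It assumes the clients' timeout clocks start at about the first registration timestamp, since the actual client-side send time is not in the logs. The helper names `parse` and `late_pids` are hypothetical and are not part of Dynolog or Kineto. Only the log lines visible in the excerpt are included; the elided ones are left out.

```python
from datetime import datetime, timedelta

# Registration lines copied from the log excerpt (elided lines omitted).
LOG_LINES = [
    "20:24:45.391549 - Process 2202 registered",
    "20:24:45.401608 - Process 2201 registered",
    "20:24:45.441941 - Process 2204 registered",
    "20:24:45.452018 - Process 2206 registered",
    "20:24:45.462101 - Process 2205 registered",
]


def parse(line):
    """Return (pid, timestamp) for one registration log line."""
    stamp, _, rest = line.partition(" - ")
    pid = int(rest.split()[1])
    return pid, datetime.strptime(stamp, "%H:%M:%S.%f")


def late_pids(lines, timeout_ms):
    """PIDs whose registration landed more than timeout_ms after the first one.

    Assumption: the first registration time stands in for the moment all
    processes sent their requests.
    """
    entries = [parse(line) for line in lines]
    start = entries[0][1]
    limit = timedelta(milliseconds=timeout_ms)
    return [pid for pid, ts in entries if ts - start > limit]


print(late_pids(LOG_LINES, 50))  # processes past the original 50ms timeout
print(late_pids(LOG_LINES, 90))  # processes past a 90ms timeout
```

Under that assumption, the three processes past the 50ms mark are the same three that report `IPC recv fail`.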
To resolve this, I've increased the IPC timeout from 50ms to 90ms. We observed approximately 10ms of processing time per rank, so 8 ranks need about 80ms, and the remaining 10ms is buffer. With a 90ms timeout, all eight processes have time to register even when they send their requests simultaneously, so every GPU can initialize without hitting these timeout errors.
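The sizing arithmetic can be written out as a small sketch. It assumes a purely serial model in which the k-th request completes after k times the per-rank processing time. The function names are hypothetical illustrations, not Kineto or Dynolog APIs, and the 10ms buffer is simply the difference between 90ms and the 80ms needed for 8 ranks.

```python
def registration_timeout_ms(ranks: int, per_rank_ms: int, buffer_ms: int) -> int:
    """Timeout needed for the last of `ranks` serially processed requests, plus a buffer."""
    return ranks * per_rank_ms + buffer_ms


def max_ranks_within(timeout_ms: int, per_rank_ms: int) -> int:
    """Number of serially processed requests that complete within timeout_ms."""
    return timeout_ms // per_rank_ms


print(registration_timeout_ms(ranks=8, per_rank_ms=10, buffer_ms=10))  # 90
print(max_ranks_within(timeout_ms=50, per_rank_ms=10))                 # 5
```

In this model the original 50ms timeout covers only the first five requests, which is consistent with the 6th, 7th, and 8th processes failing.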
Reviewed By: sraikund16
Differential Revision: D84573484
fbshipit-source-id: b89b29d182e4566100ca742c30de695715a70cfa
1 parent 09d0e5e, commit 1725f1a
1 file changed, +3 -3 lines changed
(Diff body not captured. The surviving line numbers show one line replaced at line 98 and two lines replaced at lines 188-189.)