-
-
Notifications
You must be signed in to change notification settings - Fork 11k
Enabling cooperative multi-gpu tests on multi-gpu nodes #27986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enabling cooperative multi-gpu tests on multi-gpu nodes #27986
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request enables cooperative multi-gpu tests by adding the container user to the render group for GPU device access and by removing the hardcoded HIP_VISIBLE_DEVICES=0 to allow tests to use multiple GPUs. The changes are correct in principle. However, I've pointed out a high-severity issue regarding the robustness of how the render group GID is obtained. My suggestions aim to make the script more robust and easier to debug by adding explicit checks and centralizing the logic, which also eliminates code duplication.
| --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \ | ||
| --network=host \ | ||
| --shm-size=16gb \ | ||
| --group-add $(getent group render | cut -d: -f3) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command substitution $(getent group render | cut -d: -f3) can produce an empty string if the render group does not exist on the host, causing the docker run command to fail with a confusing error. Since this logic is duplicated in the else block, it's best to perform the check once before the if/else block, store the GID in a variable, and reuse it.
This improves robustness, provides clearer error messages, and avoids code duplication. For example:
# Before the if-else block
render_gid=$(getent group render | cut -d: -f3)
if [[ -z "$render_gid" ]]; then
echo "Error: 'render' group not found. This is required for GPU access." >&2
exit 1
fi
# ... then inside both docker run commands:
--group-add "$render_gid" \| --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \ | ||
| --network=host \ | ||
| --shm-size=16gb \ | ||
| --group-add $(getent group render | cut -d: -f3) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6ef67e3 to
2fe0abd
Compare
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
2fe0abd to
5bc9dae
Compare
Enabling cooperative multi-gpu tests on multi-gpu nodes.
Signed-off-by: Alexei V. Ivanov [email protected]