Skip to content

Conversation

@Alexei-V-Ivanov-AMD
Copy link
Collaborator

@Alexei-V-Ivanov-AMD Alexei-V-Ivanov-AMD commented Nov 3, 2025

Enabling cooperative multi-gpu tests on multi-gpu nodes.

Signed-off-by: Alexei V. Ivanov [email protected]

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables cooperative multi-gpu tests by adding the container user to the render group for GPU device access and by removing the hardcoded HIP_VISIBLE_DEVICES=0 to allow tests to use multiple GPUs. The changes are correct in principle. However, I've pointed out a high-severity issue regarding the robustness of how the render group GID is obtained. My suggestions aim to make the script more robust and easier to debug by adding explicit checks and centralizing the logic, which also eliminates code duplication.

--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--group-add $(getent group render | cut -d: -f3) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The command substitution $(getent group render | cut -d: -f3) can produce an empty string if the render group does not exist on the host, causing the docker run command to fail with a confusing error. Since this logic is duplicated in the else block, it's best to perform the check once before the if/else block, store the GID in a variable, and reuse it.

This improves robustness, provides clearer error messages, and avoids code duplication. For example:

# Before the if-else block
render_gid=$(getent group render | cut -d: -f3)
if [[ -z "$render_gid" ]]; then
  echo "Error: 'render' group not found. This is required for GPU access." >&2
  exit 1
fi

# ... then inside both docker run commands:
--group-add "$render_gid" \

--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--group-add $(getent group render | cut -d: -f3) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This duplicates the logic for getting the render group GID. As suggested in my other comment, this should be done once before the if/else block to improve robustness and avoid duplication.

@gshtras gshtras added rocm Related to AMD ROCm ready ONLY add when PR is ready to merge/full CI is needed labels Nov 3, 2025
@Alexei-V-Ivanov-AMD Alexei-V-Ivanov-AMD force-pushed the MAIN_20251103 branch 5 times, most recently from 6ef67e3 to 2fe0abd Compare November 4, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants