@koushikbillakanti-amd
Contributor

Motivation

The amd-smi reset -l command was failing on MI300X/MI300A systems with an AttributeError when trying to clean local GPU data. This prevented users from using the process isolation feature properly, which is critical for security in data center environments where GPUs are shared between different users.

Technical Details

Fixed the set_gpu() function in amdsmi_commands.py by replacing direct attribute access with getattr() calls, which safely handle missing attributes. For tuple and integer values, we now extract the value into a local variable before use, so that each code block reads it consistently. This defensive approach prevents crashes when different subcommands (such as reset) pass args objects with different attribute sets.
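As a rough illustration of the pattern described above (not the actual set_gpu() code), the sketch below shows how getattr() with a default lets one handler accept args objects from different subcommands. The function name and the attribute names (fan, power_cap, clean_local_data) are hypothetical and chosen only for the example.

```python
import argparse


def describe_requested_settings(args):
    """Hypothetical stand-in for a handler shared by several subcommands."""
    # getattr() with a default returns the default instead of raising
    # AttributeError when a subcommand's parser never defined the attribute.
    fan = getattr(args, "fan", None)
    power_cap = getattr(args, "power_cap", None)
    clean_local_data = getattr(args, "clean_local_data", False)

    # Values are read once into local variables above, so every branch
    # below sees the same value for the rest of the function.
    actions = []
    if isinstance(fan, int):
        actions.append(f"set fan to {fan}")
    if isinstance(power_cap, tuple):
        cap_type, cap_value = power_cap
        actions.append(f"set {cap_type} power cap to {cap_value}")
    if clean_local_data:
        actions.append("clean local data")
    return actions


# A "set"-style namespace and a "reset"-style namespace carry different
# attribute sets; both are handled without an AttributeError.
set_args = argparse.Namespace(fan=128, power_cap=("example", 300))
reset_args = argparse.Namespace(clean_local_data=True)

print(describe_requested_settings(set_args))
print(describe_requested_settings(reset_args))
```

With direct access such as args.fan, the reset-style namespace would raise AttributeError, which is the failure mode this change addresses.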

JIRA ID

Resolves SWDEV-498649

Test Plan

Tested all reset command variations locally (-l, -c, -f, etc.) and verified that they run without throwing AttributeError. Also tested set commands to ensure the changes don't break existing functionality. Testing was done on consumer GPUs to verify general correctness, although the fix matters most for MI300X/MI300A systems with partition features.

Test Result

All tested commands execute successfully without errors. The amd-smi reset -l -g <gpu_id> command now properly cleans local GPU data instead of crashing. Other commands like amd-smi static and amd-smi set continue to work as expected, confirming backward compatibility.

Submission Checklist
