Skip to content

Conversation

@alexandraBara
Copy link
Collaborator

@alexandraBara alexandraBara commented Sep 23, 2025

Functionality supported:

   version = self._get_amdsmi_version()
            processes = self.get_process()
            partition = self.get_partition()
            firmware = self.get_firmware()
            gpu_list = self.get_gpu_list()
            statics = self.get_static()

How to run:

node-scraper run-plugin AmdSmiPlugin

It is the users responsibility to have Python API installed: https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#


return out

def _smi_try(self, fn, *a, default=None, **kw):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added here


return out

def _get_soc_pstate(self, h) -> StaticSocPstate | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them.

except ValidationError:
return None

def _get_xgmi_plpd(self, h) -> StaticXgmiPlpd | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them.

except ValidationError:
return None

def _get_cache_info(self, h) -> list[StaticCacheInfoItem]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them. Would also be good to use a more descriptive variable name.


return out

def _get_clock(self, h) -> StaticClockData | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them.

"""Collect AmdSmi data from system

Args:
args (_type_, optional): _description_. Defaults to None.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring placeholder needs to be updated.

self._log_event(
category=EventCategory.PLATFORM,
description=f"{key} is not consistent across all GPUs",
priority=EventPriority.ERROR,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is inconsistency here always an error condition? Just wondering if maybe 'warning' would be more appropriate for the priority here

Comment on lines 432 to 437
if args.l0_to_recovery_count_error_threshold is None:
args.l0_to_recovery_count_error_threshold = self.L0_TO_RECOVERY_COUNT_ERROR_THRESHOLD
if args.l0_to_recovery_count_warning_threshold is None:
args.l0_to_recovery_count_warning_threshold = (
self.L0_TO_RECOVERY_COUNT_WARNING_THRESHOLD
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be better to have these default values in the analyzer args model itself rather than having them as class variables and setting them manually like this.

Comment on lines 454 to 459
if args.expected_memory_partition_mode or args.expected_compute_partition_mode:
self.check_expected_memory_partition_mode(
data.partition,
args.expected_memory_partition_mode,
args.expected_compute_partition_mode,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this check does not depend on the 'static' attribute of data, why is it being skipped when static is None?

Comment on lines +410 to +413
na_validator_dict = field_validator("clock", mode="before")(na_to_none_dict)
na_validator = field_validator("soc_pstate", "xgmi_plpd", "vbios", "limit", mode="before")(
na_to_none
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these validators it might make sense to move the logic into the basemodels that they are validating instead of doing them here.

if getattr(self, "_amdsmi", None) is not None:
return True
try:
self._amdsmi = importlib.import_module("amdsmi")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work targeting a remote system since the import will operate on the executor not that target system. Might need to implement a check to see if this plugin is getting run at a remote target and exit with an appropriate error if it is remote.

Comment on lines +200 to +201
version: Optional[str] = None
amdsmi_library_version: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I ran it this looked like a dict but it is str. Do we types pick based on the amd-smi lib source code/documentation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +100 to +101
def na(x) -> bool:
return x is None or (isinstance(x, str) and x.strip().upper() in {"N/A", "NA", ""})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the 'na_to_none' function be used here instead?

category=EventCategory.APPLICATION,
description="Failed to build ProcessListItem; skipping entry",
data={
"exception": get_exception_traceback(e),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For validation errors it will be better to log the errors like this:

data={"errors": exception.errors(include_url=False)},

This could be applied to all other places where ValidationError is being caught as well.

"""

if not self._bind_amdsmi_or_log():
self.result.status = ExecutionStatus.NOT_RAN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if perhaps the result should be EXECUTION_FAILURE here instead of NOT_RAN for the case that amdsmi is not available

@alexandraBara alexandraBara marked this pull request as draft November 15, 2025 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants