-
Notifications
You must be signed in to change notification settings - Fork 1
AmdSmiPlugin part 1 #42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
Conversation
|
|
||
| return out | ||
|
|
||
| def _smi_try(self, fn, *a, default=None, **kw): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type hints should be added here
|
|
||
| return out | ||
|
|
||
| def _get_soc_pstate(self, h) -> StaticSocPstate | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type hints should be added, and docstring updated to reflect them.
| except ValidationError: | ||
| return None | ||
|
|
||
| def _get_xgmi_plpd(self, h) -> StaticXgmiPlpd | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type hints should be added, and docstring updated to reflect them.
| except ValidationError: | ||
| return None | ||
|
|
||
| def _get_cache_info(self, h) -> list[StaticCacheInfoItem]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type hints should be added, and docstring updated to reflect them. Would also be good to use a more descriptive variable name.
|
|
||
| return out | ||
|
|
||
| def _get_clock(self, h) -> StaticClockData | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type hints should be added, and docstring updated to reflect them.
| """Collect AmdSmi data from system | ||
|
|
||
| Args: | ||
| args (_type_, optional): _description_. Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring placeholder needs to be updated.
| self._log_event( | ||
| category=EventCategory.PLATFORM, | ||
| description=f"{key} is not consistent across all GPUs", | ||
| priority=EventPriority.ERROR, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is inconsistency here always an error condition? Just wondering if maybe 'warning' would be more appropriate for the priority here
| if args.l0_to_recovery_count_error_threshold is None: | ||
| args.l0_to_recovery_count_error_threshold = self.L0_TO_RECOVERY_COUNT_ERROR_THRESHOLD | ||
| if args.l0_to_recovery_count_warning_threshold is None: | ||
| args.l0_to_recovery_count_warning_threshold = ( | ||
| self.L0_TO_RECOVERY_COUNT_WARNING_THRESHOLD | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be better to have these default values in the analyzer args model itself rather than having them as class variables and setting them manually like this.
| if args.expected_memory_partition_mode or args.expected_compute_partition_mode: | ||
| self.check_expected_memory_partition_mode( | ||
| data.partition, | ||
| args.expected_memory_partition_mode, | ||
| args.expected_compute_partition_mode, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this check does not depend on the 'static' attribute of data, why is it being skipped when static is None?
| na_validator_dict = field_validator("clock", mode="before")(na_to_none_dict) | ||
| na_validator = field_validator("soc_pstate", "xgmi_plpd", "vbios", "limit", mode="before")( | ||
| na_to_none | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these validators it might make sense to move the logic into the basemodels that they are validating instead of doing them here.
| if getattr(self, "_amdsmi", None) is not None: | ||
| return True | ||
| try: | ||
| self._amdsmi = importlib.import_module("amdsmi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work targeting a remote system since the import will operate on the executor not that target system. Might need to implement a check to see if this plugin is getting run at a remote target and exit with an appropriate error if it is remote.
| version: Optional[str] = None | ||
| amdsmi_library_version: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran it this looked like a dict but it is str. Do we types pick based on the amd-smi lib source code/documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it is a dict, ill update https://github.com/ROCm/amdsmi/blob/0695e4407855bfbc76c343b765f745b06e5079d0/py-interface/amdsmi_interface.py#L3278
| def na(x) -> bool: | ||
| return x is None or (isinstance(x, str) and x.strip().upper() in {"N/A", "NA", ""}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the 'na_to_none' function be used here instead?
| category=EventCategory.APPLICATION, | ||
| description="Failed to build ProcessListItem; skipping entry", | ||
| data={ | ||
| "exception": get_exception_traceback(e), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For validation errors it will be better to log the errors like this:
| data={"errors": exception.errors(include_url=False)}, |
This could be applied to all other places where ValidationError is being caught as well.
| """ | ||
|
|
||
| if not self._bind_amdsmi_or_log(): | ||
| self.result.status = ExecutionStatus.NOT_RAN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if perhaps the result should be EXECUTION_FAILURE here instead of NOT_RAN for the case that amdsmi is not available
Functionality supported:
How to run:
It is the users responsibility to have Python API installed: https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#