AmdSmiPlugin part 1 #42

alexandraBara · 2025-09-23T19:51:21Z

Functionality supported:

    version = self._get_amdsmi_version()
    processes = self.get_process()
    partition = self.get_partition()
    firmware = self.get_firmware()
    gpu_list = self.get_gpu_list()
    statics = self.get_static()

How to run:

node-scraper run-plugin AmdSmiPlugin

It is the user's responsibility to have the amdsmi Python API installed: https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#


landrews-amd · 2025-10-08T04:57:44Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+
+        return out
+
+    def _smi_try(self, fn, *a, default=None, **kw):


Type hints should be added here
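One way the requested hints could look is sketched below. This is a minimal illustration only: the diff does not show the body of `_smi_try`, so the error handling here (catch any exception and return `default`) is an assumption, and the function is written as a free function named `smi_try` rather than as the collector method.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def smi_try(
    fn: Callable[..., T],
    *args: object,
    default: Optional[T] = None,
    **kwargs: object,
) -> Optional[T]:
    """Call ``fn(*args, **kwargs)`` and return ``default`` if the call raises.

    Hypothetical body: the real method's error handling is not visible in the diff.
    """
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Assumed behavior: swallow the failure and fall back to the default.
        return default
```

Using a `TypeVar` lets a type checker infer the return type from the wrapped callable, so call sites do not need casts.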

landrews-amd · 2025-10-08T04:58:17Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+
+        return out
+
+    def _get_soc_pstate(self, h) -> StaticSocPstate | None:


Type hints should be added, and docstring updated to reflect them.

landrews-amd · 2025-10-08T04:58:26Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+        except ValidationError:
+            return None
+
+    def _get_xgmi_plpd(self, h) -> StaticXgmiPlpd | None:


Type hints should be added, and docstring updated to reflect them.

landrews-amd · 2025-10-08T04:58:53Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+        except ValidationError:
+            return None
+
+    def _get_cache_info(self, h) -> list[StaticCacheInfoItem]:


Type hints should be added, and docstring updated to reflect them. Would also be good to use a more descriptive variable name.

landrews-amd · 2025-10-08T04:59:16Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+
+        return out
+
+    def _get_clock(self, h) -> StaticClockData | None:


Type hints should be added, and docstring updated to reflect them.

landrews-amd · 2025-10-08T04:59:46Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+        """Collect AmdSmi data from system
+
+        Args:
+            args (_type_, optional): _description_. Defaults to None.


Docstring placeholder needs to be updated.

landrews-amd · 2025-11-10T06:07:10Z

nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py

+                self._log_event(
+                    category=EventCategory.PLATFORM,
+                    description=f"{key} is not consistent across all GPUs",
+                    priority=EventPriority.ERROR,


Is inconsistency here always an error condition? Just wondering whether 'warning' would be a more appropriate priority here.

landrews-amd · 2025-11-10T06:14:01Z

nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py

+        if args.l0_to_recovery_count_error_threshold is None:
+            args.l0_to_recovery_count_error_threshold = self.L0_TO_RECOVERY_COUNT_ERROR_THRESHOLD
+        if args.l0_to_recovery_count_warning_threshold is None:
+            args.l0_to_recovery_count_warning_threshold = (
+                self.L0_TO_RECOVERY_COUNT_WARNING_THRESHOLD
+            )


It may be better to define these default values in the analyzer args model itself rather than keeping them as class variables and setting them manually like this.
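A rough sketch of that suggestion, assuming the args model is a Pydantic model. The class name `AmdSmiAnalyzerArgsSketch`, the `int` field types, and the default values `3` and `1` are placeholders for illustration; the real thresholds are not shown in the diff.

```python
from pydantic import BaseModel


class AmdSmiAnalyzerArgsSketch(BaseModel):
    """Hypothetical args model carrying its own defaults."""

    # Placeholder defaults, not the values used in the PR.
    l0_to_recovery_count_error_threshold: int = 3
    l0_to_recovery_count_warning_threshold: int = 1


# Callers that pass nothing get the defaults; explicit values still override them.
default_args = AmdSmiAnalyzerArgsSketch()
custom_args = AmdSmiAnalyzerArgsSketch(l0_to_recovery_count_error_threshold=10)
```

With the defaults on the model, the analyzer no longer needs the `is None` checks or the class-level constants.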

landrews-amd · 2025-11-10T06:16:15Z

nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py

+            if args.expected_memory_partition_mode or args.expected_compute_partition_mode:
+                self.check_expected_memory_partition_mode(
+                    data.partition,
+                    args.expected_memory_partition_mode,
+                    args.expected_compute_partition_mode,
+                )


It looks like this check does not depend on the 'static' attribute of data. Why is it being skipped when static is None?

graepaul · 2025-11-14T18:05:01Z

nodescraper/plugins/inband/amdsmi/amdsmidata.py

+    na_validator_dict = field_validator("clock", mode="before")(na_to_none_dict)
+    na_validator = field_validator("soc_pstate", "xgmi_plpd", "vbios", "limit", mode="before")(
+        na_to_none
+    )


For these validators it might make sense to move the logic into the basemodels that they are validating instead of doing them here.

graepaul · 2025-11-14T18:08:34Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+        if getattr(self, "_amdsmi", None) is not None:
+            return True
+        try:
+            self._amdsmi = importlib.import_module("amdsmi")


This doesn't work when targeting a remote system, since the import runs on the executor, not on the target system. We might need to check whether this plugin is being run against a remote target and, if so, exit with an appropriate error.

graepaul · 2025-11-14T18:15:58Z

nodescraper/plugins/inband/amdsmi/amdsmidata.py

+    version: Optional[str] = None
+    amdsmi_library_version: Optional[str] = None


When I ran it, this looked like a dict, but it is typed as str. Do we pick types based on the amd-smi lib source code/documentation?

Yeah, it is a dict, I'll update: https://github.com/ROCm/amdsmi/blob/0695e4407855bfbc76c343b765f745b06e5079d0/py-interface/amdsmi_interface.py#L3278

landrews-amd · 2025-11-14T16:49:07Z

nodescraper/plugins/inband/amdsmi/amdsmidata.py

+        def na(x) -> bool:
+            return x is None or (isinstance(x, str) and x.strip().upper() in {"N/A", "NA", ""})


Could the 'na_to_none' function be used here instead?
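A possible shape for that reuse is sketched below. The body of `na_to_none` is not shown in the diff, so the version here is an assumption: it is taken to map `None` and the same placeholder strings (`"N/A"`, `"NA"`, empty) to `None` and pass everything else through.

```python
from typing import Any, Optional

_NA_STRINGS = {"N/A", "NA", ""}


def na_to_none(value: Any) -> Optional[Any]:
    """Assumed behavior: map None and N/A-style placeholder strings to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().upper() in _NA_STRINGS:
        return None
    return value


def na(x: Any) -> bool:
    """Return True when x is missing, delegating the check to na_to_none."""
    return na_to_none(x) is None
```

If the real `na_to_none` treats a different set of strings as missing, the local helper and the shared one would diverge, which is one more reason to keep a single definition.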

landrews-amd · 2025-11-14T16:56:55Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+                            category=EventCategory.APPLICATION,
+                            description="Failed to build ProcessListItem; skipping entry",
+                            data={
+                                "exception": get_exception_traceback(e),


For validation errors it will be better to log the errors like this:

node-scraper/nodescraper/interfaces/dataanalyzertask.py

Line 84 in 6081dd1

data={"errors": exception.errors(include_url=False)},

This could be applied to all other places where ValidationError is being caught as well.
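As a rough illustration of that pattern, assuming Pydantic v2 (where `ValidationError.errors` accepts `include_url`): the model `ProcessListItemSketch` and the helper `build_or_describe` are hypothetical stand-ins, not names from the PR.

```python
from typing import Any, Optional

from pydantic import BaseModel, ValidationError


class ProcessListItemSketch(BaseModel):
    """Hypothetical stand-in for the real ProcessListItem model."""

    pid: int
    name: str


def build_or_describe(
    raw: dict[str, Any],
) -> tuple[Optional[ProcessListItemSketch], Optional[dict[str, Any]]]:
    """Return (item, None) on success or (None, event_data) on validation failure."""
    try:
        return ProcessListItemSketch(**raw), None
    except ValidationError as exc:
        # Structured per-field errors instead of a formatted traceback string.
        return None, {"errors": exc.errors(include_url=False)}


item, event_data = build_or_describe({"pid": "not-a-number", "name": "rocm-app"})
```

The resulting `event_data` can be passed as the `data` argument of the event log call, which keeps each failing field and message separately inspectable.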

landrews-amd · 2025-11-14T17:03:02Z

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

+        """
+
+        if not self._bind_amdsmi_or_log():
+            self.result.status = ExecutionStatus.NOT_RAN


I wonder whether the result should be EXECUTION_FAILURE here instead of NOT_RAN when amdsmi is not available.

alexandraBara and others added 25 commits August 20, 2025 14:52

admi_smi folder

f08d574

need to fix collect_data call

1eac292

updated

b6be391

added sudo for all subcmds

dc62a1f

fixed utils

6d03cd2

moved utesdt

058da99

updates

93541ab

update

cfc9ca4

update

6294315

cleanup

38884ae

merged development

a8c39df

removed extra utest

b3b6352

utest + import check

0d86d3f

adding analyzer

c7b3344

updates

359a36d

cleanup

d1e73a0

mypy

7094979

filled in data for AmdSmiStatic, clock is left

08ed3f0

added clock and fixed try for API that doesnt exist in this version

a8437e4

updated partition, and other calls that look slightly differnt

c3be354

fixed partition(compute,gpu), static needs work

7faf0f3

fixed measuring units, mypy

f4a4064

added more analyzer parts

315c7d4

fixed payload for static data mismatch

f652fe3

temporarily removed the pytest + some cleanup

b217c24

alexandraBara requested a review from landrews-amd as a code owner September 23, 2025 19:51

alexandraBara requested a review from graepaul September 23, 2025 19:51

alexandraBara added 3 commits September 26, 2025 11:23

updates

1e456c3

docstring + mypy

6406994

pytest

a9b3ed3

alexandraBara added 2 commits September 30, 2025 11:51

fixed some tpos

6613ded

Merge branch 'development' into alex_amdsmi

9669bbf

landrews-amd reviewed Nov 10, 2025


alexandraBara added 4 commits November 11, 2025 09:28

Merge branch 'development' into alex_amdsmi

e13889b

addressed reviews

0eec9f7

moved check outside static if else

05b4772

fixed utest for py3.9

84f24f2

alexandraBara requested a review from landrews-amd November 13, 2025 15:02

alexandraBara added 2 commits November 13, 2025 10:25

removed deprecated API call

107fd46

adding version info during run

fccc0eb

graepaul reviewed Nov 14, 2025


landrews-amd reviewed Nov 14, 2025


removed python API calls and using cmd line tool to enable remote runs

5cbe96f

alexandraBara marked this pull request as draft November 15, 2025 02:20

