Check command errors on tunningRunner (and other nits) #2138

pabloantoniom · 2025-12-01T11:43:03Z

Motivation

If any command we run with subprocess.Popen fails, we currently ignore the error and keep going. This makes debugging unnecessarily hard.

Technical Details

This PR adds 3 small improvements:

Print all commands that are about to be run (1st commit)
Abort if the tuning output is empty (2nd commit)
Check the return code from the commands we run and fail early if any of them fail (3rd commit)

Test Plan

No new test was added

Test Result

Tests pass

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR adds error handling and debugging improvements to the tuning runner to prevent silent failures and aid debugging. The changes ensure that subprocess command failures are detected and reported, rather than being silently ignored.

Key Changes

Added debug logging to print all subprocess commands before execution
Added validation to detect and raise an error when tuning output is empty
Added return code checking for subprocess commands with early failure detection

Copilot · 2025-12-01T14:15:02Z

mlir/utils/performance/tuningRunner.py

+
+            # Wait for both processes to finish.
+            tuning_loop_stdout, _ = tuning_loop.communicate()
+            kernel_gen.communicate()


The kernel_gen.stdout is closed before waiting for kernel_gen to complete. Line 204 was previously 'kernel_gen.stdout.close()' but is now removed. Without closing kernel_gen.stdout before calling communicate(), tuning_loop may not receive EOF and could hang. The original close() call should be restored before the communicate() calls.

@pabloantoniom

I might be wrong, but I think this is incorrect. As far as I understand, communicate() reads all data from stdout/stderr until EOF. So if we call close() before communicate(), it would actually fail, because there is nothing to read (the file descriptor was already closed).
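For reference, the shell-pipeline example in the Python subprocess documentation closes the parent's copy of the first process's stdout right after starting the second process, so that the first process can receive SIGPIPE if the second exits early. The sketch below follows that pattern and then checks both return codes. It is only an illustration under stated assumptions: run_pipeline, producer_cmd and consumer_cmd are hypothetical names, not code from tuningRunner.py, and the upstream process is reaped with wait() so the parent never reads from a handle it has already closed.

```python
import subprocess
import sys


def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer_cmd | consumer_cmd` and return the consumer's stdout.

    Hypothetical helper for illustration; raises RuntimeError if either
    process exits with a non-zero return code.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd,
                                stdin=producer.stdout,
                                stdout=subprocess.PIPE,
                                text=True)
    # Drop the parent's copy of the pipe's read end. The consumer still
    # holds its own copy, and the producer can now get SIGPIPE if the
    # consumer exits early.
    producer.stdout.close()

    # Read everything the consumer prints, then reap the producer.
    consumer_stdout, _ = consumer.communicate()
    producer.wait()

    if producer.returncode != 0:
        raise RuntimeError(f"producer command failed: {producer_cmd}")
    if consumer.returncode != 0:
        raise RuntimeError(f"consumer command failed: {consumer_cmd}")
    return consumer_stdout
```

Whether the parent's handle is closed or left open, the return-code checks stay the same. Only the way the upstream process is reaped changes.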

Copilot · 2025-12-01T14:15:03Z

mlir/utils/performance/tuningRunner.py

                    f"Tested {i} configs, best perf {max_tflops} TFlops {min_ns} ns on perf_config {winning_config}",
                    file=sys.stderr)

+    if (num_tuning_outputs == 0):


[nitpick] Unnecessary parentheses around the condition. Python style guidelines recommend removing parentheses from simple if conditions.

Suggested change

if (num_tuning_outputs == 0):

if num_tuning_outputs == 0:

Copilot · 2025-12-01T14:15:03Z

mlir/utils/performance/tuningRunner.py

+    num_tuning_outputs = 0
+    for i, result in enumerate(tuning_output.splitlines()):
+        num_tuning_outputs += 1


The manual counter num_tuning_outputs is redundant since enumerate already provides the count. After the loop completes, 'i + 1' would give the total count (or check if the loop never executed). Consider using 'len(tuning_output.splitlines())' before the loop or checking if the loop executed by tracking whether winning_config was updated.
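A minimal sketch of the "check before the loop" variant suggested here, assuming the tuning output is a string with one result per line. The tab-separated "perf_config, nanoseconds" line format and the pick_winner name are assumptions made for illustration and may not match the real tuning driver output. Note that relying on 'i + 1' after the loop is fragile, because 'i' is never bound when the loop body does not run.

```python
import math


def pick_winner(tuning_output: str):
    """Return (winning_config, min_ns) from hypothetical tuning output.

    Each line is assumed to look like "<perf_config>\t<nanoseconds>".
    """
    lines = tuning_output.splitlines()
    # Fail early instead of keeping a manual counter inside the loop.
    if not lines:
        raise RuntimeError('tuning output is empty')

    min_ns = math.inf
    winning_config = "None"
    for line in lines:
        perf_config, ns_text = line.split('\t')
        ns = float(ns_text)
        if ns < min_ns:
            min_ns = ns
            winning_config = perf_config
    return winning_config, min_ns
```

Splitting once and reusing the list avoids both the extra counter and a second call to splitlines().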

mirza-halilcevic · 2025-12-02T07:32:35Z

These changes might not be needed as #2025 already improves on error handling. If necessary, we can make this PR target #2025 instead of develop to avoid difficult conflicts.

pabloantoniom · 2025-12-02T15:17:56Z

> These changes might not be needed as #2025 already improves on error handling. If necessary, we can make this PR target #2025 instead of develop to avoid difficult conflicts.

As far as I can see, the error checks in this PR are different from those in #2025, but if you want we can wait until yours is merged and then merge this one.

dhernandez0 · 2025-12-03T10:01:11Z

mlir/utils/performance/tuningRunner.py

        ] + mlir_cpu_runner_args

    if options.debug:
+        print('Running commands:', file=sys.stderr)


this is only in verify_kernel_with_perfconfig(), what if verify="none"?

If verify="none" then we don't verify, so there is no command to print

There are some commands to print: the tuning commands, rocmlir-gen and tuning-driver.

I find myself adding a print for those when something fails in tuning changes, it'd be nice to have them if -debug is passed

Ah, you mean in tune_mlir_kernels right? I can add additional prints there, I end up doing the same, adding prints there if something fails
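One way to avoid scattering ad hoc prints is a small helper that every call site uses before launching a command. This is a sketch only: log_command and its stream parameter are hypothetical names, and the debug argument stands in for whatever the script's options.debug flag holds. shlex.join (Python 3.8+) quotes the arguments so that the printed line can be pasted back into a shell.

```python
import shlex
import sys


def log_command(command, debug, stream=None):
    """Print a command line to stderr (or `stream`) when debugging is on.

    `command` is a list of arguments, as passed to subprocess.Popen.
    """
    if not debug:
        return
    stream = sys.stderr if stream is None else stream
    print(f"Running: {shlex.join(command)}", file=stream)
```

Calling it for both the rocmlir-gen and the tuning-driver invocations would cover the tuning path as well as the verification path.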

dhernandez0 · 2025-12-03T10:01:56Z

mlir/utils/performance/tuningRunner.py

    min_ns = np.inf
    winning_config = "None"
-    for i, result in enumerate(tuning_output):
+    num_tuning_outputs = 0


just fail here if "len(tuning_output.splitlines())==0"?

I'm not sure why but I tried that and it didn't work

dhernandez0 · 2025-12-03T10:03:17Z

mlir/utils/performance/tuningRunner.py


+    if (num_tuning_outputs == 0):
+        raise RuntimeError('tuning output is empty')
    return winning_config, max_tflops


I'd prefer to print the command lines here, instead of the verify function

dhernandez0 · 2025-12-03T10:06:38Z

mlir/utils/performance/tuningRunner.py

+            if kernel_gen.returncode != 0:
+                raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}')
+            if tuning_loop.returncode != 0:
+                raise RuntimeError(f'rocmlir-tuning-driver command failed: {paths.mlir_paths.rocmlir_tuning_driver_path} {tuning_driver_args}')


if we use exceptions, can we add a catch somewhere to print everything?

My goal with this is to print just that error message, what else is there to print?

I think letting RuntimeError propagate without a catch generates ugly messages. I'd prefer something like "error: ..." but consider this a nit.

With the exception it prints the backtrace and a good error. This is what I get:

Traceback (most recent call last):
  File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 531, in <module>
    sys.exit(main())
  File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 494, in main
    winners, all_data = tune_mlir_kernels(configs, conf_class, paths, options)
  File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 216, in tune_mlir_kernels
    raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}')
RuntimeError: rocmlir-gen command failed: /home/pamartin/rocMLIR/build/bin/rocmlir-gen -operation gemm -t f16 -out_datatype f16 --arch gfx950:sramecc+:xnack- --num_cu 256 -g 1 -m 1 -k 768 -n 768 -transA=False -transB=False --kernel-repeats 10 --perf_config= --device=0
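For comparison, the "error: ..." style mentioned in this thread could be obtained with a thin wrapper around the entry point. This is a hedged sketch, not what the PR does: run_cli is a hypothetical name, and main is assumed to be a zero-argument callable that returns an exit status or None. The trade-off is that the backtrace shown above is no longer printed.

```python
import sys


def run_cli(main, stream=None):
    """Call `main()` and turn a RuntimeError into a one-line error message.

    Returns the exit status to hand to sys.exit().
    """
    stream = sys.stderr if stream is None else stream
    try:
        # Treat a `None` return value as success.
        return main() or 0
    except RuntimeError as exc:
        print(f"error: {exc}", file=stream)
        return 1
```

It would be used as sys.exit(run_cli(main)) in place of sys.exit(main()).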

pabloantoniom added 3 commits December 1, 2025 11:38

Print all cmds, not only one (?)

152ca4c

Abort if tuning loop is empty

4fd447f

Check return codes

a29c017

pabloantoniom requested a review from causten as a code owner December 1, 2025 11:43

pabloantoniom requested review from dhernandez0, dorde-antic, justinrosner, mirza-halilcevic and umangyadav December 1, 2025 11:43

pabloantoniom mentioned this pull request Dec 1, 2025

Greedy tuning #2131

Merged

1 task

umangyadav approved these changes Dec 1, 2025

umangyadav requested a review from Copilot December 1, 2025 14:14

Copilot AI reviewed Dec 1, 2025

dorde-antic approved these changes Dec 1, 2025

dhernandez0 reviewed Dec 3, 2025

Add more debug prints

e6528b2

dhernandez0 approved these changes Dec 3, 2025

Merge branch 'develop' into pablo-2157

4d7845b

justinrosner approved these changes Dec 3, 2025

dhernandez0 and others added 4 commits December 4, 2025 09:59

Merge branch 'develop' into pablo-2157

dd19cc4

Merge branch 'develop' into pablo-2157

2130ad3

Merge branch 'develop' into pablo-2157

a4b72a2

Merge branch 'develop' into pablo-2157

ee15f35

7 participants
