Skip to content

Conversation

@pabloantoniom
Copy link
Contributor

Motivation

If any command we run with subprocess.Popen fails, we will ignore the error and keep going. This makes debugging unnecessarily hard.

Technical Details

This PR adds 3 small improvements:

  • Print all commands that are about to be run (1st commit)
  • Abort if the tuning output is empty (2nd commit)
  • Check the return code from the commands we run and fail early if any of them fail (3rd commit)

Test Plan

No new test was added

Test Result

Test pass

Submission Checklist

@umangyadav umangyadav requested a review from Copilot December 1, 2025 14:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds error handling and debugging improvements to the tuning runner to prevent silent failures and aid debugging. The changes ensure that subprocess command failures are detected and reported, rather than being silently ignored.

Key Changes

  • Added debug logging to print all subprocess commands before execution
  • Added validation to detect and raise an error when tuning output is empty
  • Added return code checking for subprocess commands with early failure detection

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +204 to +207

# Wait for both processes to finish.
tuning_loop_stdout, _ = tuning_loop.communicate()
kernel_gen.communicate()
Copy link

Copilot AI Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kernel_gen.stdout is closed before waiting for kernel_gen to complete. Line 204 was previously 'kernel_gen.stdout.close()' but is now removed. Without closing kernel_gen.stdout before calling communicate(), tuning_loop may not receive EOF and could hang. The original close() call should be restored before the communicate() calls.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong but I think this incorrect. As far as I understand, communicate will read all data from stdout/stderr until EOF. So if we add the close before the communicate it would actually fail, because there is nothing to read (file descriptor was closed)

f"Tested {i} configs, best perf {max_tflops} TFlops {min_ns} ns on perf_config {winning_config}",
file=sys.stderr)

if (num_tuning_outputs == 0):
Copy link

Copilot AI Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Unnecessary parentheses around the condition. Python style guidelines recommend removing parentheses from simple if conditions.

Suggested change
if (num_tuning_outputs == 0):
if num_tuning_outputs == 0:

Copilot uses AI. Check for mistakes.
Comment on lines +129 to +131
num_tuning_outputs = 0
for i, result in enumerate(tuning_output.splitlines()):
num_tuning_outputs += 1
Copy link

Copilot AI Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manual counter num_tuning_outputs is redundant since enumerate already provides the count. After the loop completes, 'i + 1' would give the total count (or check if the loop never executed). Consider using 'len(tuning_output.splitlines())' before the loop or checking if the loop executed by tracking whether winning_config was updated.

Copilot uses AI. Check for mistakes.
@mirza-halilcevic
Copy link
Contributor

These changes might not be needed as #2025 already improves on error handling. If necessary, we can make this PR target #2025 instead of develop to avoid difficult conflicts.

@pabloantoniom
Copy link
Contributor Author

These changes might not be needed as #2025 already improves on error handling. If necessary, we can make this PR target #2025 instead of develop to avoid difficult conflicts.

As far as I can see the error checking on this PR are different compared to #2025, but if you want we can wait until yours is merged and merge this one

] + mlir_cpu_runner_args

if options.debug:
print('Running commands:', file=sys.stderr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only in verify_kernel_with_perfconfig(), what if verify="none"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If verify="none" then we don't verify, so there is no command to print

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some commands to print, the tuning commands. rocmlir-gen and tuning-driver

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find myself adding a print for those when something fails in tuning changes, it'd be nice to have them if -debug is passed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you mean in tune_mlir_kernels right? I can add additional prints there, I end up doing the same, adding prints there if something fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

min_ns = np.inf
winning_config = "None"
for i, result in enumerate(tuning_output):
num_tuning_outputs = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fail here if "len(tuning_output.splitlines())==0"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why but I tried that and it didn't work


if (num_tuning_outputs == 0):
raise RuntimeError('tuning output is empty')
return winning_config, max_tflops
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to print the command lines here, instead of the verify function

if kernel_gen.returncode != 0:
raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}')
if tuning_loop.returncode != 0:
raise RuntimeError(f'rocmlir-tuning-driver command failed: {paths.mlir_paths.rocmlir_tuning_driver_path} {tuning_driver_args}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use exceptions, can we add a catch somewhere to print everything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal with this is to print just that error message, what else is there to print?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think letting RuntimeError without a catch generates ugly messages. I'd prefer something like "error: ..." but consider this a nit.

Copy link
Contributor Author

@pabloantoniom pabloantoniom Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the exception it prints the backtrace and a good error. This is what I get:

Traceback (most recent call last):
  File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 531, in <module>
    sys.exit(main())
  File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 494, in main
    winners, all_data = tune_mlir_kernels(configs, conf_class, paths, options)
  File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 216, in tune_mlir_kernels
    raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}')
RuntimeError: rocmlir-gen command failed: /home/pamartin/rocMLIR/build/bin/rocmlir-gen -operation gemm -t f16 -out_datatype f16 --arch gfx950:sramecc+:xnack- --num_cu 256 -g 1 -m 1 -k 768 -n 768 -transA=False -transB=False --kernel-repeats 10 --perf_config= --device=0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants