-
Notifications
You must be signed in to change notification settings - Fork 50
Check command errors on tunningRunner (and other nits) #2138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds error handling and debugging improvements to the tuning runner to prevent silent failures and aid debugging. The changes ensure that subprocess command failures are detected and reported, rather than being silently ignored.
Key Changes
- Added debug logging to print all subprocess commands before execution
- Added validation to detect and raise an error when tuning output is empty
- Added return code checking for subprocess commands with early failure detection
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Wait for both processes to finish. | ||
| tuning_loop_stdout, _ = tuning_loop.communicate() | ||
| kernel_gen.communicate() |
Copilot
AI
Dec 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel_gen.stdout is closed before waiting for kernel_gen to complete. Line 204 was previously 'kernel_gen.stdout.close()' but is now removed. Without closing kernel_gen.stdout before calling communicate(), tuning_loop may not receive EOF and could hang. The original close() call should be restored before the communicate() calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong but I think this incorrect. As far as I understand, communicate will read all data from stdout/stderr until EOF. So if we add the close before the communicate it would actually fail, because there is nothing to read (file descriptor was closed)
| f"Tested {i} configs, best perf {max_tflops} TFlops {min_ns} ns on perf_config {winning_config}", | ||
| file=sys.stderr) | ||
|
|
||
| if (num_tuning_outputs == 0): |
Copilot
AI
Dec 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Unnecessary parentheses around the condition. Python style guidelines recommend removing parentheses from simple if conditions.
| if (num_tuning_outputs == 0): | |
| if num_tuning_outputs == 0: |
| num_tuning_outputs = 0 | ||
| for i, result in enumerate(tuning_output.splitlines()): | ||
| num_tuning_outputs += 1 |
Copilot
AI
Dec 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The manual counter num_tuning_outputs is redundant since enumerate already provides the count. After the loop completes, 'i + 1' would give the total count (or check if the loop never executed). Consider using 'len(tuning_output.splitlines())' before the loop or checking if the loop executed by tracking whether winning_config was updated.
As far as I can see the error checking on this PR are different compared to #2025, but if you want we can wait until yours is merged and merge this one |
| ] + mlir_cpu_runner_args | ||
|
|
||
| if options.debug: | ||
| print('Running commands:', file=sys.stderr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only in verify_kernel_with_perfconfig(), what if verify="none"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If verify="none" then we don't verify, so there is no command to print
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are some commands to print, the tuning commands. rocmlir-gen and tuning-driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find myself adding a print for those when something fails in tuning changes, it'd be nice to have them if -debug is passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you mean in tune_mlir_kernels right? I can add additional prints there, I end up doing the same, adding prints there if something fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| min_ns = np.inf | ||
| winning_config = "None" | ||
| for i, result in enumerate(tuning_output): | ||
| num_tuning_outputs = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just fail here if "len(tuning_output.splitlines())==0"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why but I tried that and it didn't work
|
|
||
| if (num_tuning_outputs == 0): | ||
| raise RuntimeError('tuning output is empty') | ||
| return winning_config, max_tflops |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to print the command lines here, instead of the verify function
| if kernel_gen.returncode != 0: | ||
| raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}') | ||
| if tuning_loop.returncode != 0: | ||
| raise RuntimeError(f'rocmlir-tuning-driver command failed: {paths.mlir_paths.rocmlir_tuning_driver_path} {tuning_driver_args}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use exceptions, can we add a catch somewhere to print everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal with this is to print just that error message, what else is there to print?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think letting RuntimeError without a catch generates ugly messages. I'd prefer something like "error: ..." but consider this a nit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the exception it prints the backtrace and a good error. This is what I get:
Traceback (most recent call last):
File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 531, in <module>
sys.exit(main())
File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 494, in main
winners, all_data = tune_mlir_kernels(configs, conf_class, paths, options)
File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 216, in tune_mlir_kernels
raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}')
RuntimeError: rocmlir-gen command failed: /home/pamartin/rocMLIR/build/bin/rocmlir-gen -operation gemm -t f16 -out_datatype f16 --arch gfx950:sramecc+:xnack- --num_cu 256 -g 1 -m 1 -k 768 -n 768 -transA=False -transB=False --kernel-repeats 10 --perf_config= --device=0
Motivation
If any command we run with subprocess.Popen fails, we will ignore the error and keep going. This makes debugging unnecessarily hard.
Technical Details
This PR adds 3 small improvements:
Test Plan
No new test was added
Test Result
Test pass
Submission Checklist