-
Notifications
You must be signed in to change notification settings - Fork 50
Check command errors on tunningRunner (and other nits) #2138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from 3 commits
152ca4c
4fd447f
a29c017
e6528b2
4d7845b
dd19cc4
2130ad3
a4b72a2
ee15f35
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -74,7 +74,10 @@ def verify_kernel_with_perfconfig(perfconfig, config, paths: Paths, options: Opt | |||||
| ] + mlir_cpu_runner_args | ||||||
|
|
||||||
| if options.debug: | ||||||
| print('Running commands:', file=sys.stderr) | ||||||
| print(rocmlir_gen_command, file=sys.stderr) | ||||||
| print(rocmlir_driver_command, file=sys.stderr) | ||||||
| print(profiler_command, file=sys.stderr) | ||||||
|
|
||||||
| prevdir = os.getcwd() | ||||||
| with tempfile.TemporaryDirectory() as tmpdir: | ||||||
|
|
@@ -123,7 +126,9 @@ def get_winning_config(tuning_output, test_vector, config, all_data, paths: Path | |||||
| max_tflops = -np.inf | ||||||
| min_ns = np.inf | ||||||
| winning_config = "None" | ||||||
| for i, result in enumerate(tuning_output): | ||||||
| num_tuning_outputs = 0 | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just fail here if "len(tuning_output.splitlines())==0"?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure why but I tried that and it didn't work |
||||||
| for i, result in enumerate(tuning_output.splitlines()): | ||||||
| num_tuning_outputs += 1 | ||||||
|
Comment on lines
+129
to
+131
|
||||||
| result = result.decode('utf-8').strip() | ||||||
| if not options.quiet and not options.compact_print and i > 0 and i % 100 == 0: | ||||||
| print( | ||||||
|
|
@@ -165,6 +170,8 @@ def get_winning_config(tuning_output, test_vector, config, all_data, paths: Path | |||||
| f"Tested {i} configs, best perf {max_tflops} TFlops {min_ns} ns on perf_config {winning_config}", | ||||||
| file=sys.stderr) | ||||||
|
|
||||||
| if (num_tuning_outputs == 0): | ||||||
|
||||||
| if (num_tuning_outputs == 0): | |
| if num_tuning_outputs == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to print the command lines here, instead of the verify function
Copilot
AI
Dec 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel_gen.stdout is closed before waiting for kernel_gen to complete. Line 204 was previously 'kernel_gen.stdout.close()' but is now removed. Without closing kernel_gen.stdout before calling communicate(), tuning_loop may not receive EOF and could hang. The original close() call should be restored before the communicate() calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong but I think this incorrect. As far as I understand, communicate will read all data from stdout/stderr until EOF. So if we add the close before the communicate it would actually fail, because there is nothing to read (file descriptor was closed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use exceptions, can we add a catch somewhere to print everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal with this is to print just that error message, what else is there to print?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think letting RuntimeError without a catch generates ugly messages. I'd prefer something like "error: ..." but consider this a nit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the exception it prints the backtrace and a good error. This is what I get:
Traceback (most recent call last):
File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 531, in <module>
sys.exit(main())
File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 494, in main
winners, all_data = tune_mlir_kernels(configs, conf_class, paths, options)
File "/home/pamartin/rocMLIR/build/./bin/tuningRunner.py", line 216, in tune_mlir_kernels
raise RuntimeError(f'rocmlir-gen command failed: {kernel_gen_command}')
RuntimeError: rocmlir-gen command failed: /home/pamartin/rocMLIR/build/bin/rocmlir-gen -operation gemm -t f16 -out_datatype f16 --arch gfx950:sramecc+:xnack- --num_cu 256 -g 1 -m 1 -k 768 -n 768 -transA=False -transB=False --kernel-repeats 10 --perf_config= --device=0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only in verify_kernel_with_perfconfig(), what if verify="none"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If verify="none" then we don't verify, so there is no command to print
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are some commands to print, the tuning commands. rocmlir-gen and tuning-driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find myself adding a print for those when something fails in tuning changes, it'd be nice to have them if -debug is passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you mean in
tune_mlir_kernelsright? I can add additional prints there, I end up doing the same, adding prints there if something failsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done