
Conversation

@dhernandez0 (Contributor) commented Dec 3, 2025

Motivation

Add an extra greedy phase to tune waves_per_eu, output_swizzle and grid_group_size.
The heuristic remains the default for the other tuning kinds (and for full/exhaustive tuning): waves_per_eu=0 and grid_group_size=0 mean "use the heuristic".
For output_swizzle: 0 -> disabled, 1 -> enabled, 2 -> heuristic.

Technical Details

waves_per_eu is a backend parameter that tells the compiler to restrict the number of registers it uses so that the requested number of waves can run concurrently on the same EU.
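
As a rough illustration (a minimal sketch, not the actual RockToGPU.cpp code; the helper name is made up, but rocdl.waves_per_eu is the function attribute quoted in the review below):

```cpp
#include <cstdint>

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

// Sketch: attach a tuned waves_per_eu value to the GPU kernel so the backend
// register allocator caps register usage accordingly; 0 keeps the existing heuristic.
void applyWavesPerEU(mlir::gpu::GPUFuncOp gpuFunc, int64_t wavesPerEU,
                     mlir::OpBuilder &b) {
  if (wavesPerEU > 0)
    gpuFunc->setAttr("rocdl.waves_per_eu",
                     b.getI32IntegerAttr(static_cast<int32_t>(wavesPerEU)));
}
```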

output_swizzle is a flag to enable/disable the output swizzle pass. This pass stages the output in LDS and reloads it so that writes to global memory can be done as 128-bit stores.
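
The three-way control described above looks roughly like this (a hedged sketch mirroring the attribute read quoted from OutputSwizzle.cpp further down; the exact surrounding code differs):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinAttributes.h"

// Sketch: decode the output_swizzle tuning value from the function attribute.
// 0 -> force-disable the pass, 1 -> force-enable it, 2 -> let the heuristic decide.
void decodeOutputSwizzle(mlir::func::FuncOp func, bool &useHeuristic,
                         bool &enableOutputSwizzle) {
  useHeuristic = true; // default when the attribute is missing
  enableOutputSwizzle = false;
  if (auto attr = func->getAttrOfType<mlir::IntegerAttr>("output_swizzle")) {
    useHeuristic = attr.getInt() == 2;
    enableOutputSwizzle = attr.getInt() == 1;
  }
}
```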

grid_group_size is the workgroup group size we use for L2-cache-friendly block ordering (see https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations for an explanation of this optimization).
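
grid_group_size uses the same trick as GROUP_SIZE_M in the Triton tutorial above: consecutive block ids are folded into groups of tile rows so that blocks running at the same time reuse the same tiles from L2. Here is a standalone sketch of the index math (illustrative only; the function name and signature are not from GridLayoutEmitter.cpp):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Map a linear block id to (m, n) tile coordinates using grouped ordering.
// gridM/gridN are the number of output tiles along M and N; groupSize is grid_group_size.
std::pair<int64_t, int64_t> groupedBlockId(int64_t bid, int64_t gridM,
                                           int64_t gridN, int64_t groupSize) {
  if (groupSize <= 0) // 0 means "use the heuristic"; fall back to row-major here
    return {bid / gridN, bid % gridN};
  int64_t blocksPerGroup = groupSize * gridN;
  int64_t groupId = bid / blocksPerGroup;
  int64_t firstM = groupId * groupSize;
  int64_t groupM = std::min(gridM - firstM, groupSize); // the last group may be ragged
  int64_t m = firstM + (bid % blocksPerGroup) % groupM;
  int64_t n = (bid % blocksPerGroup) / groupM;
  return {m, n};
}
```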

Changes in this PR:

  • Add waves_per_eu and grid_group_size to tuning params
  • Add code to use these new params in RockToGPU.cpp, GridwiseGemmToBlockwise.cpp and OutputSwizzle.cpp
  • Changes in tuning code: RockTuningImpl.cpp
  • New tests and modifications to make existing tests work
  • Update attentionSweeps.py to use new params
  • Update parameterSweeps.py to use default heuristics (otherwise it would be slower)

Perf results

https://github.com/ROCm/rocMLIR-internal/issues/2161#issuecomment-3606635921

Test Plan

Added tests for passes modified by this PR.

Test Result

All tests pass.

Submission Checklist

closes https://github.com/ROCm/rocMLIR-internal/issues/2161

@dhernandez0 dhernandez0 requested a review from causten as a code owner December 3, 2025 09:02
@dhernandez0 dhernandez0 changed the base branch from develop to nonpoweroftwo_tiles December 3, 2025 09:06
@dhernandez0 dhernandez0 force-pushed the greedy_third_iteration branch from ee9aad1 to 31dc230 Compare December 3, 2025 09:07

 bool checkVersionFormat(const std::string &s) {
-  const int32_t maxNumTokensArray[] = {0, 8, 9, 11, 12};
+  const int32_t maxNumTokensArray[] = {0, 8, 9, 11, 14};
Contributor Author (dhernandez0):

since we haven't done a release since v4 (I think), is it ok to keep the same version?

Base automatically changed from nonpoweroftwo_tiles to develop December 3, 2025 12:05
@dhernandez0 dhernandez0 force-pushed the greedy_third_iteration branch from 16e1449 to af83be3 Compare December 3, 2025 12:06
@dhernandez0 dhernandez0 changed the title [DRAFT] Greedy third iteration Greedy third iteration Dec 3, 2025
@dhernandez0 dhernandez0 self-assigned this Dec 3, 2025
Copilot finished reviewing on behalf of dhernandez0 December 3, 2025 12:26
Copilot AI left a comment

Pull request overview

This PR adds support for tuning waves_per_eu, output_swizzle, and grid_group_size parameters as part of a third greedy iteration in the tuning process. The changes extend the V4 performance configuration format to include these parameters and update the codebase to propagate them through the compilation pipeline.

Key changes include:

  • Extension of V4 perf config format from 12 to 14 parameters (adding wavesPerEU and gridGroupSize)
  • Updates to Python test utilities to use heuristic defaults for new parameters
  • Systematic updates to all MLIR test files to include the new parameters in mfma_gemm_params and wmma_gemm_params
  • Implementation changes to read and use the new tuning parameters in passes

Reviewed changes

Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.

Summary per file:

  • parameterSweeps.py: Added heuristic defaults (output_swizzle=2, waves_per_eu=0, grid_group_size=0) to test configurations
  • attentionSweeps.py: Extended parameter sweep ranges to include output_swizzle and wavesPerEU options
  • InitParamsAccelTests.cpp: Updated V4 config strings and added test assertions for new parameters
  • rocmlir-gen.cpp: Updated default perfConfig comment to include new parameters
  • Multiple .mlir test files: Systematically added wavesPerEU = 0, gridGroupSize = 0 to all gemm params
  • GridwiseGemmParams.cpp: Added parameter passing through getGemmParamsAttr functions
  • OutputSwizzle.cpp: Implemented tuning-based control of the output swizzle pass
  • GridwiseGemmToBlockwise.cpp: Added function attributes for waves_per_eu and output_swizzle
  • GridLayoutEmitter.cpp: Implemented grid_group_size tuning parameter usage
  • AffixTuningParameters.cpp: Updated default perfConfig string
  • RockDialect.cpp: Extended V3 attention config to support the wavesPerEU parameter
  • Serializable.h: Updated max token count for the V4 config format
  • waves_per_eu.mlir: New test file for waves_per_eu functionality
  • lowering_output_swizzle_tuning.mlir: New test file for output swizzle tuning


@pabloantoniom (Contributor) left a comment

  • With this PR, do we lose the ability to run only the greedy 1st and 2nd iterations? If so, I think it would be useful to add an option or flag to select which greedy iterations we run. We always want the 1st, but for the 2nd/3rd it would be better if the user could choose.
  • Now that we have even more parameters in the perfConfig, it's becoming very difficult to tell which parameter is which when you see a perfConfig, e.g. v3:32,32,32,32,32,32,16,1,1,1,2,0,1. Could we change the textual representation to something more readable? Something like v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1], where the first group would correspond to greedy iteration 1, the 2nd group to the 2nd iteration, and the 3rd group to the 3rd iteration.

>,
InterfaceMethod<
/*desc=*/[{
Return param passed to the backend compiler: waves_per_eu.
Contributor:

This description is not very clear

Contributor Author (dhernandez0):

rewritten

"int64_t":$nPerWave, "int64_t":$mnPerXdl, "int64_t":$kpack,
"int64_t":$splitKFactor, "int64_t":$scheduleVersion,
"int64_t":$outputSwizzle, "bool":$forceUnroll);
"int64_t":$outputSwizzle, "int64_t":$wavesPerEU, "bool":$forceUnroll);
Contributor:

Why not gridGroupSize?

Contributor Author (dhernandez0):

gridGroupSize is only used for GEMMs/convs, not attention/g+g kernels.

f(self.gemmScheduleVersion);
f(self.outputSwizzle);
}
if (self.version >= Version::V4) {
Contributor:

Why add this here instead of adding it to the previous if (self.version >= Version::V4) check?

Contributor Author (dhernandez0):

because we want to visit the params in the right order.

Contributor Author (dhernandez0):

see serialize() in Serializable.h
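
To illustrate why the visit order matters (a self-contained sketch, not the actual Serializable.h code; the field set besides wavesPerEU/gridGroupSize is hypothetical):

```cpp
#include <cstdint>

// With a visitor-based serializer, the order of the f(...) calls is the textual
// order of the perfConfig, so a field that only exists in newer versions must be
// visited at the position it occupies in the string, even if an identical version
// check already exists earlier; the two checks cannot simply be merged.
struct PerfConfigSketch {
  enum class Version { V3 = 3, V4 = 4 };
  Version version = Version::V4;
  int64_t earlierV4Field = 0;                // hypothetical field early in the string
  int64_t gemmScheduleVersion = 1, outputSwizzle = 2;
  int64_t wavesPerEU = 0, gridGroupSize = 0; // the fields added by this PR

  template <typename F> void visit(F f) const {
    if (version >= Version::V4)
      f(earlierV4Field);
    f(gemmScheduleVersion);
    f(outputSwizzle);
    if (version >= Version::V4) { // second V4 check, forced by the textual order
      f(wavesPerEU);
      f(gridGroupSize);
    }
  }
};
```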

gridSize = cast<IntegerAttr>(gridSizeAttr).getInt();
gpuFunc.setKnownGridSizeAttr(b.getDenseI32ArrayAttr({gridSize, 1, 1}));

auto wavesPerEUAttr = theFunc->getAttr("waves_per_eu");
Contributor:

Why do we get waves_per_eu, and not group_size?

Contributor Author (dhernandez0):

Because of the order of the passes: both output_swizzle and grid_group_size have already been used by this point.

 llvm_unreachable("Unknown version of the perfConfig");
 }
-SmallVector<StringRef, 11> tokens;
+SmallVector<StringRef, 13> tokens;
Contributor:

Can't we use expectedNumTokens?

Contributor Author (dhernandez0):

expectedNumTokens is a variable, so it wouldn't work here. We could change this to:

SmallVector<StringRef> tokens;
tokens.reserve(expectedNumTokens);

Happy to change it to this if you prefer it.

Contributor:

Yes, I think it would look cleaner if we move SmallVector<int64_t, 13> params here and do:

SmallVector<StringRef> tokens;
SmallVector<int64_t> params;
tokens.reserve(expectedNumTokens);
params.reserve(expectedNumTokens);

IntegerAttr outputSwizzleAttr =
rewriter.getI64IntegerAttr(gemm0TuningParams.getOutputSwizzle());
func::FuncOp funcOp = cast<func::FuncOp>(op->getParentOp());
funcOp->setAttr("waves_per_eu", wavesPerEUAttr);
Contributor:

Unrelated to this PR, but could we define the attribute name strings (like waves_per_eu, output_swizzle and many others) in RockAttrDefs and use them instead of hardcoded strings? That would be more robust and cleaner.

Contributor Author (dhernandez0):

Yes, good point, I'll do it.

Contributor Author (dhernandez0):

done
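
For context, the suggestion amounts to something like this (a hedged sketch; the names and location are illustrative, not the actual RockAttrDefs definitions):

```cpp
#include "llvm/ADT/StringRef.h"

// Declare the attribute names once and reuse them instead of hardcoded strings.
namespace mlir::rock {
constexpr llvm::StringLiteral kWavesPerEUAttrName = "waves_per_eu";
constexpr llvm::StringLiteral kOutputSwizzleAttrName = "output_swizzle";
} // namespace mlir::rock

// Usage: funcOp->setAttr(mlir::rock::kWavesPerEUAttrName, wavesPerEUAttr);
```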

gridGroupSize},
arch);

// some params are needed in future passes, add them to func as attributes
Contributor:

What future passes? Can you specify that in the comment?

Contributor Author (dhernandez0):

rewritten

bool useHeuristic = true;
bool enableOutputSwizzle = false;
if (func->hasAttrOfType<IntegerAttr>("output_swizzle")) {
// 0 -> disabled, 1 -> enabled, 2 -> heuristic
Contributor:

It makes sense to me that 0 is disabled. At the same time, I think this is inconsistent with the other parameters, where 0 means "use the heuristic".

Contributor Author (@dhernandez0), Dec 3, 2025:

I agree, but output_swizzle was not really introduced as a tuning param here (it was already in develop, just unused), so all existing perfConfigs set output_swizzle=2. Therefore, the most reasonable thing to do IMO (to maintain backwards compatibility) is to keep 2 as the heuristic value.

@dhernandez0 (Contributor Author) commented Dec 3, 2025

> With this PR do we lose the ability to run only greedy 1st and 2nd iteration? If so, I think it is useful to add an option or flag to set which greedy iterations we run. 1st we always want it, but 2nd/3rd, it would be better if the user can choose.

I can add the flag, but is it worth it? The 3rd phase adds only 72 perfConfigs to test, so it's not going to make things noticeably slower.

> Now that we have even more parameters in the perfConfig, to me it's becoming very difficult to differentiate which parameter is which when you see a perfconfig, .e.g: v3:32,32,32,32,32,32,16,1,1,1,2,0,1. Could we change the textual representation to something more readable? Like: v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1], where the first group would correspond to greedy iteration 1, the 2nd group to 2nd iteration and the 3rd group to 3rd iteration.

In practice, after affix_tuning you'll see:

#rock.mfma_gemm_params<kpackPerBlock = 8, mPerBlock = 128, nPerBlock = 128, kpack = 8, mPerWave = 64, nPerWave = 64, mnPerXdl = 32, splitKFactor = 1, scheduleVersion = 1, outputSwizzle = 2, wavesPerEU = 0, gridGroupSize = 0, forceUnroll = true>

I think that's easier to understand? We could have a tool (bin/print_perfconfig v3:...) that converts v3:... to the #rock.mfma_gemm_params format, which would be easier than generating the whole IR.

@dhernandez0 dhernandez0 force-pushed the greedy_third_iteration branch from 30c8aca to 8884d8b Compare December 3, 2025 20:03
gpuFunc->setAttr("rocdl.waves_per_eu", b.getI32IntegerAttr(wavesPerEU));
LLVM_DEBUG(llvm::dbgs() << "Setting waves_per_eu using tuning param\n");
// we are done
return;
Member:

Why is it returning from here? Doesn't it need to set some other attributes?

Contributor Author (@dhernandez0), Dec 4, 2025:

No, we are inside op.walk([](gpu::GPUFuncOp gpuFunc) { ... }), and only two attributes are set: rock.shared_buffer_size and rocdl.waves_per_eu. The code below is just the heuristic that sets rocdl.waves_per_eu.

Contributor:

I was also confused by this, but yes, it looks correct. However, to make it more readable, we could move all the code below into a function like "runWavesPerEUHeuristic"; that way it would be easier to understand why there is a return here.

Comment on lines 428 to 429
func.walk([&writes, &rewriter, ldsAllocated, useHeuristic,
enableOutputSwizzle](ThreadwiseWriteAllOp threadwiseWriteAll) {
Member (@umangyadav), Dec 3, 2025:

nit: it would be better to use an enum for outputSwizzle.

Here, useHeuristic == true could make someone think that output swizzle is enabled, but "enableOutputSwizzle" has a different meaning here.

Contributor Author (dhernandez0):

done
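
A sketch of the suggested enum (illustrative; the actual name and placement in the PR may differ):

```cpp
#include <cstdint>

// Encodes the three output_swizzle tuning states instead of a pair of bools.
enum class OutputSwizzleMode : int64_t { Disabled = 0, Enabled = 1, Heuristic = 2 };
```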

@umangyadav (Member) left a comment

Changes look good.

std::vector<std::vector<int64_t>> finetuningParams = {
{0, 1}, // outputSwizzle
wavesPerEUList, // wavesPerEU
{0, 4, 8, 16, 32, 64}}; // gridGroupSize
Member:

Is there a reason why it is a multiple of 4?

Contributor Author (dhernandez0):

Not really; this corresponds to GROUP_SIZE_M in Triton I think, and to WGM in HipKittens: https://github.com/HazyResearch/HipKittens/blob/7f6986b502396aa865c0c80625121daf7caa756d/kernels/gemm/bf16fp32/mi350x/256_256_64_32_with16x32.cpp#L60.

So I've seen 64, 8, 4... in other codebases, but this requires more investigation.

@dhernandez0 (Contributor Author) commented:

I got some results comparing this branch against exhaustive tuning on develop. Exhaustive wins in 86 cases (<0.95x) and this branch (greedy) wins in 68 cases (>1.05x). The min is 0.75x and the max 1.45x (greedy TFlops / exhaustive TFlops). So I guess that until we find a way to incorporate the new params into exhaustive, we want to run both exhaustive + greedy to determine the best perf.

@pabloantoniom (Contributor) left a comment

LGTM


 NUM_RANDOM_PERFCONFIGS_PER_TILE_SIZE, RND_SEED);
-} else {
+} else if (settings.iteration == 1) {
 // Second iteration: brute force with winning tile sizes
Contributor:

I would change the comments to:

// Second iteration: brute force (except waves_per_eu, output_swizzle and grid_group_size, which we hardcode to use the heuristic) with winning tile sizes
// Third iteration: brute force the remaining configs (waves_per_eu, output_swizzle and grid_group_size)

Contributor Author (dhernandez0):

done

@pabloantoniom (Contributor) commented:

> I think that's easier to understand? We could have a tool (bin/print_perfconfig v3:...) that converts v3:... to the #rock.mfma_gemm_params format, which would be easier than generating the whole IR.

Regarding 2, I meant that when we get the tuning output from tuningRunner.py, we see the perfConfig in plain text, e.g. v3:32,32,32,32,32,32,16,1,1,1,2,0,1. So maybe tuningRunner.py (and the rest of the scripts) should incorporate some function to convert that raw text representation into something more readable. To me something like v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1] is more readable because I can quickly see that the tile sizes are 32,32,32, the second group corresponds to the schedule version etc., and the last group corresponds to the 3 params tuned in the 3rd greedy iteration. It's just some visual help.

@dhernandez0 (Contributor Author) commented:

> Regarding 2, I meant that when we get the tuning output from tuningRunner.py, we see the perfConfig in plain text, e.g. v3:32,32,32,32,32,32,16,1,1,1,2,0,1.

sounds good to me, could you open a ticket about this?
