
Conversation

@dhernandez0 (Contributor) commented Dec 3, 2025

Motivation

Add an extra greedy phase to tune waves_per_eu, output_swizzle and grid_group_size.
The heuristic remains the default for the other tuning kinds (and for full/exhaustive tuning): waves_per_eu=0 and grid_group_size=0 mean "use the heuristic".
For output_swizzle: 0 -> disabled, 1 -> enabled, 2 -> heuristic.

Technical Details

waves_per_eu is a backend parameter that tells the compiler to restrict the number of registers it uses so that the requested number of waves can run concurrently on the same EU.
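
As a rough illustration (a minimal sketch, not the actual RockToGPU.cpp code; the helper name is made up, but rocdl.waves_per_eu is the function attribute quoted in the review below):

```cpp
#include <cstdint>

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

// Sketch: attach a tuned waves_per_eu value to the GPU kernel so the backend
// register allocator caps register usage accordingly; 0 keeps the existing heuristic.
void applyWavesPerEU(mlir::gpu::GPUFuncOp gpuFunc, int64_t wavesPerEU,
                     mlir::OpBuilder &b) {
  if (wavesPerEU > 0)
    gpuFunc->setAttr("rocdl.waves_per_eu",
                     b.getI32IntegerAttr(static_cast<int32_t>(wavesPerEU)));
}
```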

output_swizzle is a flag to enable/disable the output swizzle pass. This pass stages the output in LDS and reloads it so that writes to global memory can be done as 128-bit stores.
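
The three-way control described above looks roughly like this (a hedged sketch mirroring the attribute read quoted from OutputSwizzle.cpp further down; the exact surrounding code differs):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinAttributes.h"

// Sketch: decode the output_swizzle tuning value from the function attribute.
// 0 -> force-disable the pass, 1 -> force-enable it, 2 -> let the heuristic decide.
void decodeOutputSwizzle(mlir::func::FuncOp func, bool &useHeuristic,
                         bool &enableOutputSwizzle) {
  useHeuristic = true; // default when the attribute is missing
  enableOutputSwizzle = false;
  if (auto attr = func->getAttrOfType<mlir::IntegerAttr>("output_swizzle")) {
    useHeuristic = attr.getInt() == 2;
    enableOutputSwizzle = attr.getInt() == 1;
  }
}
```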

grid_group_size is the workgroup group size we use for L2-cache-friendly block ordering (see https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations for an explanation of this optimization).
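
grid_group_size uses the same trick as GROUP_SIZE_M in the Triton tutorial above: consecutive block ids are folded into groups of tile rows so that blocks running at the same time reuse the same tiles from L2. Here is a standalone sketch of the index math (illustrative only; the function name and signature are not from GridLayoutEmitter.cpp):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Map a linear block id to (m, n) tile coordinates using grouped ordering.
// gridM/gridN are the number of output tiles along M and N; groupSize is grid_group_size.
std::pair<int64_t, int64_t> groupedBlockId(int64_t bid, int64_t gridM,
                                           int64_t gridN, int64_t groupSize) {
  if (groupSize <= 0) // 0 means "use the heuristic"; fall back to row-major here
    return {bid / gridN, bid % gridN};
  int64_t blocksPerGroup = groupSize * gridN;
  int64_t groupId = bid / blocksPerGroup;
  int64_t firstM = groupId * groupSize;
  int64_t groupM = std::min(gridM - firstM, groupSize); // the last group may be ragged
  int64_t m = firstM + (bid % blocksPerGroup) % groupM;
  int64_t n = (bid % blocksPerGroup) / groupM;
  return {m, n};
}
```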

Changes in this PR:

  • Add waves_per_eu and grid_group_size to tuning params
  • Add code to use these new params in RockToGPU.cpp, GridwiseGemmToBlockwise.cpp and OutputSwizzle.cpp
  • Changes in tuning code: RockTuningImpl.cpp
  • New tests and modifications to make existing tests work
  • Update attentionSweeps.py to use new params
  • Update parameterSweeps.py to use default heuristics (otherwise it would be slower)

Perf results

https://github.com/ROCm/rocMLIR-internal/issues/2161#issuecomment-3606635921

Test Plan

Added tests for passes modified by this PR.

Test Result

All tests pass.

Submission Checklist

closes https://github.com/ROCm/rocMLIR-internal/issues/2161

@dhernandez0 dhernandez0 requested a review from causten as a code owner December 3, 2025 09:02
@dhernandez0 dhernandez0 changed the base branch from develop to nonpoweroftwo_tiles December 3, 2025 09:06
@dhernandez0 dhernandez0 force-pushed the greedy_third_iteration branch from ee9aad1 to 31dc230 Compare December 3, 2025 09:07

 bool checkVersionFormat(const std::string &s) {
-  const int32_t maxNumTokensArray[] = {0, 8, 9, 11, 12};
+  const int32_t maxNumTokensArray[] = {0, 8, 9, 11, 14};
Contributor Author (dhernandez0):

since we haven't done a release since v4 (I think), is it ok to keep the same version?

Base automatically changed from nonpoweroftwo_tiles to develop December 3, 2025 12:05
@dhernandez0 dhernandez0 force-pushed the greedy_third_iteration branch from 16e1449 to af83be3 Compare December 3, 2025 12:06
@dhernandez0 dhernandez0 changed the title [DRAFT] Greedy third iteration Greedy third iteration Dec 3, 2025
@dhernandez0 dhernandez0 self-assigned this Dec 3, 2025
Copilot finished reviewing on behalf of dhernandez0 December 3, 2025 12:26
Copilot AI left a comment

Pull request overview

This PR adds support for tuning waves_per_eu, output_swizzle, and grid_group_size parameters as part of a third greedy iteration in the tuning process. The changes extend the V4 performance configuration format to include these parameters and update the codebase to propagate them through the compilation pipeline.

Key changes include:

  • Extension of V4 perf config format from 12 to 14 parameters (adding wavesPerEU and gridGroupSize)
  • Updates to Python test utilities to use heuristic defaults for new parameters
  • Systematic updates to all MLIR test files to include the new parameters in mfma_gemm_params and wmma_gemm_params
  • Implementation changes to read and use the new tuning parameters in passes

Reviewed changes

Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.

Summary per file:

  • parameterSweeps.py: Added heuristic defaults (output_swizzle=2, waves_per_eu=0, grid_group_size=0) to test configurations
  • attentionSweeps.py: Extended parameter sweep ranges to include output_swizzle and wavesPerEU options
  • InitParamsAccelTests.cpp: Updated V4 config strings and added test assertions for new parameters
  • rocmlir-gen.cpp: Updated default perfConfig comment to include new parameters
  • Multiple .mlir test files: Systematically added wavesPerEU = 0, gridGroupSize = 0 to all gemm params
  • GridwiseGemmParams.cpp: Added parameter passing through getGemmParamsAttr functions
  • OutputSwizzle.cpp: Implemented tuning-based control of the output swizzle pass
  • GridwiseGemmToBlockwise.cpp: Added function attributes for waves_per_eu and output_swizzle
  • GridLayoutEmitter.cpp: Implemented grid_group_size tuning parameter usage
  • AffixTuningParameters.cpp: Updated default perfConfig string
  • RockDialect.cpp: Extended V3 attention config to support the wavesPerEU parameter
  • Serializable.h: Updated max token count for the V4 config format
  • waves_per_eu.mlir: New test file for waves_per_eu functionality
  • lowering_output_swizzle_tuning.mlir: New test file for output swizzle tuning


@pabloantoniom (Contributor) left a comment

  • With this PR, do we lose the ability to run only the greedy 1st and 2nd iterations? If so, I think it would be useful to add an option or flag to select which greedy iterations we run. We always want the 1st, but for the 2nd/3rd it would be better if the user could choose.
  • Now that we have even more parameters in the perfConfig, it's becoming very difficult to tell which parameter is which when you see a perfConfig, e.g. v3:32,32,32,32,32,32,16,1,1,1,2,0,1. Could we change the textual representation to something more readable? Something like v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1], where the first group would correspond to greedy iteration 1, the 2nd group to the 2nd iteration, and the 3rd group to the 3rd iteration.

>,
InterfaceMethod<
/*desc=*/[{
Return param passed to the backend compiler: waves_per_eu.
Contributor:

This description is not very clear

Contributor Author (dhernandez0):

rewritten

"int64_t":$nPerWave, "int64_t":$mnPerXdl, "int64_t":$kpack,
"int64_t":$splitKFactor, "int64_t":$scheduleVersion,
"int64_t":$outputSwizzle, "bool":$forceUnroll);
"int64_t":$outputSwizzle, "int64_t":$wavesPerEU, "bool":$forceUnroll);
Contributor:

Why not gridGroupSize?

Contributor Author (dhernandez0):

gridGroupSize is only used for GEMMs/convs, not attention/g+g kernels.

f(self.gemmScheduleVersion);
f(self.outputSwizzle);
}
if (self.version >= Version::V4) {
Contributor:

Why add this here instead of adding it to the previous if (self.version >= Version::V4) check?

Contributor Author (dhernandez0):

because we want to visit the params in the right order.

Contributor Author (dhernandez0):

see serialize() in Serializable.h
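
To illustrate why the visit order matters (a self-contained sketch, not the actual Serializable.h code; the field set besides wavesPerEU/gridGroupSize is hypothetical):

```cpp
#include <cstdint>

// With a visitor-based serializer, the order of the f(...) calls is the textual
// order of the perfConfig, so a field that only exists in newer versions must be
// visited at the position it occupies in the string, even if an identical version
// check already exists earlier; the two checks cannot simply be merged.
struct PerfConfigSketch {
  enum class Version { V3 = 3, V4 = 4 };
  Version version = Version::V4;
  int64_t earlierV4Field = 0;                // hypothetical field early in the string
  int64_t gemmScheduleVersion = 1, outputSwizzle = 2;
  int64_t wavesPerEU = 0, gridGroupSize = 0; // the fields added by this PR

  template <typename F> void visit(F f) const {
    if (version >= Version::V4)
      f(earlierV4Field);
    f(gemmScheduleVersion);
    f(outputSwizzle);
    if (version >= Version::V4) { // second V4 check, forced by the textual order
      f(wavesPerEU);
      f(gridGroupSize);
    }
  }
};
```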

gridSize = cast<IntegerAttr>(gridSizeAttr).getInt();
gpuFunc.setKnownGridSizeAttr(b.getDenseI32ArrayAttr({gridSize, 1, 1}));

auto wavesPerEUAttr = theFunc->getAttr("waves_per_eu");
Contributor:

Why do we get waves_per_eu, and not group_size?

Contributor Author (dhernandez0):

Because of the order of the passes: both output_swizzle and grid_group_size have already been used by this point.

 llvm_unreachable("Unknown version of the perfConfig");
 }
-SmallVector<StringRef, 11> tokens;
+SmallVector<StringRef, 13> tokens;
Contributor:

Can't we use expectedNumTokens?

Contributor Author (dhernandez0):

expectedNumTokens is a variable, so it wouldn't work here. We could change this to:

SmallVector<StringRef> tokens;
tokens.reserve(expectedNumTokens);

Happy to change it to this if you prefer it.

Contributor:

Yes, I think it would look cleaner if we move SmallVector<int64_t, 13> params here and do:

SmallVector<StringRef> tokens;
SmallVector<int64_t> params;
tokens.reserve(expectedNumTokens);
params.reserve(expectedNumTokens);

IntegerAttr outputSwizzleAttr =
rewriter.getI64IntegerAttr(gemm0TuningParams.getOutputSwizzle());
func::FuncOp funcOp = cast<func::FuncOp>(op->getParentOp());
funcOp->setAttr("waves_per_eu", wavesPerEUAttr);
Contributor:

Unrelated to this PR, but could we define the attribute name strings (like waves_per_eu, output_swizzle and many others) in RockAttrDefs and use them instead of hardcoded strings? That would be more robust and cleaner.

Contributor Author (dhernandez0):

Yes, good point, I'll do it.

Contributor Author (dhernandez0):

done
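
For context, the suggestion amounts to something like this (a hedged sketch; the names and location are illustrative, not the actual RockAttrDefs definitions):

```cpp
#include "llvm/ADT/StringRef.h"

// Declare the attribute names once and reuse them instead of hardcoded strings.
namespace mlir::rock {
constexpr llvm::StringLiteral kWavesPerEUAttrName = "waves_per_eu";
constexpr llvm::StringLiteral kOutputSwizzleAttrName = "output_swizzle";
} // namespace mlir::rock

// Usage: funcOp->setAttr(mlir::rock::kWavesPerEUAttrName, wavesPerEUAttr);
```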

gridGroupSize},
arch);

// some params are needed in future passes, add them to func as attributes
Contributor:

What future passes? Can you specify that in the comment?

Contributor Author (dhernandez0):

rewritten

bool useHeuristic = true;
bool enableOutputSwizzle = false;
if (func->hasAttrOfType<IntegerAttr>("output_swizzle")) {
// 0 -> disabled, 1 -> enabled, 2 -> heuristic
Contributor:

It makes sense to me that 0 is disabled. At the same time, I think this is inconsistent with the other parameters, where 0 means "use the heuristic".

Contributor Author (@dhernandez0), Dec 3, 2025:

I agree, but output_swizzle was not really introduced as a tuning param here (it was already in develop, just unused), so all existing perfConfigs set output_swizzle=2. Therefore, the most reasonable thing to do IMO (to maintain backwards compatibility) is to keep 2 as the heuristic value.

@dhernandez0 (Contributor Author) commented Dec 3, 2025

> With this PR do we lose the ability to run only greedy 1st and 2nd iteration? If so, I think it is useful to add an option or flag to set which greedy iterations we run. 1st we always want it, but 2nd/3rd, it would be better if the user can choose.

I can add the flag, but is it worth it? The 3rd phase adds only 72 perfConfigs to test, so it's not going to make things noticeably slower.

> Now that we have even more parameters in the perfConfig, to me it's becoming very difficult to differentiate which parameter is which when you see a perfconfig, .e.g: v3:32,32,32,32,32,32,16,1,1,1,2,0,1. Could we change the textual representation to something more readable? Like: v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1], where the first group would correspond to greedy iteration 1, the 2nd group to 2nd iteration and the 3rd group to 3rd iteration.

In practice, after affix_tuning you'll see:

#rock.mfma_gemm_params<kpackPerBlock = 8, mPerBlock = 128, nPerBlock = 128, kpack = 8, mPerWave = 64, nPerWave = 64, mnPerXdl = 32, splitKFactor = 1, scheduleVersion = 1, outputSwizzle = 2, wavesPerEU = 0, gridGroupSize = 0, forceUnroll = true>

I think that's easier to understand? We could have a tool (bin/print_perfconfig v3:...) that converts v3:... to the #rock.mfma_gemm_params format, which would be easier than generating the whole IR.

@dhernandez0 dhernandez0 force-pushed the greedy_third_iteration branch from 30c8aca to 8884d8b Compare December 3, 2025 20:03
gpuFunc->setAttr("rocdl.waves_per_eu", b.getI32IntegerAttr(wavesPerEU));
LLVM_DEBUG(llvm::dbgs() << "Setting waves_per_eu using tuning param\n");
// we are done
return;
Member:

Why is it returning from here? Doesn't it need to set some other attributes?

Contributor Author (@dhernandez0), Dec 4, 2025:

No, we are inside op.walk([](gpu::GPUFuncOp gpuFunc) { ... }), and only two attributes are set: rock.shared_buffer_size and rocdl.waves_per_eu. The code below is just the heuristic that sets rocdl.waves_per_eu.

Contributor:

I was also confused by this, but yes, it looks correct. However, to make it more readable, we could move all the code below into a function like "runWavesPerEUHeuristic"; that way it would be easier to understand why there is a return here.

Comment on lines 428 to 429
func.walk([&writes, &rewriter, ldsAllocated, useHeuristic,
enableOutputSwizzle](ThreadwiseWriteAllOp threadwiseWriteAll) {
Member (@umangyadav), Dec 3, 2025:

nit: it would be better to use an enum for outputSwizzle.

Here, useHeuristic == true could make someone think that output swizzle is enabled, but "enableOutputSwizzle" has a different meaning here.

Contributor Author (dhernandez0):

done
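
A sketch of the suggested enum (illustrative; the actual name and placement in the PR may differ):

```cpp
#include <cstdint>

// Encodes the three output_swizzle tuning states instead of a pair of bools.
enum class OutputSwizzleMode : int64_t { Disabled = 0, Enabled = 1, Heuristic = 2 };
```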

@umangyadav (Member) left a comment

Changes look good.

std::vector<std::vector<int64_t>> finetuningParams = {
{0, 1}, // outputSwizzle
wavesPerEUList, // wavesPerEU
{0, 4, 8, 16, 32, 64}}; // gridGroupSize
Member:

Is there a reason why it is a multiple of 4?

Contributor Author (dhernandez0):

Not really; this corresponds to GROUP_SIZE_M in Triton I think, and to WGM in HipKittens: https://github.com/HazyResearch/HipKittens/blob/7f6986b502396aa865c0c80625121daf7caa756d/kernels/gemm/bf16fp32/mi350x/256_256_64_32_with16x32.cpp#L60.

So I've seen 64, 8, 4... in other codebases, but this requires more investigation.

@dhernandez0 (Contributor Author) commented:

I got some results comparing this branch against exhaustive tuning on develop. Exhaustive wins in 86 cases (<0.95x) and this branch (greedy) wins in 68 cases (>1.05x). The min is 0.75x and the max 1.45x (greedy TFlops / exhaustive TFlops). So I guess that until we find a way to incorporate the new params into exhaustive, we want to run both exhaustive + greedy to determine the best perf.

@pabloantoniom (Contributor) left a comment

LGTM


 NUM_RANDOM_PERFCONFIGS_PER_TILE_SIZE, RND_SEED);
-} else {
+} else if (settings.iteration == 1) {
 // Second iteration: brute force with winning tile sizes
Contributor:

I would change the comments to:

// Second iteration: brute force (except waves_per_eu, output_swizzle and grid_group_size, which we hardcode to use the heuristic) with winning tile sizes
// Third iteration: brute force the remaining configs (waves_per_eu, output_swizzle and grid_group_size)

Contributor Author (dhernandez0):

done

@pabloantoniom (Contributor) commented:

> I think that's easier to understand? We could have a tool (bin/print_perfconfig v3:...) that converts v3:... to the #rock.mfma_gemm_params format, which would be easier than generating the whole IR.

Regarding 2, I meant that when we get the tuning output from tuningRunner.py, we see the perfConfig in plain text, e.g. v3:32,32,32,32,32,32,16,1,1,1,2,0,1. So maybe tuningRunner.py (and the rest of the scripts) should incorporate some function to convert that raw text representation into something more readable. To me something like v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1] is more readable because I can quickly see that the tile sizes are 32,32,32, the second group corresponds to the schedule version etc., and the last group corresponds to the 3 params tuned in the 3rd greedy iteration. It's just some visual help.

@dhernandez0 (Contributor Author) commented:

> Regarding 2, I meant that when we get the tuning output from tuningRunner.py, we see the perfConfig in plain text, e.g. v3:32,32,32,32,32,32,16,1,1,1,2,0,1.

sounds good to me, could you open a ticket about this?
