-
Notifications
You must be signed in to change notification settings - Fork 50
Greedy third iteration #2140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Greedy third iteration #2140
Conversation
ee9aad1 to
31dc230
Compare
|
|
||
| bool checkVersionFormat(const std::string &s) { | ||
| const int32_t maxNumTokensArray[] = {0, 8, 9, 11, 12}; | ||
| const int32_t maxNumTokensArray[] = {0, 8, 9, 11, 14}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we haven't done a release since v4 (I think), is it ok to keep the same version?
16e1449 to
af83be3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for tuning waves_per_eu, output_swizzle, and grid_group_size parameters as part of a third greedy iteration in the tuning process. The changes extend the V4 performance configuration format to include these parameters and update the codebase to propagate them through the compilation pipeline.
Key changes include:
- Extension of V4 perf config format from 12 to 14 parameters (adding
wavesPerEUandgridGroupSize) - Updates to Python test utilities to use heuristic defaults for new parameters
- Systematic updates to all MLIR test files to include the new parameters in
mfma_gemm_paramsandwmma_gemm_params - Implementation changes to read and use the new tuning parameters in passes
Reviewed changes
Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
parameterSweeps.py |
Added heuristic defaults (output_swizzle=2, waves_per_eu=0, grid_group_size=0) to test configurations |
attentionSweeps.py |
Extended parameter sweep ranges to include output_swizzle and wavesPerEU options |
InitParamsAccelTests.cpp |
Updated V4 config strings and added test assertions for new parameters |
rocmlir-gen.cpp |
Updated default perfConfig comment to include new parameters |
Multiple .mlir test files |
Systematically added wavesPerEU = 0, gridGroupSize = 0 to all gemm params |
GridwiseGemmParams.cpp |
Added parameter passing through getGemmParamsAttr functions |
OutputSwizzle.cpp |
Implemented tuning-based control of output swizzle pass |
GridwiseGemmToBlockwise.cpp |
Added function attributes for waves_per_eu and output_swizzle |
GridLayoutEmitter.cpp |
Implemented grid_group_size tuning parameter usage |
AffixTuningParameters.cpp |
Updated default perfConfig string |
RockDialect.cpp |
Extended V3 attention config to support wavesPerEU parameter |
Serializable.h |
Updated max token count for V4 config format |
waves_per_eu.mlir |
New test file for waves_per_eu functionality |
lowering_output_swizzle_tuning.mlir |
New test file for output swizzle tuning |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pabloantoniom
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- With this PR do we lose the ability to run only greedy 1st and 2nd iteration? If so, I think it is useful to add an option or flag to set which greedy iterations we run. 1st we always want it, but 2nd/3rd, it would be better if the user can choose.
- Now that we have even more parameters in the perfConfig, to me it's becoming very difficult to differentiate which parameter is which when you see a perfconfig, .e.g:
v3:32,32,32,32,32,32,16,1,1,1,2,0,1. Could we change the textual representation to something more readable? Like:v3:[32,32,32],[32,32,32,16,1,1,1],[2,0,1], where the first group would correspond to greedy iteration 1, the 2nd group to 2nd iteration and the 3rd group to 3rd iteration.
| >, | ||
| InterfaceMethod< | ||
| /*desc=*/[{ | ||
| Return param passed to the backend compiler: waves_per_eu. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description is not very clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewritten
| "int64_t":$nPerWave, "int64_t":$mnPerXdl, "int64_t":$kpack, | ||
| "int64_t":$splitKFactor, "int64_t":$scheduleVersion, | ||
| "int64_t":$outputSwizzle, "bool":$forceUnroll); | ||
| "int64_t":$outputSwizzle, "int64_t":$wavesPerEU, "bool":$forceUnroll); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not gridGroupSize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gridGroupSize is only used for GEMMs/convs, not attention/g+g kernels.
| f(self.gemmScheduleVersion); | ||
| f(self.outputSwizzle); | ||
| } | ||
| if (self.version >= Version::V4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why adding this here instead of adding it in the previous check if (self.version >= Version::V4)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we want to visit the params in the right order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see serialize() in Serializable.h
| gridSize = cast<IntegerAttr>(gridSizeAttr).getInt(); | ||
| gpuFunc.setKnownGridSizeAttr(b.getDenseI32ArrayAttr({gridSize, 1, 1})); | ||
|
|
||
| auto wavesPerEUAttr = theFunc->getAttr("waves_per_eu"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we get waves_per_eu, and not group_size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the order of the passes. Both output_swizzle and gridgroupsize have been used already.
| llvm_unreachable("Unknown version of the perfConfig"); | ||
| } | ||
| SmallVector<StringRef, 11> tokens; | ||
| SmallVector<StringRef, 13> tokens; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use expectedNumTokens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectedNumTokens is a variable, it wouldn't work here, we could change this to:
SmallVector<StringRef> tokens;
tokens.reserve(expectedNumTokens);
Happy to change it to this if you prefer it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it would look cleaner if we move SmallVector<int64_t, 13> params here and do:
SmallVector<StringRef> tokens;
SmallVector<StringRef> params
tokens.reserve(expectedNumTokens);
params .reserve(expectedNumTokens);
| IntegerAttr outputSwizzleAttr = | ||
| rewriter.getI64IntegerAttr(gemm0TuningParams.getOutputSwizzle()); | ||
| func::FuncOp funcOp = cast<func::FuncOp>(op->getParentOp()); | ||
| funcOp->setAttr("waves_per_eu", wavesPerEUAttr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR but, could we define attribute names strings (like waves_per_eu, output_swizzle and many others) in RockAttrDefs and use them instead of a hardcoded string? That would be more robust and cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, good point I'll do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| gridGroupSize}, | ||
| arch); | ||
|
|
||
| // some params are needed in future passes, add them to func as attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What future passes? Can you specify that in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewritten
| bool useHeuristic = true; | ||
| bool enableOutputSwizzle = false; | ||
| if (func->hasAttrOfType<IntegerAttr>("output_swizzle")) { | ||
| // 0 -> disabled, 1 -> enabled, 2 -> heuristic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me that 0 is disabled. At the same time, I think that this is inconsistent with other parameters where 0 means use heuristic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but outputswizzle was not really introduced here as a tuning param (it was already in develop but unused). So, all existing perfConfigs set outputswizzle=2. Therefore, the most reasonable thing to do IMO (to maintain backwards compatibility) is to keep the 2 as heuristic.
I can add the flag, however is it worth it? 3rd phase is adding 72 perfconfigs to test. It's not going to make things slower.
In practise, after affix_tuning you'll see: I think that's easier to understand? We can have a tool ( |
30c8aca to
8884d8b
Compare
| gpuFunc->setAttr("rocdl.waves_per_eu", b.getI32IntegerAttr(wavesPerEU)); | ||
| LLVM_DEBUG(llvm::dbgs() << "Setting waves_per_eu using tuning param\n"); | ||
| // we are done | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it returning from here ? Doesn't it need to set some other attributes ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we are inside op.walk([](gpu::GPUFuncOp gpuFunc) {, only two attributes are set: rock.shared_buffer_size and rocdl.waves_per_eu. The code below is just the heuristic to set rocdl.waves_per_eu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also was confused by this but yes, it looks correct. However, to make it more readable, we could move all the code below into a function like "runWavesPerEUHeuristic", that way it would be easier to understand why a return here
| func.walk([&writes, &rewriter, ldsAllocated, useHeuristic, | ||
| enableOutputSwizzle](ThreadwiseWriteAllOp threadwiseWriteAll) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
better to keep enum for outputSwizzle.
here useHeuristic == true would make someone think that output swizzle is enabled. But "enableOutputSwizzle" has a different meaning here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
umangyadav
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes Looks good
| std::vector<std::vector<int64_t>> finetuningParams = { | ||
| {0, 1}, // outputSwizzle | ||
| wavesPerEUList, // wavesPerEU | ||
| {0, 4, 8, 16, 32, 64}}; // gridGroupSize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why it is multiple of 4 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, this corresponds to GROUP_SIZE_M in triton I think, and WGM in HipKittens: https://github.com/HazyResearch/HipKittens/blob/7f6986b502396aa865c0c80625121daf7caa756d/kernels/gemm/bf16fp32/mi350x/256_256_64_32_with16x32.cpp#L60.
So, I've seen 64, 8, 4... In other codebases, but this requires more investigation.
|
I got some results of this branch vs develop exhaustive. So, exhaustive wins in 86 cases (<0.95x) and this branch (greedy) wins in 68 cases (>1.05x). The min is 0.75x and max 1.45x (greedy TFlops / exhaustive TFlops). So, I guess until we find a way to incorporate the new params to exhaustive, we want to run both exhaustive + greedy to determine the best perf. |
pabloantoniom
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| gpuFunc->setAttr("rocdl.waves_per_eu", b.getI32IntegerAttr(wavesPerEU)); | ||
| LLVM_DEBUG(llvm::dbgs() << "Setting waves_per_eu using tuning param\n"); | ||
| // we are done | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also was confused by this but yes, it looks correct. However, to make it more readable, we could move all the code below into a function like "runWavesPerEUHeuristic", that way it would be easier to understand why a return here
| llvm_unreachable("Unknown version of the perfConfig"); | ||
| } | ||
| SmallVector<StringRef, 11> tokens; | ||
| SmallVector<StringRef, 13> tokens; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it would look cleaner if we move SmallVector<int64_t, 13> params here and do:
SmallVector<StringRef> tokens;
SmallVector<StringRef> params
tokens.reserve(expectedNumTokens);
params .reserve(expectedNumTokens);
| NUM_RANDOM_PERFCONFIGS_PER_TILE_SIZE, RND_SEED); | ||
| } else { | ||
| } else if (settings.iteration == 1) { | ||
| // Second iteration: brute force with winning tile sizes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the comments to:
// Second iteration: brute force (except waves_per_eu, output_swizzle and grid_group_size, which we hardcode to use the heuristic) with winning tile sizes
// Third iteration: brute force the remaining configs (waves_per_eu, output_swizzle and grid_group_size)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Regarding 2. I meant when we get the tuning output from tuningRunner.py, we see the perfconfig in plaintext, e.g: v3:32,32,32,32,32,32,16,1,1,1,2,0,1. So I don't know, maybe tuningRunner.py (and the rest of the scripts) should incorporate some function to convert that raw text representation to something more readable. To me something like |
sounds good to me, could you open a ticket about this? |
Motivation
Add an extra greedy phase in order to tune for waves_per_eu, output_swizzle and grid_group_size.
Heuristic is the default value for other tuning kinds (and full/exhaustive), waves_per_eu=0, grid_group_size=0 mean use heuristic.
For output_swizzle -> 0->disabled, 1->enabled, 2->heuristic.
Technical Details
waves_per_euis a backend variable we use to let the compiler restrict the amount of registers it's going to use based on the amount of waves we want to have concurrently in the same EU.output_swizzleis a flag to enable/disable the output swizzle pass. This pass saves the output to LDS and loads it again in order to be able to use 128b writes to global memory.grid_group_sizeis the group_size we use (see https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations for an explanation of this optimization).Changes in this PR:
Perf results
https://github.com/ROCm/rocMLIR-internal/issues/2161#issuecomment-3606635921
Test Plan
Added tests for passes modified by this PR.
Test Result
All tests pass.
Submission Checklist
closes https://github.com/ROCm/rocMLIR-internal/issues/2161