Support PDL for SM90 Array TMA GEMM #2719
Conversation
```cpp
if (producer_warp_role == ProducerWarpRole::Scheduler) {
  // GroupScheduler requires a producer warp to iterate over the group infos and push
  // the work tile infos to the downstream pipelines.
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
```
It might be safer / simpler to hoist these waits above the warp-specialized region. Further optimization, i.e. pulling them into the specialized regions, should be driven by performance data; do you happen to have any details you can share?
++ @ANIKET-SHIVAM, @depaulmillz for review.
I see the initial work-tile info and tensormap updates inside the consumer regions are not guarded by a wait for the prior grid to complete; is that fine? I would have thought that if the problem shape is on device, you'd need the prior/dependent grid to complete before reading it. A sketch of the hoisted layout follows below.
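A minimal sketch of the hoisted-wait layout, assuming a simplified stand-in kernel (this is illustrative CUDA, not the actual CUTLASS kernel source; `warp_specialized_kernel` and its parameters are placeholders):

```cpp
#include "cutlass/arch/grid_dependency_control.h"

// Illustrative kernel: the PDL wait is hoisted above the warp-specialized
// branch so that every warp observes prior-grid completion before reading
// device-resident state such as group problem shapes or tensormaps.
__global__ void warp_specialized_kernel(int const* problem_shapes, int* out) {
  // Hoisted wait: a no-op when GDC/PDL support is not compiled in, so no
  // #ifdef guard is needed at the call site.
  cutlass::arch::wait_on_dependent_grids();

  int warp_idx = threadIdx.x / 32;
  if (warp_idx == 0) {
    // Producer role (e.g. scheduler): safe to read device-side group infos.
    out[0] = problem_shapes[0];
  }
  else {
    // Consumer roles: safe to read work-tile info and update tensormaps.
  }
}
```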
Thank you for your reply. After analysis and testing, I hoisted these waits above the warp-specialized region. There are two main reasons for this:
- Placing the waits inside or outside the WS region yields almost identical performance in my scenario.
- I think your second point is correct: the work-tile info and tensormap updates inside the consumer regions should be guarded.
I have rebased the code and adjusted the position of the waits; it's ready for review.
Force-pushed from b0a83c0 to b0f28c1.
```cpp
// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
```
Nit: you don't have to wrap these stubs in `#ifdef`/`#endif`; `wait_on_dependent_grids()` and `launch_dependent_grids()` already do this for you.
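For reference, a simplified sketch of how the guard lives inside the helpers themselves, modeled on `cutlass/arch/grid_dependency_control.h` (the internal guard macro spelling here is paraphrased, not quoted):

```cpp
// The GDC check is internal to the helpers, so call sites need no
// conditional compilation.
CUTLASS_DEVICE
void wait_on_dependent_grids() {
#if defined(CUTLASS_GDC_ENABLED)              // paraphrased guard macro
  cudaGridDependencySynchronize();            // device-side PDL wait
#endif
}

CUTLASS_DEVICE
void launch_dependent_grids() {
#if defined(CUTLASS_GDC_ENABLED)              // paraphrased guard macro
  cudaTriggerProgrammaticLaunchCompletion();  // let dependent grids launch
#endif
}
```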
Thank you for pointing out the problem. I checked the code and found that this is indeed the case; I have removed the unnecessary `#ifdef`/`#endif` guards.
```cpp
// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
```
Consider removing the `#ifdef`, since `launch_dependent_grids()` internally checks this, and the compiler should be able to remove the call when the function has an empty body on unsupported architectures.
Thanks for the reminder. You are right: the compiler should be able to remove the call when the function has an empty body. A call-site sketch without the guard is below.
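A hedged sketch of the call site with the guard removed (the `is_last_tile` condition is a placeholder for the kernel's actual last-tile logic, not real CUTLASS API):

```cpp
// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(
    work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);

// No #ifdef needed: launch_dependent_grids() is an empty stub when GDC is
// disabled, and the compiler removes the empty inline call entirely.
if (scheduler.is_last_tile(work_tile_info)) {  // placeholder condition
  cutlass::arch::launch_dependent_grids();
}
```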
Force-pushed from 8ed043b to 7d40287.
@Algy @d-k-b I've noticed that even the general GEMM contains unnecessary macro guards:
- cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp, lines 871 to 879 in a243955
- cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp, lines 793 to 801 in a243955
Even in TMA WS Pingpong GEMM, the early check will be skipped if the condition at cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp, lines 803 to 814 in a243955, holds. A paraphrased illustration of the pattern follows below.
Should we fix it?
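To make the concern concrete, here is a paraphrased illustration of the pattern, not the exact lines referenced above: when the `#ifdef` wraps surrounding control flow rather than just the PDL call, that control flow only exists in GDC-enabled builds.

```cpp
// Paraphrased anti-pattern: the whole early-exit block, not just the PDL
// call, is compiled out when CUTLASS_ENABLE_GDC_FOR_SM90 is undefined.
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
if (not work_tile_info.is_valid()) {
  // Signal dependent grids and exit early; skipped in non-GDC builds.
  cutlass::arch::launch_dependent_grids();
  return;
}
#endif
```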
@hwu36 -- what are your thoughts on removing these?
They can be removed, though maybe not in this PR.
We noticed there are more kernels that need this fix, so we are pushing a basic/safe version into the 4.3 release (coming out very soon). Would it be possible to wait for that and rebase this PR on top, or see if that is enough?
@IonThruster OK.
@HydraQYH I've found that some more SM90 kernels need to be fixed in the same way. See the issue I posted earlier.
@Algy NVIDIA engineers have also discovered this issue and will fix it in version 4.3. cc @IonThruster @hwu36 |
Force-pushed from 7d40287 to 79d03b1.
@IonThruster It seems that the main branch only addresses the race condition raised in #2760; there are still two problems that need to be solved.
I have rebased the code and fixed the two issues mentioned above, and it's ready for review. cc @hwu36
I find that the correctness issue is fixed in the main branch, though no …
Yes!
++ @Junkai-Wu, @hwu36
Recently, I wanted to use PDL to optimize SM90 Blockwise Grouped GEMM in a project. After reading the CUTLASS code, I noticed that PDL only supports general GEMM and does not support Array GEMM (Grouped GEMM): `cutlass::arch::wait_on_dependent_grids()` and `cutlass::arch::launch_dependent_grids()` only appear in these two files:
- cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp
- cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp

Therefore, I mimicked these two files to add PDL support to Array GEMM, and I still have two questions about the current implementation:
1. `cutlass::arch::wait_on_dependent_grids()` is not called by all producer warps. For example, in Cooperative, the producer warps that call `cutlass::arch::wait_on_dependent_grids()` include the scheduler warp, the mainloop warp, and the epilogue warp, but not the mainloopAux warp. However, in Pingpong, the producer warps that call it include the mainloop warp, the mainloopAux warp, and the epilogue warp, but not the scheduler warp. Why is this?
2. Could `cutlass::arch::launch_dependent_grids()` be advanced to before the `collective_epilogue.store`?

Fix: #2760
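For context on PDL generally: the device-side wait/launch pair only has an effect when the dependent kernel is launched with programmatic stream serialization enabled. A minimal host-side sketch using the raw CUDA runtime API (this is generic CUDA usage, not the CUTLASS `GemmUniversalAdapter` launch path; the kernel names and launch dimensions are placeholders):

```cpp
#include <cuda_runtime.h>

__global__ void producer_kernel() { /* placeholder: the prior grid */ }
__global__ void dependent_kernel() { /* placeholder: the PDL-enabled GEMM */ }

void launch_with_pdl(cudaStream_t stream) {
  producer_kernel<<<108, 128, 0, stream>>>();

  // Opt the second launch into PDL: its prologue may overlap the tail of the
  // prior grid, and it must wait (wait_on_dependent_grids() on device)
  // before consuming the prior grid's output.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim  = dim3(108);
  config.blockDim = dim3(128);
  config.stream   = stream;
  config.attrs    = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, dependent_kernel);
}
```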