
Conversation

@HydraQYH (Contributor) commented Oct 24, 2025

Recently, I wanted to use PDL to optimize SM90 Blockwise Grouped GEMM in a project. After reading the CUTLASS code, I noticed that PDL only supports the general GEMM kernels and does not support Array GEMM (Grouped GEMM): cutlass::arch::wait_on_dependent_grids() and cutlass::arch::launch_dependent_grids() only appear in these two files (a minimal usage sketch follows the file list):

  • include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp
  • include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp
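
For context, here is a minimal sketch of where these two hints typically sit in a kernel body. This is illustrative only, not an excerpt from either file; the header path and the kernel structure are my assumptions.

// Minimal sketch (not an excerpt from the CUTLASS kernels) of the two PDL hints;
// header path and structure are assumptions.
#include "cutlass/arch/grid_dependency_control.h"

CUTLASS_DEVICE void pdl_hint_sketch() {
  // Wait for the prior (dependent-on) grid to finish before reading anything it
  // may still be writing, e.g. device-resident problem shapes or input tensors.
  cutlass::arch::wait_on_dependent_grids();

  // ... mainloop and epilogue work ...

  // Once this grid no longer needs its global-memory inputs, hint that the next
  // grid may start early. The timing only influences performance, not correctness.
  cutlass::arch::launch_dependent_grids();
}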

Therefore, I mimicked these two files to add PDL support to Array GEMM. I still have two questions about the current code implementation:

  1. In the general GEMM kernels, cutlass::arch::wait_on_dependent_grids() is not called by all producer warps. In Cooperative, the producer warps that call it are the scheduler, mainloop, and epilogue warps, but not the mainloopAux warp; in Pingpong, they are the mainloop, mainloopAux, and epilogue warps, but not the scheduler warp. Why is this? (A code paraphrase of the Cooperative case follows this list.)
  2. In Array GEMM, can cutlass::arch::launch_dependent_grids() be moved to before collective_epilogue.store?
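
To make question 1 concrete, here is a paraphrase of the Cooperative pattern in code form. The role names are taken from the description above; this is not a verbatim excerpt from the kernel.

// Paraphrase of the Cooperative producer-side pattern from question 1; the real
// kernel's branch structure differs, this only shows which roles issue the wait.
if (producer_warp_role == ProducerWarpRole::Scheduler ||
    producer_warp_role == ProducerWarpRole::Mainloop  ||
    producer_warp_role == ProducerWarpRole::Epilogue) {
  // These producer roles wait on the prior grid before touching its outputs.
  cutlass::arch::wait_on_dependent_grids();
}
// ProducerWarpRole::MainloopAux does not issue the wait in Cooperative; in
// Pingpong, MainloopAux does and Scheduler does not.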

Fix: #2760

if (producer_warp_role == ProducerWarpRole::Scheduler) {
  // GroupScheduler requires a producer warp to iterate over the group infos and push
  // the work tile infos to the downstream pipelines.
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Collaborator:

It might be safer / simpler to hoist these waits above the warp-specialized region. Further optimization, i.e. pulling the waits into the specialized regions, should be driven by performance data; do you happen to have any details you can share?

++ @ANIKET-SHIVAM, @depaulmillz for review.

I see the initial work-tile info & tensormap updates inside the consumer regions are not guarded by / waiting on prior-grid completion; is that fine? I would have thought that if the problem shape is on device, you'd need the prior / dependent-on grid to complete before reading it.
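
For illustration, a rough sketch of what the suggested hoisting would look like; the structure and names below are simplified stand-ins, not the actual kernel code.

// Rough sketch of the suggested hoisting; WarpGroupRole and the branch layout
// are simplified stand-ins for the real warp-specialized structure.
CUTLASS_DEVICE void kernel_operator_sketch() {
  // Issued once, before the warp-specialized branches, so every role (including
  // the consumers that read device-side problem shapes and update tensormaps)
  // observes the prior grid's completion.
  cutlass::arch::wait_on_dependent_grids();

  if (warp_group_role == WarpGroupRole::Producer) {
    // scheduler / mainloop / epilogue producer paths ...
  }
  else {
    // consumer paths: initial work-tile info and tensormap updates are now guarded ...
  }
}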

@HydraQYH (Contributor, Author):

Thank you for your reply. After analysis and testing, I hoisted these waits above the warp-specialized region, for two main reasons:

  1. Placing waits within or outside the WS region yields almost identical performance in my scenario.
  2. I think your second point is correct: work-tile info & tensormap updates inside the consumer regions should be guarded.

I rebased the code and adjusted the position of the waits; it's ready for review.

@HydraQYH force-pushed the dev_support_pdl_for_sm90_gemm_array_tma_ws branch from b0a83c0 to b0f28c1 on November 17, 2025 01:28

// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Contributor:

Nit: You don't have to wrap these stubs with #ifdef - #endif. wait_on_dependent_grids() and launch_dependent_grids() already do this for you.
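
For reference, the stubs are roughly guarded like this internally; the macro name and PTX string below are paraphrased from memory, not copied from the CUTLASS header.

// Rough sketch of why the outer #ifdef is redundant: the hint compiles to a
// no-op when GDC support is not enabled.
CUTLASS_DEVICE
void launch_dependent_grids_sketch() {
#if defined(CUTLASS_GDC_ENABLED)   // assumed internal macro name
  asm volatile("griddepcontrol.launch_dependents;");
#endif
  // Otherwise the body is empty, so a guarding `if` around the call can be
  // optimized away entirely.
}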

@HydraQYH (Contributor, Author):

Thank you for pointing out the problem. I checked the code and found that this was indeed the case. I have removed the unnecessary #ifdef / #endif guards.


// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Collaborator:

Consider removing the ifdef, since launch_dependent_grids internally checks this, and the compiler should be able to remove the if statement when the function has an empty body on unsupported targets.

@HydraQYH (Contributor, Author) commented Nov 18, 2025:

Thanks for the reminder. You are right: the compiler should be able to remove the if statement when the function has an empty body.

@HydraQYH force-pushed the dev_support_pdl_for_sm90_gemm_array_tma_ws branch from 8ed043b to 7d40287 on November 18, 2025 01:17
@HydraQYH (Contributor, Author) commented Nov 18, 2025

@Algy @d-k-b I've noticed that even the general GEMM kernels contain unnecessary macro guards:

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
if (scheduler.is_last_tile(work_tile_info, NumMmaWarpGroups)) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
}
#endif

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
if (scheduler.is_last_tile(work_tile_info)) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
}
#endif

Even in the TMA WS Pingpong GEMM, the early check is skipped entirely if CUTLASS_ENABLE_GDC_FOR_SM90 is not defined:

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
// It is possible to have work tiles start off invalid,
// so we have to check that first.
if (not work_tile_info.is_valid()) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
  return;
}
#endif

Should we fix it?
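
For example, the first snippet could simply read like this once the redundant guard is dropped (a sketch of the proposed cleanup, not a committed change):

if (scheduler.is_last_tile(work_tile_info, NumMmaWarpGroups)) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
}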

@d-k-b (Collaborator) commented Nov 18, 2025

@hwu36 -- what are your thoughts on removing the ifdef checks from other locations?

@hwu36 (Collaborator) commented Nov 18, 2025

They can be removed, though maybe not in this PR.

@IonThruster (Collaborator):

We noticed there are more kernels that need this fix, so we are pushing a basic / safe version into the 4.3 release (coming out very soon). Would it be possible to wait for that and rebase this PR on top, or see if that is enough?

@HydraQYH (Contributor, Author):

> We noticed there are more kernels that need this fix, so we are pushing a basic / safe version into the 4.3 release (coming out very soon). Would it be possible to wait for that and rebase this PR on top, or see if that is enough?

@IonThruster OK.

@Algy (Contributor) commented Nov 20, 2025

@HydraQYH I've found that some more SM90 kernels need to be fixed in the same way. See the issue I posted before.

@HydraQYH (Contributor, Author):

> @HydraQYH I've found that some more SM90 kernels need to be fixed in the same way. See the issue I posted before.

@Algy NVIDIA engineers have also discovered this issue and will fix it in version 4.3. cc @IonThruster @hwu36

@IonThruster (Collaborator):

@Algy, @HydraQYH: could you check whether the changes on the main branch look good (they are merged now), or whether you'd like to rebase this PR?

@HydraQYH force-pushed the dev_support_pdl_for_sm90_gemm_array_tma_ws branch from 7d40287 to 79d03b1 on November 21, 2025 03:06
@HydraQYH (Contributor, Author):

> @Algy, @HydraQYH: could you check whether the changes on the main branch look good (they are merged now), or whether you'd like to rebase this PR?

@IonThruster It seems that the main branch only addresses the race condition raised in #2760.

There are still two problems that need to be solved:

  1. Enable cutlass::arch::launch_dependent_grids() in Array GEMM.
  2. Remove the unnecessary #ifdef / #endif guards.

I have rebased the code and fixed the two issues mentioned above; it's ready for review. cc @hwu36

@Algy (Contributor) commented Nov 21, 2025

@HydraQYH @IonThruster

I can confirm the correctness issue is fixed in the main branch, though no cutlass::arch::launch_dependent_grids() calls are found in those kernels. I'm OK with the fix, but it appears @HydraQYH needs the dependent-grid launch for the sake of performance, right?

@HydraQYH (Contributor, Author):

> @HydraQYH @IonThruster
>
> I can confirm the correctness issue is fixed in the main branch, though no cutlass::arch::launch_dependent_grids() calls are found in those kernels. I'm OK with the fix, but it appears @HydraQYH needs the dependent-grid launch for the sake of performance, right?

Yes!

@IonThruster (Collaborator):

++ @Junkai-Wu, @hwu36



Successfully merging this pull request may close these issues: [BUG] Race condition causing correctness issue of SM90 array/grouped gemms when PDL enabled