
Conversation

@HydraQYH (Contributor) commented Oct 24, 2025

Recently, I wanted to use PDL to optimize SM90 Blockwise Grouped GEMM in a project. After reading the CUTLASS code, I noticed that PDL only supports the general GEMM kernels and does not support Array GEMM (Grouped GEMM): cutlass::arch::wait_on_dependent_grids() and cutlass::arch::launch_dependent_grids() only appear in these two files (a minimal usage sketch follows the file list):

  • include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp
  • include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp
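
For context, here is a minimal sketch of where these two hints typically sit in a kernel body. This is illustrative only, not an excerpt from either file; the header path and the kernel structure are my assumptions.

// Minimal sketch (not an excerpt from the CUTLASS kernels) of the two PDL hints;
// header path and structure are assumptions.
#include "cutlass/arch/grid_dependency_control.h"

CUTLASS_DEVICE void pdl_hint_sketch() {
  // Wait for the prior (dependent-on) grid to finish before reading anything it
  // may still be writing, e.g. device-resident problem shapes or input tensors.
  cutlass::arch::wait_on_dependent_grids();

  // ... mainloop and epilogue work ...

  // Once this grid no longer needs its global-memory inputs, hint that the next
  // grid may start early. The timing only influences performance, not correctness.
  cutlass::arch::launch_dependent_grids();
}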

Therefore, I mimicked these two files to add PDL support to Array GEMM. I still have two questions about the current code implementation:

  1. In the general GEMM kernels, cutlass::arch::wait_on_dependent_grids() is not called by all producer warps. In Cooperative, the producer warps that call it are the scheduler, mainloop, and epilogue warps, but not the mainloopAux warp; in Pingpong, they are the mainloop, mainloopAux, and epilogue warps, but not the scheduler warp. Why is this? (A code paraphrase of the Cooperative case follows this list.)
  2. In Array GEMM, can cutlass::arch::launch_dependent_grids() be moved to before collective_epilogue.store?
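
To make question 1 concrete, here is a paraphrase of the Cooperative pattern in code form. The role names are taken from the description above; this is not a verbatim excerpt from the kernel.

// Paraphrase of the Cooperative producer-side pattern from question 1; the real
// kernel's branch structure differs, this only shows which roles issue the wait.
if (producer_warp_role == ProducerWarpRole::Scheduler ||
    producer_warp_role == ProducerWarpRole::Mainloop  ||
    producer_warp_role == ProducerWarpRole::Epilogue) {
  // These producer roles wait on the prior grid before touching its outputs.
  cutlass::arch::wait_on_dependent_grids();
}
// ProducerWarpRole::MainloopAux does not issue the wait in Cooperative; in
// Pingpong, MainloopAux does and Scheduler does not.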

Fix: #2760

if (producer_warp_role == ProducerWarpRole::Scheduler) {
  // GroupScheduler requires a producer warp to iterate over the group infos and push
  // the work tile infos to the downstream pipelines.
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Collaborator:

It might be safer / simpler to hoist these waits above the warp-specialized region. Further optimization, i.e. pulling the waits into the specialized regions, should be driven by performance data; do you happen to have any details you can share?

++ @ANIKET-SHIVAM, @depaulmillz for review.

I see the initial work-tile info & tensormap updates inside the consumer regions are not guarded by / waiting on prior-grid completion; is that fine? I would have thought that if the problem shape is on device, you'd need the prior / dependent-on grid to complete before reading it.
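
For illustration, a rough sketch of what the suggested hoisting would look like; the structure and names below are simplified stand-ins, not the actual kernel code.

// Rough sketch of the suggested hoisting; WarpGroupRole and the branch layout
// are simplified stand-ins for the real warp-specialized structure.
CUTLASS_DEVICE void kernel_operator_sketch() {
  // Issued once, before the warp-specialized branches, so every role (including
  // the consumers that read device-side problem shapes and update tensormaps)
  // observes the prior grid's completion.
  cutlass::arch::wait_on_dependent_grids();

  if (warp_group_role == WarpGroupRole::Producer) {
    // scheduler / mainloop / epilogue producer paths ...
  }
  else {
    // consumer paths: initial work-tile info and tensormap updates are now guarded ...
  }
}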

@HydraQYH (Contributor, Author):

Thank you for your reply. After analysis and testing, I hoisted these waits above the warp-specialized region, for two main reasons:

  1. Placing waits within or outside the WS region yields almost identical performance in my scenario.
  2. I think your second point is correct: work-tile info & tensormap updates inside the consumer regions should be guarded.

I rebased the code and adjusted the position of the waits; it's ready for review.

@HydraQYH force-pushed the dev_support_pdl_for_sm90_gemm_array_tma_ws branch from b0a83c0 to b0f28c1 on November 17, 2025 01:28

// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Contributor:

Nit: You don't have to wrap these stubs with #ifdef - #endif. wait_on_dependent_grids() and launch_dependent_grids() already do this for you.
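
For reference, the stubs are roughly guarded like this internally; the macro name and PTX string below are paraphrased from memory, not copied from the CUTLASS header.

// Rough sketch of why the outer #ifdef is redundant: the hint compiles to a
// no-op when GDC support is not enabled.
CUTLASS_DEVICE
void launch_dependent_grids_sketch() {
#if defined(CUTLASS_GDC_ENABLED)   // assumed internal macro name
  asm volatile("griddepcontrol.launch_dependents;");
#endif
  // Otherwise the body is empty, so a guarding `if` around the call can be
  // optimized away entirely.
}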

@HydraQYH (Contributor, Author):

Thank you for pointing out the problem. I checked the code and found that this was indeed the case. I have removed the unnecessary #ifdef / #endif guards.


// Get next work tile
auto [next_work_tile_info, increment_pipe] = scheduler.fetch_next_work(work_tile_info, tile_scheduler_pipeline, tile_scheduler_pipe_consumer_state);
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Collaborator:

Consider removing the ifdef, since launch_dependent_grids internally checks this, and the compiler should be able to remove the if statement when the function has an empty body on unsupported targets.

@HydraQYH (Contributor, Author) commented Nov 18, 2025:

Thanks for the reminder. You are right: the compiler should be able to remove the if statement when the function has an empty body.

@HydraQYH force-pushed the dev_support_pdl_for_sm90_gemm_array_tma_ws branch from 8ed043b to 7d40287 on November 18, 2025 01:17
@HydraQYH (Contributor, Author) commented Nov 18, 2025

@Algy @d-k-b I've noticed that even the general GEMM kernels contain unnecessary macro guards:

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
if (scheduler.is_last_tile(work_tile_info, NumMmaWarpGroups)) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
}
#endif

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
if (scheduler.is_last_tile(work_tile_info)) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
}
#endif

Even in the TMA WS Pingpong GEMM, the early check is skipped entirely if CUTLASS_ENABLE_GDC_FOR_SM90 is not defined:

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
// It is possible to have work tiles start off invalid,
// so we have to check that first.
if (not work_tile_info.is_valid()) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
  return;
}
#endif

Should we fix it?
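
For example, the first snippet could simply read like this once the redundant guard is dropped (a sketch of the proposed cleanup, not a committed change):

if (scheduler.is_last_tile(work_tile_info, NumMmaWarpGroups)) {
  // Hint on an early release of global memory resources.
  // The timing of calling this function only influences performance,
  // not functional correctness.
  cutlass::arch::launch_dependent_grids();
}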

@d-k-b (Collaborator) commented Nov 18, 2025

@hwu36 -- what are your thoughts on removing the ifdef checks from other locations?

@hwu36 (Collaborator) commented Nov 18, 2025

They can be removed, though maybe not in this PR.

@IonThruster (Collaborator):

We noticed there are more kernels that need this fix, so we are pushing a basic / safe version into the 4.3 release (coming out very soon). Would it be possible to wait for that and rebase this PR on top, or see if that is enough?

@HydraQYH (Contributor, Author):

> We noticed there are more kernels that need this fix, so we are pushing a basic / safe version into the 4.3 release (coming out very soon). Would it be possible to wait for that and rebase this PR on top, or see if that is enough?

@IonThruster OK.

@Algy (Contributor) commented Nov 20, 2025

@HydraQYH I've found that some more SM90 kernels need to be fixed in the same way. See the issue I posted before.

@HydraQYH (Contributor, Author):

> @HydraQYH I've found that some more SM90 kernels need to be fixed in the same way. See the issue I posted before.

@Algy NVIDIA engineers have also discovered this issue and will fix it in version 4.3. cc @IonThruster @hwu36

@IonThruster (Collaborator):

@Algy, @HydraQYH: could you check whether the changes on the main branch look good (they are merged now), or whether you'd like to rebase this PR?

@HydraQYH force-pushed the dev_support_pdl_for_sm90_gemm_array_tma_ws branch from 7d40287 to 79d03b1 on November 21, 2025 03:06
@HydraQYH (Contributor, Author):

> @Algy, @HydraQYH: could you check whether the changes on the main branch look good (they are merged now), or whether you'd like to rebase this PR?

@IonThruster It seems that the main branch only addresses the race condition raised in #2760.

There are still two problems that need to be solved:

  1. Enable cutlass::arch::launch_dependent_grids() in Array GEMM.
  2. Remove the unnecessary #ifdef / #endif guards.

I have rebased the code and fixed the two issues mentioned above; it's ready for review. cc @hwu36

@Algy (Contributor) commented Nov 21, 2025

@HydraQYH @IonThruster

I can confirm the correctness issue is fixed in the main branch, though no cutlass::arch::launch_dependent_grids() calls are found in those kernels. I'm OK with the fix, but it appears @HydraQYH needs the dependent-grid launch for the sake of performance, right?

@HydraQYH (Contributor, Author):

> @HydraQYH @IonThruster
>
> I can confirm the correctness issue is fixed in the main branch, though no cutlass::arch::launch_dependent_grids() calls are found in those kernels. I'm OK with the fix, but it appears @HydraQYH needs the dependent-grid launch for the sake of performance, right?

Yes!

@IonThruster (Collaborator):

++ @Junkai-Wu, @hwu36



Successfully merging this pull request may close these issues: [BUG] Race condition causing correctness issue of SM90 array/grouped gemms when PDL enabled