For kernel 10 (warp tiling), I think the block parameters need additional constraints. For example:
const uint K10_NUM_THREADS = 128;
const uint K10_BN = 256;
const uint K10_BM = 128;
const uint K10_BK = 8;
const uint K10_WN = 256;
const uint K10_WM = 32;
const uint K10_WNITER = 1;
const uint K10_TN = 4;
const uint K10_TM = 8;
The above combination compiles without error but fails at run time with matrix size 256.
After swapping the two thread-tile sizes to
const uint K10_TN = 8;
const uint K10_TM = 4;
the kernel worked.
I think the constraint (K10_WM / K10_WMITER) % K10_TM == 0 && (K10_WN / K10_WNITER) % K10_TN == 0 is needed, unless the code is changed to handle this case.
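A minimal sketch of how this constraint could be checked at compile time. It assumes K10_WMITER is derived as (K10_WM * K10_WN) / (32 * K10_TM * K10_TN * K10_WNITER), which is not stated above, and the helper name warp_subtile_divisible is hypothetical. Under that assumption the failing configuration gives K10_WMITER = 8, so the warp subtile is only 32 / 8 = 4 rows tall, which K10_TM = 8 does not divide.

```cpp
// Sketch only (C++14 or later). The helper name and the K10_WMITER formula
// are assumptions, not taken from the post above.
constexpr bool warp_subtile_divisible(unsigned WM, unsigned WN, unsigned WNITER,
                                      unsigned TM, unsigned TN) {
    const unsigned WARPSIZE = 32;
    const unsigned denom = WARPSIZE * TM * TN * WNITER;
    // Reject degenerate inputs before dividing.
    if (denom == 0 || WM * WN == 0) return false;
    if ((WM * WN) % denom != 0) return false;
    // Assumed derivation of the number of warp-subtile iterations along M.
    const unsigned WMITER = (WM * WN) / denom;
    if (WM % WMITER != 0 || WN % WNITER != 0) return false;
    // Proposed constraint: the thread tile must evenly divide the warp subtile.
    return (WM / WMITER) % TM == 0 && (WN / WNITER) % TN == 0;
}

// Arguments are (WM, WN, WNITER, TM, TN).
// Failing configuration from the report: TM = 8, TN = 4.
// WMITER = 8192 / 1024 = 8, subtile rows = 32 / 8 = 4, and 4 % 8 != 0.
static_assert(!warp_subtile_divisible(32, 256, 1, 8, 4),
              "TM = 8, TN = 4 should be rejected");

// Working configuration from the report: TM = 4, TN = 8.
// WMITER = 8, subtile rows = 4, 4 % 4 == 0 and 256 % 8 == 0.
static_assert(warp_subtile_divisible(32, 256, 1, 4, 8),
              "TM = 4, TN = 8 should be accepted");
```

In the kernel launcher, the same condition could be written as a single static_assert on the K10_* constants, so that the first configuration is rejected at compile time instead of failing at run time.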