Hi! I'm trying to use the new Blackwell kernels you added in fbgemm_gpu/experimental/gen_ai/gen_ai/attention/cutlass_blackwell_fmha for efficient attention in xformers. However, the build runs out of memory (OOM) when I try to build the current fbgemm_gpu-gen_ai on an RTX 5090.
Is there any information you can share about the build process, specifically for the Blackwell kernels (e.g. required memory)?
Some logs:
[87/382] Building CUDA object experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped/bf16bf16bf16_grouped_128_128_128_1_1_1_9_f.cu.o
FAILED: [code=137] experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped/bf16bf16bf16_grouped_128_128_128_1_1_1_9_f.cu.o
/usr/local/cuda-12.9/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/asmjit/src -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cpuinfo/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/tools/util/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/composable_kernel/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/composable_kernel/library/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/json/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/src/attention/cuda/cutlass_blackwell_fmha -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/src/quantize -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/src/quantize/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/src/kv_cache -I/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/../../include -isystem /.venv/lib/python3.11/site-packages/torch/include -isystem /.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /.venv/lib/python3.11/site-packages/torch/include -isystem /.venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.9/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_120,code=sm_120 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-enum-enum-conversion -Wno-deprecated-declarations -Wno-unused-command-line-argument -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped/bf16bf16bf16_grouped_128_128_128_1_1_1_9_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped/bf16bf16bf16_grouped_128_128_128_1_1_1_9_f.cu.o.d -x cu -c /tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped/bf16bf16bf16_grouped_128_128_128_1_1_1_9_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped/bf16bf16bf16_grouped_128_128_128_1_1_1_9_f.cu.o
/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_sparse_mma_warpspecialized.hpp(902): warning #2908-D: the implicit by-copy capture of "this" is deprecated
Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_));
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(717): warning #2908-D: the implicit by-copy capture of "this" is deprecated
Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_));
^
/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(716): warning #2908-D: the implicit by-copy capture of "this" is deprecated
Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB));
^
/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_mixed_tma_cpasync_warpspecialized.hpp(554): warning #2908-D: the implicit by-copy capture of "this" is deprecated
Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_));
^
/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm103_blockscaled_mma_warpspecialized.hpp(682): warning #2908-D: the implicit by-copy capture of "this" is deprecated
Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(params.layout_SFB));
^
/tmp/tmpjnp_ryly/FBGEMM-0d49628d/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm103_blockscaled_mma_array_warpspecialized.hpp(810): warning #2908-D: the implicit by-copy capture of "this" is deprecated
Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB));
^
Killed
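For anyone reading the log above: exit code 137 follows the shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which fits with the kernel's OOM killer ending the nvcc process. Below is a minimal Python sketch for decoding such a code and for estimating how many parallel compile jobs fit in RAM. The helper names and the per-job memory figure are hypothetical; the actual memory needed per translation unit is not known from this issue.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Values above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return f"exited with status {code}"


def safe_parallel_jobs(total_mem_gib: float, per_job_gib: float, cpu_count: int) -> int:
    """Cap the number of parallel compile jobs so estimated peak memory fits in RAM.

    per_job_gib is an assumed peak memory per compiler process, not a measured value.
    """
    by_memory = int(total_mem_gib // per_job_gib)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    print(describe_exit_code(137))
    # Hypothetical machine: 64 GiB RAM, 32 cores, assumed 8 GiB per nvcc job.
    print(safe_parallel_jobs(total_mem_gib=64, per_job_gib=8, cpu_count=32))
```

The resulting number would then be passed to whatever job-count setting the build actually honors; which setting that is for this project is not confirmed here.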