Hi. I have a semantic segmentation model that takes both an RGB image and a depth image as input. Each input is processed by a separate Deformable Attention Transformer backbone, and their features are fused at an intermediate level. How can I visualize a heatmap on the depth image? From the tutorials I have seen, GradCAM only accepts a single input tensor; is that correct?
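For context, here is a minimal sketch of the workaround I am considering: wrap the two-input model so the RGB tensor is fixed and only the depth tensor is exposed, making the model look single-input to CAM code, then compute Grad-CAM manually with hooks on a layer of the depth backbone. The model here is a toy stand-in (all names like `DummyFusionSeg` and `DepthOnlyWrapper` are hypothetical placeholders for my actual architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a dual-backbone fusion segmentation model
# (hypothetical; the real model uses two Deformable Attention
# Transformer backbones with mid-level fusion).
class DummyFusionSeg(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.rgb_backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.depth_backbone = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, rgb, depth):
        # Mid-level fusion by channel concatenation (simplified).
        f = torch.cat([self.rgb_backbone(rgb), self.depth_backbone(depth)], dim=1)
        return self.head(f)  # (B, num_classes, H, W) segmentation logits

# Wrapper that fixes the RGB input, so the model accepts a single tensor.
class DepthOnlyWrapper(nn.Module):
    def __init__(self, model, rgb):
        super().__init__()
        self.model = model
        self.rgb = rgb

    def forward(self, depth):
        return self.model(self.rgb, depth)

def gradcam_on_depth(model, rgb, depth, target_layer, class_idx):
    """Grad-CAM for one class of a segmentation model, w.r.t. a depth-branch layer."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    wrapped = DepthOnlyWrapper(model, rgb)
    logits = wrapped(depth)                 # (1, num_classes, H, W)
    # For segmentation, sum the target-class logit over all pixels.
    score = logits[:, class_idx].sum()
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    # Channel weights = global average pool of gradients; weighted sum + ReLU.
    w = grads['g'].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((w * acts['a']).sum(dim=1))
    cam = cam / (cam.max() + 1e-8)          # normalize to [0, 1]
    return cam                              # (1, H', W'); upsample to input size if needed

rgb = torch.randn(1, 3, 32, 32)
depth = torch.randn(1, 1, 32, 32)
model = DummyFusionSeg()
cam = gradcam_on_depth(model, rgb, depth, model.depth_backbone[0], class_idx=1)
```

The same wrapper idea should also let libraries such as `pytorch-grad-cam` work unchanged, since they only ever see the single depth tensor; the resulting heatmap can then be resized and overlaid on the depth image. I am unsure whether this is the recommended approach, hence the question.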