OOM when training #84

Description

@LanesraL

Hi,
Thank you for your excellent work on this project!

I'm currently trying to train on the BlendedMVS dataset using 8x H100 GPUs (80GB each). I launched training via bash bash_scripts/train/examples/mapa_curri_4v_bmvs_48ipg_8g.sh 8, but I'm hitting an Out-of-Memory (OOM) error despite several optimizations.

To mitigate the issue, I've already:

  • Reduced the number of views to 2.
  • Increased accum_iter to 12.
  • Enabled gradient checkpointing by setting model.info_sharing.module_args.gradient_checkpointing=true and model.pred_head.gradient_checkpointing=true.
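For completeness, here is how I combine the overrides above into a single launch. (This is how I invoke it locally; I'm assuming the script forwards Hydra-style key=value overrides to the trainer — please correct me if it expects a different form.)

```shell
# Hypothetical invocation sketch: launch script with the overrides listed above
bash bash_scripts/train/examples/mapa_curri_4v_bmvs_48ipg_8g.sh 8 \
    model.info_sharing.module_args.gradient_checkpointing=true \
    model.pred_head.gradient_checkpointing=true
```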

However, the OOM error persists. What confuses me is that the model only has around 500M parameters, so why does it require such a large amount of memory?
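For a rough sense of scale, my own back-of-envelope estimate (an illustrative sketch, not the project's actual numbers, assuming plain fp32 Adam/AdamW) puts the parameter-related memory at only about 8 GB, so I suspect activations are what is blowing past 80 GB:

```python
# Back-of-envelope memory for a ~500M-parameter model trained in fp32
# with Adam/AdamW. Activations are excluded here, and they usually
# dominate for multi-view, high-resolution inputs.
n_params = 500e6
bytes_per_fp32 = 4

weights = n_params * bytes_per_fp32        # model weights
grads = n_params * bytes_per_fp32          # gradients
optimizer = 2 * n_params * bytes_per_fp32  # Adam first and second moments

total_gb = (weights + grads + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations")
```

Even with this static cost well under 80 GB per GPU, activation memory grows with views, resolution, and per-GPU batch size, which may explain why reducing views alone wasn't enough.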

Could you please provide some insights or suggestions on how to resolve this? Thank you in advance for your help!

Metadata

Labels: question (further information is requested)