OOM when training #84

Description

@LanesraL

Hi,
Thank you for your excellent work on this project!

I'm currently trying to train on the BlendedMVS dataset using 8x H100 GPUs (80GB each). I launched training via bash bash_scripts/train/examples/mapa_curri_4v_bmvs_48ipg_8g.sh 8, but I'm hitting an Out-of-Memory (OOM) error despite several optimizations.

To mitigate the issue, I've already:

  • Reduced the number of views to 2.
  • Increased accum_iter to 12.
  • Enabled gradient checkpointing by setting model.info_sharing.module_args.gradient_checkpointing=true and model.pred_head.gradient_checkpointing=true.
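For completeness, here is how I combine the overrides above into a single launch. (This is how I invoke it locally; I'm assuming the script forwards Hydra-style key=value overrides to the trainer — please correct me if it expects a different form.)

```shell
# Hypothetical invocation sketch: launch script with the overrides listed above
bash bash_scripts/train/examples/mapa_curri_4v_bmvs_48ipg_8g.sh 8 \
    model.info_sharing.module_args.gradient_checkpointing=true \
    model.pred_head.gradient_checkpointing=true
```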

However, the OOM error persists. What confuses me is that the model only has around 500M parameters, so why does it require such a large amount of memory?
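For a rough sense of scale, my own back-of-envelope estimate (an illustrative sketch, not the project's actual numbers, assuming plain fp32 Adam/AdamW) puts the parameter-related memory at only about 8 GB, so I suspect activations are what is blowing past 80 GB:

```python
# Back-of-envelope memory for a ~500M-parameter model trained in fp32
# with Adam/AdamW. Activations are excluded here, and they usually
# dominate for multi-view, high-resolution inputs.
n_params = 500e6
bytes_per_fp32 = 4

weights = n_params * bytes_per_fp32        # model weights
grads = n_params * bytes_per_fp32          # gradients
optimizer = 2 * n_params * bytes_per_fp32  # Adam first and second moments

total_gb = (weights + grads + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations")
```

Even with this static cost well under 80 GB per GPU, activation memory grows with views, resolution, and per-GPU batch size, which may explain why reducing views alone wasn't enough.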

Could you please provide some insights or suggestions on how to resolve this? Thank you in advance for your help!

Metadata

Labels: question (further information is requested)