Need to implement logic that avoids the following scenario:
Bagging a model that trains iteratively and grows in memory usage with each iteration.
If all fold models train to their full number of iterations and then save at the same time, they consume more total memory than estimated.
Because each fold model decides whether to early stop by comparing its own memory usage to the system's available memory, it can use more peak memory than its fair share.
Because LightGBM and XGBoost peak in memory usage at save / train finish, the following occurs (see the sketch after this list):
- All folds train until ~10k estimators (total mem: 24 GB, per-child: 2 GB, remaining: 8 GB)
- The first fold to finish saves and spikes to 5 GB used: remaining = 5 GB
- The second fold to finish saves and spikes to 5 GB used: remaining = 2 GB
- Early stopping due to low memory triggers for the remaining 6 models, so they all spike to 5 GB by saving at the same time: remaining = -16 GB -> OOM
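The arithmetic above can be reproduced with a small sketch. The constants are the hypothetical numbers from this scenario (8 folds, 24 GB total, ~2 GB steady state per fold, ~5 GB per fold during save), not measurements:

```python
# Walk-through of the failure case: per-fold checks against current free
# memory look fine right up until several folds hit their save spike at once.
TOTAL_MEM_GB = 24
N_FOLDS = 8
STEADY_GB = 2.0      # per-fold usage while training
SAVE_PEAK_GB = 5.0   # per-fold usage while saving / finishing

def remaining_memory(n_saving: int) -> float:
    """Free memory when `n_saving` folds are at their save spike simultaneously."""
    used = (N_FOLDS - n_saving) * STEADY_GB + n_saving * SAVE_PEAK_GB
    return TOTAL_MEM_GB - used

for n_saving in (0, 1, 2, 8):
    print(f"{n_saving} folds at save peak -> remaining = {remaining_memory(n_saving):.0f} GB")
# 0 -> 8 GB, 1 -> 5 GB, 2 -> 2 GB, 8 -> -16 GB (OOM)
```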
Logic to avoid this (sketched below):
- Max mem per child = 2.4 GB (give 20% overhead) -> pass as fit argument -> `mem_limit`
- Know that peak mem = 2.5x model size
- Once a child reaches 1 GB in size at ~5000 estimators, early stopping triggers -> spike to 2.5 GB
- 2.5 GB x 8 = 20 GB, still safe, doesn't go OOM, 4 GB remaining to spare on the machine
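A minimal sketch of this budgeting under the assumptions above. `mem_limit` is the proposed fit argument from the list; `child_mem_limit`, `should_early_stop`, and `PEAK_FACTOR` are hypothetical names for illustration, not existing APIs:

```python
# Per-child memory budget plus an early-stopping check based on projected
# save-time peak, instead of checking system-wide free memory per fold.
TOTAL_MEM_GB = 24
N_FOLDS = 8
OVERHEAD = 0.20      # reserve 20% headroom per child
PEAK_FACTOR = 2.5    # assumed: peak mem during save ~= 2.5x model size

def child_mem_limit(total_mem_gb: float = TOTAL_MEM_GB, n_folds: int = N_FOLDS) -> float:
    """Per-child budget to pass into fit() as `mem_limit`."""
    return (total_mem_gb / n_folds) * (1.0 - OVERHEAD)   # -> 2.4 GB

def should_early_stop(model_size_gb: float, mem_limit: float) -> bool:
    """Stop adding estimators once the projected save-time peak would exceed the budget."""
    return model_size_gb * PEAK_FACTOR >= mem_limit

limit = child_mem_limit()               # 2.4 GB per child
print(should_early_stop(0.5, limit))    # False: projected peak ~1.25 GB
print(should_early_stop(1.0, limit))    # True: projected peak ~2.5 GB >= 2.4 GB
# Worst case: 8 folds * 2.5 GB peak = 20 GB, leaving ~4 GB spare on a 24 GB machine.
```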