[TabArena] OOM Prevention during Bagging #135

@Innixma

Description

Need to implement logic that avoids the following scenario:

Bagging a model which trains iteratively and increases in memory usage each iteration.

If all fold models train to full iterations and then save at the same time, they consume more total memory than estimated.

Because each fold model decides whether to early stop by comparing its own memory usage against the system's available memory, it can individually use more peak memory than it should.

Because LightGBM and XGBoost peak in memory usage during save / train finish, the following occurs (see the sketch after this list):

  1. All folds train until ~10k estimators (total mem: 24 GB, per-child: 2 GB, remaining: 8 GB).
  2. The first to finish saves and spikes to 5 GB used: remaining = 5 GB.
  3. The second to finish saves and spikes to 5 GB used: remaining = 2 GB.
  4. Early stopping due to low memory triggers for the remaining 6 models, so they all spike to 5 GB by saving at the same time: remaining = -16 GB -> OOM.
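A minimal sketch of the failure mode, reusing the numbers from the scenario above. It assumes, for illustration only, that each fold independently compares its usage against the shared free system memory; this is not the exact TabArena/AutoGluon implementation.

```python
# Minimal sketch of the failure mode, mirroring the numbers above.
TOTAL_MEM_GB = 24
N_FOLDS = 8
STEADY_MEM_GB = 2   # per-child memory while training (~10k estimators)
SAVE_SPIKE_GB = 5   # per-child peak while saving / finishing training

free = TOTAL_MEM_GB - N_FOLDS * STEADY_MEM_GB  # 8 GB remaining

# Folds 1 and 2 finish and save one after another; each spike adds 3 GB.
for _ in range(2):
    free -= SAVE_SPIKE_GB - STEADY_MEM_GB
print(free)  # 2 GB left

# The remaining 6 folds each see low *system* memory, early stop,
# and all save (spike) at the same time.
free -= 6 * (SAVE_SPIKE_GB - STEADY_MEM_GB)
print(free)  # -16 GB -> OOM
```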

Logic to avoid this (see the sketch after this list):

  1. Max mem per child = 2.4 GB (give 20% overhead) -> pass as a fit argument -> mem_limit.
  2. Know that peak mem = 2.5x model size.
  3. Once a child reaches 1 GB model size at 5000 estimators, early stopping triggers -> spike to 2.5 GB.
  4. 2.5 GB x 8 = 20 GB, still safe, doesn't go OOM; 4 GB remaining to spare on the machine.
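A hedged sketch of the proposed per-child budget, assuming the 2.5x peak factor and 20% overhead from the list above. The names `mem_limit_gb`, `PEAK_FACTOR`, and `should_early_stop` are illustrative, not an existing fit-argument API.

```python
# Illustrative per-child budget; names are hypothetical, not the final API.
PEAK_FACTOR = 2.5          # assumption: peak mem during save ~= 2.5x model size
mem_limit_gb = 2.0 * 1.2   # estimated per-child usage (2 GB) + 20% overhead = 2.4 GB


def should_early_stop(model_size_gb: float) -> bool:
    """Stop training once the projected save spike would exceed the child's budget."""
    return model_size_gb * PEAK_FACTOR > mem_limit_gb


# A child reaching ~1 GB of model size (around 5000 estimators) triggers
# early stopping: projected spike = 2.5 GB per child.
assert should_early_stop(1.0)

# Worst case: all 8 folds spike together -> 2.5 GB x 8 = 20 GB,
# which still fits on the 24 GB machine with 4 GB to spare.
worst_case_gb = 8 * 1.0 * PEAK_FACTOR
assert worst_case_gb <= 24 - 4
```

Even in the worst case where every child spikes at the same time, the total stays under the machine's memory, which is the property the naive per-fold check fails to guarantee.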
