Memory Retention / Leak in BuildKit Workers Over Time #6310

@maicondssiqueira

Description

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have found a bug that the documentation does not mention anything about my problem
  • I have found a bug that there are no open or closed issues that are related to my problem
  • I have provided version/information about my environment and done my best to provide a reproducer

Description of bug

BuildKit version: 0.25.0 14d1ccb 14d1ccb56dbc5e1748c73cda77af2a61a5c3603a
Docker: 28.5.0

Mode: OCI worker (worker.oci)

Cluster: 6 workers, each handling ~100 builds/hour

Runtime: long-running, constant workload (~6h continuous execution)

GC config:
[worker.oci]
  enabled = true
  gc = true
  max-parallelism = 8
  memory = "4g"
  keepDuration = "60m"
  reservedSpace = "20%"
  maxUsedSpace = "80%"
  minFreeSpace = "10GB"

[history]
  maxAge = 0
  maxEntries = 0

[cdi]
  disabled = true

[worker.containerd]
  enabled = false

[[worker.oci.gcpolicy]]
  all = false
  filters = ["type==source.local", "type==exec.cachemount"]
  keepDuration = "60m"

[[worker.oci.gcpolicy]]
  all = true
  keepDuration = "120m"
  reservedSpace = "20%"
  maxUsedSpace = "80%"
  minFreeSpace = "5GB"

Problem Description

Even with GC fully enabled and short keepDuration settings, BuildKit workers do not release memory or clean up cached items over time.
Memory usage grows steadily and reaches the limit (~4 GB) after roughly 6 hours of continuous builds, at which point the workers are killed by the kernel (OOM) or must be manually restarted.

The only effective workaround so far is manually restarting the workers, which immediately resets memory consumption.

Observations

GC logs show cleanup events being triggered, but memory usage does not decrease.

The problem persists across all workers and environments.

No significant disk pressure is observed — the issue is isolated to memory retention.

Setting maxAge and maxEntries in [history] had no effect.

Profiling Data

A memory profile (alloc_space) was captured after ~6h of runtime and shows persistent allocations not being reclaimed.
Key hot paths (from heap.allocs.prof and flamegraph):

| Function / Package | % of Total Alloc Space | Notes |
| --- | --- | --- |
| io.Copy / io.copyBuffer | ~37% | Large persistent buffers (likely during layer export/copy) |
| contenthash.(*cacheContext).Checksum | ~18% | Repeated allocations for digest computation |
| cache.(*cacheManager).Prune | ~12% | GC called but memory not reclaimed |
| bbolt.(*DB).Update / Commit | ~17% | Retained pages from metadata updates |
| fsutil.(*DiskWriter).processChange | ~4–5% | Persistent diff-related allocations |
| sync.(*Once).Do / doSlow | ~40% (cumulative) | Possibly from flightcontrol routines |

Full profile excerpt:
“Showing nodes accounting for 65,993.65MB, 70.91% of 93,064.62MB total...”
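
As a rough sanity check on the totals quoted above, the arithmetic can be sketched as follows. The figures are taken from this report; the interpretation rests on assumptions: that the profile comes from a single worker, that it covers the full ~6 h of runtime, and that the worker handled ~100 builds/hour throughout. Note that pprof's alloc_space sample type counts bytes allocated cumulatively since process start, including memory that has already been freed; live memory is reported under inuse_space.

```python
# Figures quoted in the report; how they are combined is an assumption.
TOTAL_ALLOC_MB = 93_064.62   # pprof "total" for alloc_space
SHOWN_ALLOC_MB = 65_993.65   # "Showing nodes accounting for ..."
MEMORY_LIMIT_MB = 4 * 1024   # ~4 GB limit at which workers are OOM-killed
RUNTIME_HOURS = 6            # approximate runtime before the profile was taken
BUILDS_PER_HOUR = 100        # per worker


def profile_summary():
    """Derive simple ratios from the quoted alloc_space totals."""
    builds = RUNTIME_HOURS * BUILDS_PER_HOUR
    return {
        # Share of total allocations covered by the nodes pprof displayed.
        "shown_fraction": SHOWN_ALLOC_MB / TOTAL_ALLOC_MB,
        # Cumulative allocation rate, not memory growth.
        "alloc_mb_per_hour": TOTAL_ALLOC_MB / RUNTIME_HOURS,
        "alloc_mb_per_build": TOTAL_ALLOC_MB / builds,
        # Upper bound on the share of allocated bytes that can still be
        # live if the process is capped at the memory limit.
        "max_live_fraction": MEMORY_LIMIT_MB / TOTAL_ALLOC_MB,
    }


if __name__ == "__main__":
    for name, value in profile_summary().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, the worker allocated roughly 155 MB per build in total, while at most about 4.4% of all allocated bytes can be live at the ~4 GB limit. The percentages in the table therefore describe allocation volume; an inuse_space view of the heap profile would probably be a more direct way to see which call sites retain memory.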

Expected Behavior

Unused cache and memory objects should be reclaimed according to keepDuration.

Memory footprint should remain stable across long-running workloads.

Actual Behavior

Memory grows continuously over time.

GC does not seem to release any of the retained memory regions.

Only a full restart of the worker process releases memory.

Steps to Reproduce

1. Run 6 workers with the above configuration.
2. Trigger ~100 builds/hour per worker continuously for 6 hours.
3. Observe memory usage per worker.
4. Memory increases steadily until the worker is OOM-killed or restarted.
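
To make the "observe memory usage" step repeatable, the following is a minimal sketch of one way to sample a worker's resident memory on Linux and estimate its growth rate. It assumes the buildkitd PID is known and that /proc is readable; the helper names (parse_vm_rss_kib, read_rss_kib, growth_kib_per_hour) are hypothetical and not part of BuildKit.

```python
import re


def parse_vm_rss_kib(status_text: str) -> int:
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def read_rss_kib(pid: int) -> int:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


def growth_kib_per_hour(samples):
    """Least-squares slope of (elapsed_seconds, rss_kib) samples, in KiB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return num / den * 3600
```

Collecting one sample every few minutes with read_rss_kib and passing the (elapsed_seconds, rss_kib) pairs to growth_kib_per_hour gives a single number to compare across workers and configurations. The kernel's OOM decision is based on cgroup memory accounting, which can include more than the daemon's own RSS (for example, memory used by build containers), so the cgroup's memory statistics are worth recording alongside it.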

Possible Causes

Leaked buffers in io.Copy or contenthash routines.

Cache metadata transactions (bbolt) not being garbage-collected.

Stale references in cacheManager or uncollected flightcontrol entries.

Request

Please investigate potential memory retention or leak in the OCI worker code path — particularly around:

io.CopyBuffer / contenthash.Checksum

cacheManager.Prune effectiveness

BoltDB transaction lifecycle and caching.

profile001.pdf

heap-alloc-top.log

heap-alloc.log
