Contributing guidelines and issue reporting guide
- I've read the contributing guidelines and wholeheartedly agree. I've also read the issue reporting guide.
Well-formed report checklist
- I have found a bug and the documentation does not mention anything about my problem
- I have found a bug and there are no open or closed issues related to my problem
- I have provided version/information about my environment and done my best to provide a reproducer
Description of bug
BuildKit version: 0.25.0 (commit 14d1ccb56dbc5e1748c73cda77af2a61a5c3603a)
Docker: 28.5.0
Mode: OCI worker (worker.oci)
Cluster: 6 workers, each handling ~100 builds/hour
Runtime: long-running, constant workload (~6h continuous execution)
GC config:
[worker.oci]
enabled = true
gc = true
max-parallelism = 8
memory = "4g"
keepDuration = "60m"
reservedSpace = "20%"
maxUsedSpace = "80%"
minFreeSpace = "10GB"
[history]
maxAge = 0
maxEntries = 0
[cdi]
disabled = true
[worker.containerd]
enabled = false
[[worker.oci.gcpolicy]]
all = false
filters = ["type==source.local", "type==exec.cachemount"]
keepDuration = "60m"
[[worker.oci.gcpolicy]]
all = true
keepDuration = "120m"
reservedSpace = "20%"
maxUsedSpace = "80%"
minFreeSpace = "5GB"Problem Description
Even with GC fully enabled and short keepDuration settings, BuildKit workers do not release memory or clean up cached items over time.
Memory usage grows steadily and reaches the limit (~4 GB) after roughly 6 hours of continuous builds, at which point the workers are killed by the kernel (OOM) or must be manually restarted.
The only effective workaround so far is manually restarting the workers, which immediately resets memory consumption.
Observations
- GC logs show cleanup events being triggered, but memory usage does not decrease (a client-side check is sketched just after this list).
- The problem persists across all workers and environments.
- No significant disk pressure is observed; the issue is isolated to memory retention.
- Setting maxAge and maxEntries in [history] had no effect.
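To separate "cache records are not being pruned" from "process memory is not being returned to the OS", it can help to query a worker directly. Below is a minimal sketch using the moby/buildkit Go client (client.New, DiskUsage, Prune); the socket address is an assumption, and the manual prune clears usable build cache, so this should only be run against a test worker.

```go
// prunecheck.go: compare BuildKit cache usage before and after a manual prune.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/moby/buildkit/client"
)

// totals sums record count and size over a DiskUsage result.
func totals(infos []*client.UsageInfo) (int, int64) {
	var size int64
	for _, u := range infos {
		size += u.Size
	}
	return len(infos), size
}

func main() {
	ctx := context.Background()

	// Address is an assumption; point it at the worker under test.
	c, err := client.New(ctx, "unix:///run/buildkit/buildkitd.sock")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer c.Close()

	before, err := c.DiskUsage(ctx)
	if err != nil {
		log.Fatalf("disk usage: %v", err)
	}
	n, size := totals(before)
	fmt.Printf("before prune: %d records, %.2f GiB\n", n, float64(size)/(1<<30))

	// Manual prune of unused records (similar in spirit to `buildctl prune`).
	// NOTE: this clears usable build cache; use a test worker.
	ch := make(chan client.UsageInfo)
	done := make(chan struct{})
	go func() {
		for range ch { // drain per-record progress events
		}
		close(done)
	}()
	pruneErr := c.Prune(ctx, ch)
	close(ch)
	<-done
	if pruneErr != nil {
		log.Fatalf("prune: %v", pruneErr)
	}

	after, err := c.DiskUsage(ctx)
	if err != nil {
		log.Fatalf("disk usage: %v", err)
	}
	n, size = totals(after)
	fmt.Printf("after prune:  %d records, %.2f GiB\n", n, float64(size)/(1<<30))
}
```

If the record count and total size drop after the prune while the worker's RSS stays flat, the retention would be in process memory rather than in the cache store, which would be consistent with the profiling data in the next section.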
Profiling Data
A memory profile (alloc_space) was captured after ~6h of runtime and shows persistent allocations not being reclaimed.
Key hot paths (from heap.allocs.prof and flamegraph):
| Function / Package | % of Total Alloc Space | Notes |
|---|---|---|
| `io.Copy` / `io.copyBuffer` | ~37% | Large persistent buffers (likely during layer export/copy) |
| `contenthash.(*cacheContext).Checksum` | ~18% | Repeated allocations for digest computation |
| `cache.(*cacheManager).Prune` | ~12% | GC called but memory not reclaimed |
| `bbolt.(*DB).Update` / `Commit` | ~17% | Retained pages from metadata updates |
| `fsutil.(*DiskWriter).processChange` | ~4–5% | Persistent diff-related allocations |
| `sync.(*Once).Do` / `doSlow` | ~40% cumulative | Possibly from flightcontrol routines |
Full profile excerpt:
“Showing nodes accounting for 65,993.65MB, 70.91% of 93,064.62MB total...”
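One caveat when reading the table above: alloc_space counts all allocations made since process start, so it keeps growing even when memory is freed; the inuse_space view is what indicates retained heap. Below is a small sketch for pulling both totals from a running worker, assuming buildkitd serves its debug/pprof handlers (for example when started with a debug address such as --debugaddr) and treating the URL as a placeholder.

```go
// heapcheck.go: summarize inuse_space vs alloc_space from a worker's heap profile.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/google/pprof/profile"
)

func main() {
	// Placeholder address; use the worker's actual debug endpoint.
	resp, err := http.Get("http://127.0.0.1:6060/debug/pprof/heap")
	if err != nil {
		log.Fatalf("fetch heap profile: %v", err)
	}
	defer resp.Body.Close()

	p, err := profile.Parse(resp.Body)
	if err != nil {
		log.Fatalf("parse profile: %v", err)
	}

	// Sum each sample type across all samples. For heap profiles the types are
	// typically alloc_objects, alloc_space, inuse_objects, inuse_space.
	totals := make([]int64, len(p.SampleType))
	for _, s := range p.Sample {
		for i, v := range s.Value {
			totals[i] += v
		}
	}
	for i, st := range p.SampleType {
		fmt.Printf("%-14s %14d %s\n", st.Type, totals[i], st.Unit)
	}
}
```

The same endpoint can also be loaded into go tool pprof with -inuse_space; tracking that number over the 6-hour window should show whether the hot paths above are actually retaining memory or only churning allocations.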
Expected Behavior
- Unused cache and memory objects should be reclaimed according to keepDuration.
- Memory footprint should remain stable across long-running workloads.
Actual Behavior
- Memory grows continuously over time.
- GC does not seem to release any of the retained memory regions.
- Only a full restart of the worker process releases memory.
Steps to Reproduce
1. Run 6 workers with the configuration above.
2. Trigger ~100 builds/hour continuously for 6 hours.
3. Observe memory usage per worker (a minimal RSS-logging sketch follows these steps).
4. Notice that memory steadily increases until OOM/restart.
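For step 3, a minimal way to record the growth curve is to log each worker's resident set size over time. The sketch below only assumes a Linux host and the buildkitd PID passed as an argument; everything else is standard library.

```go
// rsswatch.go: log the resident set size of a buildkitd process once a minute.
// Linux-only: reads VmRSS from /proc/<pid>/status.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"
)

// vmRSS returns the VmRSS line value (e.g. "4123456 kB") for the given PID.
func vmRSS(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:")), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("VmRSS not found")
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: rsswatch <buildkitd-pid>")
	}
	for {
		rss, err := vmRSS(os.Args[1])
		if err != nil {
			log.Fatalf("read rss: %v", err)
		}
		fmt.Printf("%s rss=%s\n", time.Now().Format(time.RFC3339), rss)
		time.Sleep(time.Minute)
	}
}
```

Plotting its output per worker should reproduce the steady climb toward the ~4 GB limit described above.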
Possible Causes
- Leaked buffers in io.Copy or contenthash routines (a buffer-pooling illustration follows this list).
- Cache metadata transactions (bbolt) not being garbage-collected.
- Stale references in cacheManager or uncollected flightcontrol entries.
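On the first point: io.Copy allocates a temporary 32 KiB buffer per call when neither side implements io.WriterTo / io.ReaderFrom, so heavy copy traffic naturally dominates alloc_space even without a leak. The sketch below is a generic illustration of that behavior and of the usual mitigation, a pooled buffer passed to io.CopyBuffer; it is not BuildKit's actual code, and pooledCopy and bufPool are names invented for this example. Sustained growth of inuse_space under io.copyBuffer would indicate buffers being retained rather than churned.

```go
// copybuf.go: illustrates why io.Copy shows up heavily in alloc_space and how a
// pooled buffer bounds those allocations. Generic example, not BuildKit code.
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
	"sync"
)

// bufPool hands out reusable 32 KiB copy buffers instead of letting each
// io.Copy call allocate its own.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 32*1024)
		return &b
	},
}

// pooledCopy is a drop-in replacement for io.Copy that reuses pooled buffers.
// (io.CopyBuffer still bypasses the buffer when src implements io.WriterTo or
// dst implements io.ReaderFrom, as in this toy example; plain stream copies
// between ordinary readers and writers will use it.)
func pooledCopy(dst io.Writer, src io.Reader) (int64, error) {
	bufp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bufp)
	return io.CopyBuffer(dst, src, *bufp)
}

func main() {
	var dst bytes.Buffer
	if _, err := pooledCopy(&dst, strings.NewReader("example payload")); err != nil {
		log.Fatal(err)
	}
	fmt.Println(dst.String())
}
```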
Request
Please investigate potential memory retention or a leak in the OCI worker code path, particularly around:
- io.CopyBuffer / contenthash.Checksum
- cacheManager.Prune effectiveness
- BoltDB transaction lifecycle and caching (a short bbolt illustration follows this list).
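On the last point, one bbolt property worth ruling out (a property of the library, not a claim about BuildKit's code): pages freed by write transactions stay pending and cannot be reused while any read transaction that predates them is still open, so a long-lived or leaked read transaction pins memory-mapped pages and grows the freelist. The standalone sketch below makes this visible via db.Stats().

```go
// bolttx.go: shows how an open read transaction keeps freed bbolt pages pending.
// Standalone example against a throwaway database, not BuildKit code.
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	path := "example.db"
	defer os.Remove(path)

	db, err := bolt.Open(path, 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hold a read transaction open, as a leaked reference would.
	readTx, err := db.Begin(false)
	if err != nil {
		log.Fatal(err)
	}

	// Perform writes; pages they replace stay pending because the old read
	// transaction may still need them.
	for i := 0; i < 1000; i++ {
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists([]byte("meta"))
			if err != nil {
				return err
			}
			return b.Put([]byte(fmt.Sprintf("key-%d", i)), make([]byte, 4096))
		})
		if err != nil {
			log.Fatal(err)
		}
	}

	s := db.Stats()
	fmt.Printf("open read txs=%d free pages=%d pending pages=%d\n",
		s.OpenTxN, s.FreePageN, s.PendingPageN)

	// Closing the read transaction lets the pending pages become reusable.
	if err := readTx.Rollback(); err != nil {
		log.Fatal(err)
	}
}
```

While the read transaction is open, OpenTxN stays at 1 and the pending page count keeps growing; after Rollback those pages become eligible for reuse. If something similar is happening around the worker's metadata store, it could explain the bbolt share of the profile.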