Contributing guidelines and issue reporting guide
- I've read the contributing guidelines and wholeheartedly agree. I've also read the issue reporting guide.
Well-formed report checklist
- I have found a bug and the documentation does not mention anything about my problem
- I have found a bug and there are no open or closed issues related to my problem
- I have provided version/information about my environment and done my best to provide a reproducer
Description of bug
BuildKit version: 0.25.0 (commit 14d1ccb56dbc5e1748c73cda77af2a61a5c3603a)
Docker: 28.5.0
Mode: OCI worker (worker.oci)
Cluster: 6 workers, each handling ~100 builds/hour
Runtime: long-running, constant workload (~6h continuous execution)
GC config:
[worker.oci]
enabled = true
gc = true
max-parallelism = 8
memory = "4g"
keepDuration = "60m"
reservedSpace = "20%"
maxUsedSpace = "80%"
minFreeSpace = "10GB"
[history]
maxAge = 0
maxEntries = 0
[cdi]
disabled = true
[worker.containerd]
enabled = false
[[worker.oci.gcpolicy]]
all = false
filters = ["type==source.local", "type==exec.cachemount"]
keepDuration = "60m"
[[worker.oci.gcpolicy]]
all = true
keepDuration = "120m"
reservedSpace = "20%"
maxUsedSpace = "80%"
minFreeSpace = "5GB"Problem Description
Even with GC fully enabled and short keepDuration settings, BuildKit workers do not release memory or clean up cached items over time.
Memory usage grows steadily and reaches the limit (~4 GB) after roughly 6 hours of continuous builds, at which point the workers are killed by the kernel (OOM) or must be manually restarted.
The only effective workaround so far is manually restarting the workers, which immediately resets memory consumption.
Observations
- GC logs show cleanup events being triggered, but memory usage does not decrease (a client-side check is sketched just after this list).
- The problem persists across all workers and environments.
- No significant disk pressure is observed; the issue is isolated to memory retention.
- Setting maxAge and maxEntries in [history] had no effect.
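To separate "cache records are not being pruned" from "process memory is not being returned to the OS", it can help to query a worker directly. Below is a minimal sketch using the moby/buildkit Go client (client.New, DiskUsage, Prune); the socket address is an assumption, and the manual prune clears usable build cache, so this should only be run against a test worker.

```go
// prunecheck.go: compare BuildKit cache usage before and after a manual prune.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/moby/buildkit/client"
)

// totals sums record count and size over a DiskUsage result.
func totals(infos []*client.UsageInfo) (int, int64) {
	var size int64
	for _, u := range infos {
		size += u.Size
	}
	return len(infos), size
}

func main() {
	ctx := context.Background()

	// Address is an assumption; point it at the worker under test.
	c, err := client.New(ctx, "unix:///run/buildkit/buildkitd.sock")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer c.Close()

	before, err := c.DiskUsage(ctx)
	if err != nil {
		log.Fatalf("disk usage: %v", err)
	}
	n, size := totals(before)
	fmt.Printf("before prune: %d records, %.2f GiB\n", n, float64(size)/(1<<30))

	// Manual prune of unused records (similar in spirit to `buildctl prune`).
	// NOTE: this clears usable build cache; use a test worker.
	ch := make(chan client.UsageInfo)
	done := make(chan struct{})
	go func() {
		for range ch { // drain per-record progress events
		}
		close(done)
	}()
	pruneErr := c.Prune(ctx, ch)
	close(ch)
	<-done
	if pruneErr != nil {
		log.Fatalf("prune: %v", pruneErr)
	}

	after, err := c.DiskUsage(ctx)
	if err != nil {
		log.Fatalf("disk usage: %v", err)
	}
	n, size = totals(after)
	fmt.Printf("after prune:  %d records, %.2f GiB\n", n, float64(size)/(1<<30))
}
```

If the record count and total size drop after the prune while the worker's RSS stays flat, the retention would be in process memory rather than in the cache store, which would be consistent with the profiling data in the next section.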
Profiling Data
A memory profile (alloc_space) was captured after ~6h of runtime and shows persistent allocations not being reclaimed.
Key hot paths (from heap.allocs.prof and flamegraph):
| Function / Package | % of Total Alloc Space | Notes |
|---|---|---|
| `io.Copy` / `io.copyBuffer` | ~37% | Large persistent buffers (likely during layer export/copy) |
| `contenthash.(*cacheContext).Checksum` | ~18% | Repeated allocations for digest computation |
| `cache.(*cacheManager).Prune` | ~12% | GC called but memory not reclaimed |
| `bbolt.(*DB).Update` / `Commit` | ~17% | Retained pages from metadata updates |
| `fsutil.(*DiskWriter).processChange` | ~4–5% | Persistent diff-related allocations |
| `sync.(*Once).Do` / `doSlow` | ~40% cumulative | Possibly from flightcontrol routines |
Full profile excerpt:
“Showing nodes accounting for 65,993.65MB, 70.91% of 93,064.62MB total...”
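One caveat when reading the table above: alloc_space counts all allocations made since process start, so it keeps growing even when memory is freed; the inuse_space view is what indicates retained heap. Below is a small sketch for pulling both totals from a running worker, assuming buildkitd serves its debug/pprof handlers (for example when started with a debug address such as --debugaddr) and treating the URL as a placeholder.

```go
// heapcheck.go: summarize inuse_space vs alloc_space from a worker's heap profile.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/google/pprof/profile"
)

func main() {
	// Placeholder address; use the worker's actual debug endpoint.
	resp, err := http.Get("http://127.0.0.1:6060/debug/pprof/heap")
	if err != nil {
		log.Fatalf("fetch heap profile: %v", err)
	}
	defer resp.Body.Close()

	p, err := profile.Parse(resp.Body)
	if err != nil {
		log.Fatalf("parse profile: %v", err)
	}

	// Sum each sample type across all samples. For heap profiles the types are
	// typically alloc_objects, alloc_space, inuse_objects, inuse_space.
	totals := make([]int64, len(p.SampleType))
	for _, s := range p.Sample {
		for i, v := range s.Value {
			totals[i] += v
		}
	}
	for i, st := range p.SampleType {
		fmt.Printf("%-14s %14d %s\n", st.Type, totals[i], st.Unit)
	}
}
```

The same endpoint can also be loaded into go tool pprof with -inuse_space; tracking that number over the 6-hour window should show whether the hot paths above are actually retaining memory or only churning allocations.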
Expected Behavior
- Unused cache and memory objects should be reclaimed according to keepDuration.
- Memory footprint should remain stable across long-running workloads.
Actual Behavior
- Memory grows continuously over time.
- GC does not seem to release any of the retained memory regions.
- Only a full restart of the worker process releases memory.
Steps to Reproduce
1. Run 6 workers with the configuration above.
2. Trigger ~100 builds/hour continuously for 6 hours.
3. Observe memory usage per worker (a minimal RSS-logging sketch follows these steps).
4. Notice that memory steadily increases until OOM/restart.
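For step 3, a minimal way to record the growth curve is to log each worker's resident set size over time. The sketch below only assumes a Linux host and the buildkitd PID passed as an argument; everything else is standard library.

```go
// rsswatch.go: log the resident set size of a buildkitd process once a minute.
// Linux-only: reads VmRSS from /proc/<pid>/status.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"
)

// vmRSS returns the VmRSS line value (e.g. "4123456 kB") for the given PID.
func vmRSS(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:")), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("VmRSS not found")
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: rsswatch <buildkitd-pid>")
	}
	for {
		rss, err := vmRSS(os.Args[1])
		if err != nil {
			log.Fatalf("read rss: %v", err)
		}
		fmt.Printf("%s rss=%s\n", time.Now().Format(time.RFC3339), rss)
		time.Sleep(time.Minute)
	}
}
```

Plotting its output per worker should reproduce the steady climb toward the ~4 GB limit described above.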
Possible Causes
- Leaked buffers in io.Copy or contenthash routines (a buffer-pooling illustration follows this list).
- Cache metadata transactions (bbolt) not being garbage-collected.
- Stale references in cacheManager or uncollected flightcontrol entries.
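On the first point: io.Copy allocates a temporary 32 KiB buffer per call when neither side implements io.WriterTo / io.ReaderFrom, so heavy copy traffic naturally dominates alloc_space even without a leak. The sketch below is a generic illustration of that behavior and of the usual mitigation, a pooled buffer passed to io.CopyBuffer; it is not BuildKit's actual code, and pooledCopy and bufPool are names invented for this example. Sustained growth of inuse_space under io.copyBuffer would indicate buffers being retained rather than churned.

```go
// copybuf.go: illustrates why io.Copy shows up heavily in alloc_space and how a
// pooled buffer bounds those allocations. Generic example, not BuildKit code.
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
	"sync"
)

// bufPool hands out reusable 32 KiB copy buffers instead of letting each
// io.Copy call allocate its own.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 32*1024)
		return &b
	},
}

// pooledCopy is a drop-in replacement for io.Copy that reuses pooled buffers.
// (io.CopyBuffer still bypasses the buffer when src implements io.WriterTo or
// dst implements io.ReaderFrom, as in this toy example; plain stream copies
// between ordinary readers and writers will use it.)
func pooledCopy(dst io.Writer, src io.Reader) (int64, error) {
	bufp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bufp)
	return io.CopyBuffer(dst, src, *bufp)
}

func main() {
	var dst bytes.Buffer
	if _, err := pooledCopy(&dst, strings.NewReader("example payload")); err != nil {
		log.Fatal(err)
	}
	fmt.Println(dst.String())
}
```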
Request
Please investigate potential memory retention or a leak in the OCI worker code path, particularly around:
- io.CopyBuffer / contenthash.Checksum
- cacheManager.Prune effectiveness
- BoltDB transaction lifecycle and caching (a short bbolt illustration follows this list).
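On the last point, one bbolt property worth ruling out (a property of the library, not a claim about BuildKit's code): pages freed by write transactions stay pending and cannot be reused while any read transaction that predates them is still open, so a long-lived or leaked read transaction pins memory-mapped pages and grows the freelist. The standalone sketch below makes this visible via db.Stats().

```go
// bolttx.go: shows how an open read transaction keeps freed bbolt pages pending.
// Standalone example against a throwaway database, not BuildKit code.
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	path := "example.db"
	defer os.Remove(path)

	db, err := bolt.Open(path, 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hold a read transaction open, as a leaked reference would.
	readTx, err := db.Begin(false)
	if err != nil {
		log.Fatal(err)
	}

	// Perform writes; pages they replace stay pending because the old read
	// transaction may still need them.
	for i := 0; i < 1000; i++ {
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists([]byte("meta"))
			if err != nil {
				return err
			}
			return b.Put([]byte(fmt.Sprintf("key-%d", i)), make([]byte, 4096))
		})
		if err != nil {
			log.Fatal(err)
		}
	}

	s := db.Stats()
	fmt.Printf("open read txs=%d free pages=%d pending pages=%d\n",
		s.OpenTxN, s.FreePageN, s.PendingPageN)

	// Closing the read transaction lets the pending pages become reusable.
	if err := readTx.Rollback(); err != nil {
		log.Fatal(err)
	}
}
```

While the read transaction is open, OpenTxN stays at 1 and the pending page count keeps growing; after Rollback those pages become eligible for reuse. If something similar is happening around the worker's metadata store, it could explain the bbolt share of the profile.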