Skip to content

Conversation

@whoiskatrin
Copy link
Collaborator

@whoiskatrin whoiskatrin commented Dec 11, 2025

Fix Flaky PID Retrieval
Problem: startData.pid was sometimes undefined in tests. Root cause: race condition between bash script writing PID file and Node.js polling for it (1s timeout too tight under load).
Solution: Replace file polling with FIFO-based synchronization:

  1. Create a PID notification FIFO before sending the command
  2. Bash script writes PID to the FIFO after backgrounding
  3. Node.js does a blocking read from FIFO (guarantees synchronization)
  4. Falls back to file polling if FIFO fails

Also fixed: processRecord.pid wasn't being updated in the callback, only the store was updated.

@changeset-bot
Copy link

changeset-bot bot commented Dec 11, 2025

🦋 Changeset detected

Latest commit: 8554f1c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

claude[bot]

This comment was marked as outdated.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Dec 11, 2025

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@296

commit: 8554f1c

@github-actions
Copy link
Contributor

github-actions bot commented Dec 11, 2025

🐳 Docker Images Published

Default:

FROM cloudflare/sandbox:0.0.0-pr-296-6c99b63

With Python:

FROM cloudflare/sandbox:0.0.0-pr-296-6c99b63-python

With OpenCode:

FROM cloudflare/sandbox:0.0.0-pr-296-6c99b63-opencode

Version: 0.0.0-pr-296-6c99b63

Use the -python variant if you need Python code execution, or -opencode for the variant with OpenCode AI coding agent pre-installed.


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-296-6c99b63 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 20162754523 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-296-6c99b63 cat /container-server/sandbox > sandbox && chmod +x sandbox

claude[bot]

This comment was marked as outdated.

@whoiskatrin whoiskatrin changed the title Fix flaky tests add some resilience primitives Dec 11, 2025
claude[bot]

This comment was marked as outdated.

@whoiskatrin whoiskatrin changed the title add some resilience primitives Fix Flaky PID Retrieval Dec 12, 2025
@whoiskatrin whoiskatrin marked this pull request as ready for review December 12, 2025 09:34
Updated the changeset to fix a race condition during PID retrieval.
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

All previous threads reference files not in this PR (circuit-breaker.ts, request-queue.ts, base-client.ts) - resolving those as not applicable.

This PR fixes a real race condition in PID retrieval by replacing file polling with FIFO-based synchronization. The core approach is sound, but there are critical issues that need fixing.

Critical Issues

1. Missing FIFO Cleanup in Error Path (session.ts)

The execStream method creates the pidPipe FIFO at the start, but the error handler doesn't clean it up. If an error occurs after FIFO creation, it leaks forever.

Look at the catch block around line 548-567 in session.ts - it calls cleanupCommandFiles(logFile, exitCodeFile) but pidPipe isn't included.

Fix: Add pidPipe parameter to cleanupCommandFiles() and call cleanup in both success and error paths.

2. Misleading Error Messages (session.ts:824)

When directory change fails, bash writes empty string to the pipe:

else
  printf '\x02\x02\x02%s\n' "Failed to change directory..." >> "$log"
  EXIT_CODE=1
  echo "" > ${safePidPipe}  # <-- Empty string = timeout in TS
fi

TypeScript's readPidFromPipe returns undefined for empty string (parseInt of empty = NaN), which then causes waitForPidViaPipe to log "PID pipe read timed out" - this is misleading because the command failed, it didn't timeout.

Fix: Write a sentinel value (e.g., "-1") to the pipe on error, parse it explicitly in TypeScript, and propagate the actual error message.

3. Missing Validation for Empty Read (session.ts:1226)

The readPidFromPipe method at line 494-495 doesn't check if bytesRead === 0, which would indicate EOF without data:

const { bytesRead } = await fd.read(buffer, 0, 64, null);
const content = buffer.toString('utf8', 0, bytesRead).trim();  // <-- No check for bytesRead === 0

Fix: Add validation:

const { bytesRead } = await fd.read(buffer, 0, 64, null);
if (bytesRead === 0) {
  return undefined;  // EOF without data
}
const content = buffer.toString('utf8', 0, bytesRead).trim();

What's Good

Process service fix (process-service.ts:154) is excellent - processRecord.pid wasn't being updated, only the store. This would break killCommand() and liveness checks. Good catch!

Test improvement (comprehensive-workflow.test.ts:516-529) replacing fixed sleep with waitForLog is a best practice that should be applied to other E2E tests. This actually waits for completion rather than guessing.

FIFO blocking semantics are correct - Node.js fs.open() properly handles FIFO blocking on Linux, and the 64-byte buffer is sufficient for PIDs (max 7 digits).

Fallback to file polling is good defense-in-depth that maintains backward compatibility if FIFO operations fail.

Verdict

Request changes - fix the three critical issues above before merge. Once addressed, this will be a high-quality improvement that eliminates real test flakiness. The FIFO approach is fundamentally sound.

Copy link
Member

@ghostwriternr ghostwriternr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! The timing based synchronisation was super flaky, this should definitely be more robust.

@ghostwriternr ghostwriternr merged commit 8a5d275 into main Dec 12, 2025
13 checks passed
@ghostwriternr ghostwriternr deleted the fix-flaky-tests branch December 12, 2025 12:52
@github-actions github-actions bot mentioned this pull request Dec 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants