Skip to content

Conversation

@lidel
Copy link
Member

@lidel lidel commented Aug 23, 2025

Warning

Do not merge, using this PR as a sandbox for now, to see what fails on CI but rarely locally.

This PR is exploration into making test/cli/harness more robust when things go bad with spawned processes and we run on self-hosted runners.

  • Process tracking: new system tracks all spawned processes and force-kills them on test end/panic
  • Connection fixes: validate peers array before access, use serial connections to avoid TLS issues
  • Daemon reliability: add retry logic for transient startup failures, verify daemon readiness
    • 👉 unsure if this is the right thing to do -- i'm ok with removing this from this PR (important part is tracking dameons and ensuring we kill them).
  • Auto-cleanup: background processes started via harness are automatically registered for cleanup

Impact:

  • No more zombie ipfs daemon processes left after test panics
  • More reliable test execution in CI
  • Cleaner test harness code with better error handling

lidel added 12 commits August 22, 2025 17:45
check if peers slice is non-empty before accessing first element
add retry mechanism for connection verification with 1s timeout
improve error message when no peers are found
add process_tracker.go with daemon PID tracking
register/unregister daemons in StartDaemon/StopDaemon
track all spawned processes for later cleanup
add defer/recover in NewT to ensure cleanup on panic
force kill daemons even if normal cleanup fails
register double cleanup to handle edge cases
make Cleanup() more resilient to failures
fix compilation error by using local helper instead of os.IsProcessDone
run gofmt on harness package
add missing newline at end of file
tested process tracker with race detector - no issues found
confirmed no zombie daemons left after test panics
cleanup mechanism works correctly even with concurrent operations
use ConnectAndWait for reliable connections with 10s timeout
collect and report all connection failures with details
verify expected peer count and protocols
use errgroup for better concurrent error handling
simplify by removing unused env variable and method
disable MDNS to prevent test interference when running in parallel
add retry logic for daemon startup to handle transient failures
replace fixed sleep with proper readiness check
remove unnecessary parallel execution of subtests
follow Go idioms by keeping internal methods unexported
- automatically register any process started with RunFunc that doesn't wait
- add RegisterBackgroundProcess for tests using raw exec.Command
- ensure all background processes are killed on test failure/panic
use serial Connect() implementation from PR #10933 to avoid TLS handshake issues
- fix process count logging bug
- use errors.Is() for error checking
- extract daemon retry logic to helper methods
- consolidate cleanup paths
@lidel lidel added the skip/changelog This change does NOT require a changelog entry label Aug 23, 2025
@lidel lidel changed the title fix(ci): avoid panic and cleanup zombie daemons fix(ci): cleanup zombie daemons on panics Aug 23, 2025
@gammazero
Copy link
Contributor

gammazero commented Aug 25, 2025

add retry logic for transient startup failures

I want to be more careful about not masking any startup failures, even if transient. When we identify a specific failure condition, like a TCP port is still in use but will not be at the next retry, then we can retry on detecting that specific case. We should not retry on any unknown failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip/changelog This change does NOT require a changelog entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants