-
Notifications
You must be signed in to change notification settings - Fork 12
Improve grpc reliability #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Enhanced startup script with proper health checks and retry logic - Added process monitoring with automatic restart capabilities - Optimized Dagster configuration for single-container deployment - Enhanced workspace configuration with gRPC keepalive settings - Reduced concurrency limits to prevent resource contention - Extended timeouts for better cold start handling Resolves gRPC connectivity issues and 'unix socket not found' errors
- Added comprehensive gRPC connectivity debugger script - Enhanced Dockerfile with network debugging tools (net-tools, netcat) - Added process status, port status, and log analysis - Includes direct connection testing and system resource monitoring - Provides actionable troubleshooting guidance Facilitates faster diagnosis of deployment issues
|
Caution Review failedThe pull request is closed. WalkthroughThe updates introduce improved configuration and operational resilience for a Dagster deployment in a single-container Fly.io environment. YAML files are adjusted for lower concurrency and extended timeouts, the Dockerfile adds debugging tools and a new script, and the startup script is refactored for robust process management, health checks, retries, and graceful shutdown. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant StartScript
participant CodeServer
participant WebServer
participant Daemon
participant Dashboard
participant Nginx
User->>StartScript: Run start.sh
StartScript->>CodeServer: Start with retries
StartScript->>CodeServer: Check gRPC health
alt CodeServer healthy
StartScript->>WebServer: Start with retries
StartScript->>Daemon: Start with retries
StartScript->>Dashboard: Start with retries
StartScript->>Nginx: Start (no retry)
loop Monitoring
StartScript->>CodeServer: Check process
StartScript->>WebServer: Check process
StartScript->>Daemon: Check process
StartScript->>Dashboard: Check process
alt Any process down
StartScript->>Process: Restart with retries
end
end
else CodeServer unhealthy
StartScript-->>User: Exit with error
end
User->>StartScript: Send SIGTERM/SIGINT
StartScript->>All Services: Graceful shutdown
Estimated code review effort4 (~90 minutes) Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (5)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request enhances the reliability and stability of a Dagster deployment on Fly.io by implementing comprehensive improvements to gRPC connectivity, process management, and resource optimization. The changes focus on making the single-container deployment more robust through better error handling, monitoring, and configuration tuning.
Key changes include:
- Enhanced startup script with retry mechanisms, health checks, and automatic process restart capabilities
- Addition of a comprehensive gRPC debugging tool for troubleshooting connectivity issues
- Optimized Dagster configuration with reduced concurrency limits and improved timeout settings
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/deployment/start.sh | Major rewrite adding retry logic, health checks, graceful shutdown, and automatic process monitoring/restart |
| scripts/deployment/debug_grpc.sh | New debugging script for diagnosing gRPC connectivity issues with comprehensive system checks |
| docker/Dockerfile.fly | Added network debugging tools (net-tools, netcat-openbsd) and included the new debug script |
| dagster_home/workspace.yaml | Enhanced gRPC connection settings with keepalive configurations and increased timeouts |
| dagster_fly.yaml | Reduced concurrency limits and worker counts, increased timeouts and retry counts for better stability |
| nginx -t && nginx -g "daemon off;" & | ||
| NGINX_PID=$! |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nginx startup doesn't use the retry mechanism like other services, creating inconsistency. Consider using start_process_with_retry for nginx as well, or add a comment explaining why it's handled differently.
| nginx -t && nginx -g "daemon off;" & | |
| NGINX_PID=$! | |
| NGINX_PID=$(start_process_with_retry "Nginx" "nginx -t && nginx -g 'daemon off;'" "/tmp/nginx.log") | |
| if [ $? -ne 0 ]; then | |
| echo "❌ Failed to start nginx, exiting" | |
| exit 1 | |
| fi |
| CODE_SERVER_PID=$(start_process_with_retry "Code Server" "dagster code-server start -h 0.0.0.0 -p 4000 -f anomstack/main.py" "/tmp/code_server.log") | ||
| fi | ||
|
|
||
| if ! kill -0 $WEBSERVER_PID 2>/dev/null; then | ||
| echo "❌ Webserver crashed, restarting..." | ||
| WEBSERVER_PID=$(start_process_with_retry "Webserver" "dagster-webserver -h 0.0.0.0 -p 3000 -w /opt/dagster/dagster_home/workspace.yaml" "/tmp/webserver.log") |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The monitoring loop doesn't handle the case where start_process_with_retry fails and returns a non-zero exit code instead of a PID. This could cause the kill -0 command to fail with an invalid PID, potentially breaking the monitoring logic.
| CODE_SERVER_PID=$(start_process_with_retry "Code Server" "dagster code-server start -h 0.0.0.0 -p 4000 -f anomstack/main.py" "/tmp/code_server.log") | |
| fi | |
| if ! kill -0 $WEBSERVER_PID 2>/dev/null; then | |
| echo "❌ Webserver crashed, restarting..." | |
| WEBSERVER_PID=$(start_process_with_retry "Webserver" "dagster-webserver -h 0.0.0.0 -p 3000 -w /opt/dagster/dagster_home/workspace.yaml" "/tmp/webserver.log") | |
| CODE_SERVER_PID=$(start_process_with_retry "Code Server" "dagster code-server start -h 0.0.0.0 -p 4000 -f anomstack/main.py" "/tmp/code_server.log") | |
| if ! [[ $CODE_SERVER_PID =~ ^[0-9]+$ ]]; then | |
| echo "❌ Failed to restart Code Server: Invalid PID returned" | |
| CODE_SERVER_PID="" | |
| continue | |
| fi | |
| fi | |
| if ! kill -0 $WEBSERVER_PID 2>/dev/null; then | |
| echo "❌ Webserver crashed, restarting..." | |
| WEBSERVER_PID=$(start_process_with_retry "Webserver" "dagster-webserver -h 0.0.0.0 -p 3000 -w /opt/dagster/dagster_home/workspace.yaml" "/tmp/webserver.log") | |
| if ! [[ $WEBSERVER_PID =~ ^[0-9]+$ ]]; then | |
| echo "❌ Failed to restart Webserver: Invalid PID returned" | |
| WEBSERVER_PID="" | |
| continue | |
| fi |
| # Check ports | ||
| echo "🌐 Port Status:" | ||
| echo "Port 4000 (code server):" | ||
| if netstat -tuln 2>/dev/null | grep ":4000" > /dev/null; then |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script uses netstat but the Dockerfile adds netcat-openbsd. Consider using ss (which is more modern and likely available) or nc consistently, as netstat may not be available in minimal container images.
| grpc_keepalive_timeout_ms: 10000 # Wait 10 seconds for keepalive response | ||
| grpc_keepalive_permit_without_calls: true # Allow keepalive without active calls | ||
| grpc_http2_max_pings_without_data: 0 # No limit on pings without data | ||
| grpc_http2_min_ping_interval_without_data_ms: 300000 # 5 minutes between pings |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The 5-minute ping interval seems excessive for a single-container deployment and conflicts with the 30-second keepalive time. Consider reducing this value to maintain consistency with other keepalive settings.
| grpc_http2_min_ping_interval_without_data_ms: 300000 # 5 minutes between pings | |
| grpc_http2_min_ping_interval_without_data_ms: 30000 # 30 seconds between pings |
📊 Test Coverage ReportCoverage: 59% (yellow) ✅ Coverage maintained or improved!
|
This pull request introduces several updates to improve the stability, reliability, and debugging capabilities of the Dagster deployment in a Fly.io single-container environment. Key changes include adjustments to configuration values for resource optimization, enhanced monitoring and retry mechanisms, and the addition of debugging tools.
Configuration Updates for Stability and Resource Optimization:
max_concurrent_runsfrom 12 to 8 and lowered concurrency limits to prevent resource contention in thedagster_fly.yamlfile (max_concurrent_runsandtag_concurrency_limits).run_retries.max_retriesfrom 1 to 2 for improved reliability in thedagster_fly.yamlfile.run_monitoringtimeouts and polling intervals, and reduced worker counts for schedules and sensors to better align with the single-container environment.Enhanced Debugging and Monitoring:
debug_grpc.shscript to facilitate troubleshooting of gRPC connectivity issues, including process status checks, port monitoring, and workspace validation (scripts/deployment/debug_grpc.sh).code_serversconfiguration to improve debugging capabilities indagster_fly.yaml.Improvements to Deployment Scripts:
start.shscript to include retry mechanisms for starting critical services (e.g., code server, webserver, daemon) and added health checks for the code server (scripts/deployment/start.sh).scripts/deployment/start.sh).Dockerfile Enhancements:
net-toolsandnetcat-openbsdto the Docker image for debugging network issues (docker/Dockerfile.fly).debug_grpc.shscript in the Docker image and ensured proper permissions for execution (docker/Dockerfile.fly).gRPC Connection Optimization:
workspace.yamlwith enhanced gRPC connection settings, including heartbeat timeouts, message size limits, and keepalive configurations for better stability (dagster_home/workspace.yaml).Summary by CodeRabbit
New Features
Improvements
Chores