
Conversation

Owner

@andrewm4894 andrewm4894 commented Jul 29, 2025

This pull request introduces a comprehensive disk cleanup system for Fly.io instances, including new Makefile targets, a standalone Python script, and a Dagster job for automated cleanup. These changes aim to optimize disk space usage and improve maintenance workflows. The most important changes are grouped below:

New Disk Cleanup Functionality

  • Added three new Makefile targets: fly-cleanup, fly-cleanup-preview, and fly-cleanup-aggressive, each providing a different level of disk cleanup for Fly.io instances. (Makefile, lines 419–438)
  • Introduced a standalone script, cleanup_disk_space.py, for managing disk space, supporting dry-run and aggressive cleanup modes. (scripts/maintenance/cleanup_disk_space.py, lines 1–262)
  • Enhanced documentation with detailed usage instructions for the cleanup script and Makefile targets. (Makefile.md [1]; scripts/maintenance/README.md [2])
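The three modes mentioned above (normal, dry-run, aggressive) suggest a small command-line interface. A minimal sketch of such an argument parser follows; the flag names match the modes described in this PR, but the exact interface of cleanup_disk_space.py is an assumption:

```python
import argparse

def parse_args(argv=None):
    """Parse cleanup modes; flag names mirror the documented --dry-run/--aggressive modes."""
    parser = argparse.ArgumentParser(description="Clean up disk space on a Fly.io volume")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview what would be deleted without removing anything")
    parser.add_argument("--aggressive", action="store_true",
                        help="use shorter retention windows for emergency cleanup")
    return parser.parse_args(argv)

# Example invocation: preview mode only.
args = parse_args(["--dry-run"])
```

With `store_true` actions, omitting a flag leaves it False, so the normal mode is simply the absence of both flags.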

Automated Disk Cleanup in Dagster

  • Created a new Dagster job, cleanup_job, with operations for cleaning old artifacts, logs, and metrics, as well as reporting disk usage before and after cleanup. The job is scheduled to run every 2 hours. (anomstack/jobs/cleanup.py, lines 1–187)
  • Integrated the new cleanup job and schedule into the Dagster pipeline by updating anomstack/main.py. (anomstack/main.py, [1] [2] [3])

Fly.io Optimization

  • Adjusted Fly.io retention policies in dagster_fly.yaml for more aggressive purging of schedules and sensor data, reducing retention periods to as short as 6 hours. (dagster_fly.yaml, lines 35–40)
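A retention block implementing this might look like the following. The key layout follows Dagster's documented instance retention settings; the exact contents of dagster_fly.yaml are an assumption, and the values shown reflect the follow-up commits in this PR noting that one day is Dagster's minimum retention period:

```yaml
retention:
  schedule:
    purge_after_days:
      skipped: 1   # later raised from 12 hours: 1 day is Dagster's minimum
      failure: 1   # kept at 1 day for debugging
      success: 1   # later raised from 6 hours to the 1-day minimum
  sensor:
    purge_after_days:
      skipped: 1
      failure: 1
      success: 1
```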

Testing Updates

  • Updated test cases in tests/test_main.py to account for the addition of the cleanup job and schedule. (tests/test_main.py, lines 12–24)

Summary by CodeRabbit

  • New Features

    • Introduced automated disk cleanup jobs and schedules for managing storage, log files, and database metrics (temporarily disabled).
    • Added a standalone script for manual or scheduled disk cleanup with normal, aggressive, and dry-run modes.
    • Provided Makefile targets for running disk cleanup operations on Fly.io instances.
    • Enhanced deployment script with improved cache busting for Docker builds.
  • Documentation

    • Added detailed usage instructions and safety notes for the new cleanup script and Makefile targets.
  • Bug Fixes

    • Adjusted test expectations to reflect the temporary disabling of cleanup jobs and schedules.
  • Chores

    • Updated retention policies to reduce data storage durations for improved disk management.
    • Improved deployment startup script robustness by allowing continued execution despite service startup or health check failures.

- Reduce schedule run retention to 1 day (was 2 days)
- Set successful runs to 6 hours retention (was 1 day)
- Set skipped runs to 12 hours retention (was 1 day)
- Keep failed runs at 1 day for debugging

This addresses the root cause of disk space issues by reducing
the time artifacts are kept on the limited Fly.io volume.
- New cleanup.py with comprehensive disk management operations
- Removes artifacts older than 6 hours automatically
- Cleans log files older than 24 hours
- Removes database metrics older than 90 days with VACUUM
- Reports disk usage before/after cleanup operations
- Scheduled to run every 2 hours automatically
- Integrated into main Dagster pipeline

This prevents disk space issues by automatically maintaining
the Fly.io volume with regular cleanup operations.
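The core of the artifact op described above is age-based removal by modification time. A standard-library sketch follows; the 6-hour cutoff matches the description, while the function name and flat directory layout are illustrative, not the actual cleanup.py implementation:

```python
import os
import shutil
import time

def remove_old_artifacts(root: str, max_age_hours: float = 6.0, dry_run: bool = False) -> list:
    """Remove top-level entries under `root` whose mtime is older than the cutoff."""
    cutoff = time.time() - max_age_hours * 3600
    removed = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        try:
            if os.path.getmtime(path) < cutoff:
                if not dry_run:
                    if os.path.isdir(path):
                        shutil.rmtree(path)
                    else:
                        os.remove(path)
                removed.append(name)
        except OSError:
            continue  # entry vanished mid-scan; skip it
    return removed
```

Returning the list of removed names supports the before/after reporting the job does, and the `dry_run` flag gives the same preview behavior as the standalone script.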
- New cleanup_disk_space.py script for manual/emergency cleanup
- Supports --dry-run mode for safe preview of cleanup actions
- Supports --aggressive mode for emergency situations
- Detailed reporting of disk usage and cleanup results
- Makefile commands for easy Fly.io cleanup:
  - make fly-cleanup-preview: safe preview mode
  - make fly-cleanup: normal cleanup
  - make fly-cleanup-aggressive: emergency cleanup

Provides manual tools for disk space management alongside
the automated cleanup job for maximum flexibility.
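The Makefile targets presumably wrap a flyctl SSH invocation against the running instance. The sketch below echoes the command instead of executing it, so flyctl is not required; the remote script path and the FLY_APP guard mirror the PR's description, but the exact command shape is an assumption:

```shell
fly_cleanup() {
    # Refuse to run without a target app, mirroring the Makefile's FLY_APP guard.
    if [ -z "${FLY_APP:-}" ]; then
        echo "error: FLY_APP is not set" >&2
        return 1
    fi
    # Echo the command instead of executing it, so flyctl is not needed here.
    echo flyctl ssh console -a "$FLY_APP" -C \
        "python /app/scripts/maintenance/cleanup_disk_space.py $*"
}

FLY_APP=my-anomstack-app
preview_cmd=$(fly_cleanup --dry-run)
```

Passing the mode flags through `$*` lets the same wrapper serve all three targets (`--dry-run`, no flag, `--aggressive`).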
- Add comprehensive cleanup_disk_space.py documentation to maintenance README
- Document new Fly.io cleanup Makefile commands in Makefile.md
- Update test expectations to account for new cleanup job/schedule
- Include usage examples and safety features for all cleanup tools

Provides complete documentation for the disk space management
solution including automated and manual cleanup options.
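The 90-day metrics purge with VACUUM described above can be sketched against an in-memory SQLite database; the table and column names here are illustrative stand-ins, not Anomstack's actual schema:

```python
import sqlite3
import time

def purge_old_metrics(conn: sqlite3.Connection, max_age_days: int = 90) -> int:
    """Delete metric rows older than the cutoff, then VACUUM to reclaim disk space."""
    cutoff = time.time() - max_age_days * 86400
    cur = conn.execute("DELETE FROM metrics WHERE recorded_at < ?", (cutoff,))
    conn.commit()  # VACUUM must run outside an open transaction
    conn.execute("VACUUM")
    return cur.rowcount

conn = sqlite3.connect(":memory:")  # stand-in for the on-disk metrics DB
conn.execute("CREATE TABLE metrics (name TEXT, value REAL, recorded_at REAL)")
now = time.time()
conn.execute("INSERT INTO metrics VALUES ('m1', 1.0, ?)", (now,))
conn.execute("INSERT INTO metrics VALUES ('m2', 2.0, ?)", (now - 100 * 86400,))
deleted = purge_old_metrics(conn)
```

The commit-before-VACUUM ordering matters: SQLite refuses to VACUUM inside a transaction, and the DELETE alone only marks pages free without shrinking the file.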
@andrewm4894 andrewm4894 requested a review from Copilot July 29, 2025 10:14
Contributor

coderabbitai bot commented Jul 29, 2025

"""

Walkthrough

This update introduces a new disk space cleanup system for Fly.io instances, including a standalone Python script, new Makefile targets, Dagster job and schedule definitions, and corresponding documentation. Retention policies are tightened, and the main application temporarily disables integration of the cleanup job and schedule. Tests are updated with comments reflecting this temporary disablement.

Changes

Cohort / File(s) Change Summary
Makefile targets & documentation
Makefile, Makefile.md
Added three new Makefile targets (fly-cleanup, fly-cleanup-preview, fly-cleanup-aggressive) for remote disk cleanup on Fly.io; updated the Makefile documentation with a section describing these targets and their usage. Also enhanced the existing fresh-deploy Fly targets to prune the Docker builder cache.
Standalone cleanup script & docs
scripts/maintenance/cleanup_disk_space.py, scripts/maintenance/README.md
Introduced a standalone Python script for disk cleanup with normal, aggressive, and dry-run modes; added detailed documentation covering features, usage, and safety.
Dagster cleanup job & schedule (disabled)
anomstack/jobs/cleanup.py, anomstack/main.py
Added a Dagster job and schedule for disk cleanup, including ops for artifact, log, and metrics cleanup and disk usage reporting; temporarily commented out their import and inclusion in the main job and schedule lists.
Retention policy configuration
dagster_fly.yaml
Shortened retention periods for schedules and sensor-related data, reducing disk usage duration.
Test adjustments
tests/test_main.py
Added comments to test assertions explaining that cleanup job and schedule are currently disabled; no logic changes made.
Startup script tolerance improvement
scripts/deployment/start.sh
Removed set -e to allow continued execution despite errors; modified health check and nginx startup logic to log warnings and continue rather than exit on failure.
Docker build cache busting
docker/Dockerfile.fly, scripts/deployment/deploy_fly.sh
Added a layer to write cache bust argument to a file to force cache invalidation; enhanced deployment script to generate a more unique cache bust value and explicitly specify Dockerfile path with conditional no-cache usage.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Makefile
    participant Fly.io Instance
    participant CleanupScript

    User->>Makefile: make fly-cleanup[-preview|-aggressive]
    Makefile->>Fly.io Instance: SSH & run cleanup_disk_space.py [flags]
    Fly.io Instance->>CleanupScript: Execute with appropriate mode
    CleanupScript->>Fly.io Instance: Clean artifacts, logs, DB, report disk usage
    Fly.io Instance-->>User: Output cleanup summary
sequenceDiagram
    participant DagsterScheduler
    participant CleanupJob
    participant Artifacts
    participant Logs
    participant MetricsDB

    DagsterScheduler->>CleanupJob: Trigger on schedule
    CleanupJob->>Artifacts: Remove old artifacts
    CleanupJob->>Logs: Remove old logs
    CleanupJob->>MetricsDB: Delete old metrics, VACUUM
    CleanupJob->>DagsterScheduler: Report disk usage before/after

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–20 minutes

Poem

🐇
A rabbit hops through disk and log,
Tidying bytes through mist and fog.
With scripts and jobs, the cleanup’s neat,
Old files gone, new space a treat!
Schedules paused, retention’s tight,
The burrow’s clean—oh, what delight!
🧹✨
"""



@andrewm4894 andrewm4894 marked this pull request as ready for review July 29, 2025 10:15


@github-actions

📊 Test Coverage Report

Coverage: 58% (yellow)

✅ Coverage maintained or improved!

💡 See detailed coverage report in the tests README

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (13)
Makefile.md (3)

568-568: Fix markdown heading format.

Use proper heading syntax instead of emphasis for better document structure.

-#### `make fly-cleanup-preview`
+### `make fly-cleanup-preview`

579-579: Fix markdown heading format.

Use proper heading syntax instead of emphasis for better document structure.

-#### `make fly-cleanup`
+### `make fly-cleanup`

591-591: Fix markdown heading format.

Use proper heading syntax instead of emphasis for better document structure.

-#### `make fly-cleanup-aggressive`
+### `make fly-cleanup-aggressive`
anomstack/jobs/cleanup.py (6)

9-11: Remove unused imports.

Clean up the unused imports flagged by static analysis.

-from pathlib import Path
-
-from dagster import DefaultScheduleStatus, JobDefinition, ScheduleDefinition, job, op, get_dagster_logger
+from dagster import DefaultScheduleStatus, ScheduleDefinition, job, op, get_dagster_logger

71-71: Fix unused loop variable.

Rename the unused dirs variable to follow Python conventions.

-        for root, dirs, files in os.walk(log_dir):
+        for root, _dirs, files in os.walk(log_dir):

19-19: Consider making the artifacts path configurable.

The hardcoded path /data/artifacts/storage might not be suitable for all environments. Consider making this configurable through environment variables or Dagster configuration.

artifacts_path = os.getenv("DAGSTER_ARTIFACTS_PATH", "/data/artifacts/storage")

38-48: Potential performance issue with directory size calculation.

The nested loop for calculating directory sizes could be slow for large directories. Consider using shutil.disk_usage() or implementing a timeout mechanism for very large directories.

                    try:
-                        size = sum(
-                            os.path.getsize(os.path.join(dirpath, filename))
-                            for dirpath, dirnames, filenames in os.walk(item_path)
-                            for filename in filenames
-                        )
+                        # Use a more efficient approach for large directories
+                        size = 0
+                        for dirpath, dirnames, filenames in os.walk(item_path):
+                            size += sum(os.path.getsize(os.path.join(dirpath, filename)) 
+                                       for filename in filenames if os.path.exists(os.path.join(dirpath, filename)))
                        shutil.rmtree(item_path)

76-76: Improve timestamp comparison for cross-platform compatibility.

Using datetime.fromtimestamp() directly may have timezone issues. Consider using UTC timestamps for consistency.

-                        if os.path.getmtime(file_path) < (datetime.now() - timedelta(hours=24)).timestamp():
+                        file_time = datetime.fromtimestamp(os.path.getmtime(file_path))
+                        cutoff_time = datetime.now() - timedelta(hours=24)
+                        if file_time < cutoff_time:

179-183: Schedule frequency might be too aggressive.

Running cleanup every 2 hours could be resource-intensive, especially the database VACUUM operation. Consider if this frequency is necessary or if it should be configurable.

Consider making the schedule configurable or reducing frequency:

-    cron_schedule="0 */2 * * *",  # Every 2 hours
+    cron_schedule="0 */6 * * *",  # Every 6 hours
scripts/maintenance/cleanup_disk_space.py (4)

19-21: Remove unused imports.

Clean up the unused imports flagged by static analysis.

-import sys
-from datetime import datetime, timedelta
-from pathlib import Path
+from datetime import datetime, timedelta

119-119: Fix unused loop variable.

Rename the unused dirs variable to follow Python conventions.

-            for root, dirs, files in os.walk(log_dir):
+            for root, _dirs, files in os.walk(log_dir):

70-74: Potential performance issue with large directories.

Similar to the Dagster job, the directory size calculation could be slow for very large directories. Consider adding progress feedback or implementing a timeout.

Consider adding progress feedback for large directories:

# Add progress tracking for large directories
file_count = sum(len(filenames) for _, _, filenames in os.walk(item_path))
if file_count > 1000:  # Large directory
    print(f"  Calculating size for large directory ({file_count} files)...")

43-191: Consider consolidating cleanup logic with Dagster job.

There's significant code duplication between this script and the Dagster job implementation. Consider extracting common cleanup logic into a shared module to maintain consistency and reduce duplication.

Create a shared cleanup utilities module that both implementations can use:

# cleanup_utils.py
def cleanup_artifacts_common(path, hours_back, dry_run=False):
    # Common cleanup logic
    pass

def cleanup_logs_common(log_dirs, hours_back, aggressive=False, dry_run=False):
    # Common cleanup logic  
    pass
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3d7a6d and e04ccb0.

📒 Files selected for processing (8)
  • Makefile (1 hunks)
  • Makefile.md (1 hunks)
  • anomstack/jobs/cleanup.py (1 hunks)
  • anomstack/main.py (3 hunks)
  • dagster_fly.yaml (1 hunks)
  • scripts/maintenance/README.md (1 hunks)
  • scripts/maintenance/cleanup_disk_space.py (1 hunks)
  • tests/test_main.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
scripts/maintenance/README.md (1)
scripts/utils/cleanup_dagster_storage.sh (7)
  • usage (430-442)
  • show_cleanup_options (78-105)
  • cleanup_aggressive (151-266)
  • main_menu (361-427)
  • cleanup_standard (128-149)
  • analyze_storage (43-75)
  • cleanup_cli_based (299-324)
scripts/maintenance/cleanup_disk_space.py (1)
scripts/utils/cleanup_dagster_storage.sh (6)
  • usage (430-442)
  • cleanup_aggressive (151-266)
  • show_cleanup_options (78-105)
  • cleanup_cli_based (299-324)
  • cleanup_standard (128-149)
  • analyze_storage (43-75)
🪛 Ruff (0.12.2)
scripts/maintenance/cleanup_disk_space.py

19-19: sys imported but unused

Remove unused import: sys

(F401)


21-21: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


119-119: Loop control variable dirs not used within loop body

Rename unused dirs to _dirs

(B007)

anomstack/jobs/cleanup.py

9-9: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


11-11: dagster.JobDefinition imported but unused

Remove unused import: dagster.JobDefinition

(F401)


71-71: Loop control variable dirs not used within loop body

Rename unused dirs to _dirs

(B007)

🪛 markdownlint-cli2 (0.17.2)
Makefile.md

568-568: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


579-579: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


591-591: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (11)
anomstack/main.py (2)

9-9: LGTM! Clean integration of cleanup module.

The import follows the established pattern used by other job modules in the codebase.


33-33: LGTM! Consistent integration pattern.

The cleanup jobs and schedules are properly appended to the existing tuples, maintaining consistency with how other modules are integrated.

Also applies to: 47-47

dagster_fly.yaml (1)

35-40: Verify impact of aggressive retention policies on operational visibility.

The retention periods have been significantly reduced, particularly for successful runs (6 hours) and skipped runs (12 hours). While this supports the disk cleanup objectives, consider whether these short retention windows might impact:

  • Debugging capabilities for recent issues
  • Historical analysis and trend monitoring
  • Compliance or audit requirements

Consider testing these settings in a development environment first to ensure they don't hinder operational visibility.

tests/test_main.py (1)

12-12: LGTM! Test updates correctly reflect cleanup integration.

The test assertions have been properly updated to account for the new cleanup job and schedule. The comments clearly explain the reasoning behind the calculation changes (2 non-metric jobs total including cleanup).

Also applies to: 16-16, 20-20, 24-24

Makefile (1)

419-437: LGTM! Well-implemented Fly.io cleanup targets.

The new Makefile targets are properly structured with:

  • Appropriate error handling for missing FLY_APP environment variable
  • Clear informational messages for each operation
  • Correct SSH command structure for remote execution

The naming convention and functionality align well with existing Fly.io targets.

scripts/maintenance/README.md (1)

42-85: LGTM! Comprehensive and well-structured documentation.

The new documentation section for cleanup_disk_space.py is excellent:

  • Clear feature descriptions and use cases
  • Practical usage examples with different modes
  • Detailed cleanup targets and safety considerations
  • Consistent formatting with existing documentation

This provides users with all necessary information to effectively use the cleanup functionality.

Makefile.md (1)

565-601: Well-documented disk cleanup functionality.

The documentation provides clear usage examples and explains the different cleanup modes effectively. The retention periods and behavior are well-explained, helping users choose the appropriate cleanup level.

anomstack/jobs/cleanup.py (1)

96-124: Database cleanup implementation looks solid.

The SQLite database cleanup with VACUUM operation is well-implemented. Good error handling and the approach of counting records before deletion provides useful logging information.

scripts/maintenance/cleanup_disk_space.py (3)

43-94: Good implementation with comprehensive error handling.

The artifact cleanup function is well-structured with proper error handling, clear output, and support for both dry-run and aggressive modes. The size calculation and progress reporting are helpful for users.


147-191: Database cleanup with good dry-run support.

The database cleanup function properly implements dry-run mode and provides clear feedback about what would be deleted. The VACUUM operation is correctly placed after the deletion.


255-258: Good warning for high disk usage.

The warning when disk usage remains above 90% after cleanup is helpful for users to understand they may need additional action.

try:
    for root, dirs, files in os.walk(log_dir):
        for file in files:
            if file.endswith(('.log', '.out', '.err')) or 'dagster' in file.lower():
Contributor


🛠️ Refactor suggestion

Broad file matching pattern could be risky.

The condition 'dagster' in file.lower() is quite broad and might match unintended files. Consider being more specific about which files to target.

-                    if file.endswith(('.log', '.out', '.err')) or 'dagster' in file.lower():
+                    if (file.endswith(('.log', '.out', '.err')) or 
+                        file.startswith('dagster') or 
+                        'dagster.log' in file.lower()):

The cleanup job is causing startup crashes, temporarily disabling
it to get the system stable. Will debug and re-enable later.
@github-actions

📊 Test Coverage Report

Coverage: 59% (yellow)

✅ Coverage maintained or improved!

💡 See detailed coverage report in the tests README

- Remove exit on code server health check timeout
- Make nginx startup non-blocking if config test fails
- Remove 'set -e' to allow script to continue despite individual service failures
- This should fix the exit code 1 crashes we've been seeing

The script will now continue startup even if some services
have issues, rather than failing completely.
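The pattern this commit describes, continuing past individual service failures instead of letting `set -e` abort the whole boot, can be sketched as follows. The failing health check is simulated with a stub function; the real script checks `dagster api grpc-health-check` and `nginx -t`:

```shell
# No `set -e`: a failing subsystem logs a warning instead of aborting the whole boot.
start_services() {
    health_check() { return 1; }  # stand-in for a code-server health check that times out

    if health_check; then
        echo "code server healthy"
    else
        echo "warning: code server health check failed, continuing startup"
    fi

    echo "startup complete"
}

boot_log=$(start_services)
```

The trade-off (noted by the review below as well) is that genuine fatal misconfigurations are no longer surfaced at boot, so warnings like these should be monitored externally.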
- Adjusted retention for skipped runs from 12 hours to 1 day (minimum allowed by Dagster)
- Updated retention for successful runs from 6 hours to 1 day (minimum allowed by Dagster)

These changes ensure compliance with Dagster's minimum retention requirements and improve artifact management.
@github-actions

📊 Test Coverage Report

Coverage: 59% (yellow)

✅ Coverage maintained or improved!

💡 See detailed coverage report in the tests README

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (2)
scripts/deployment/start.sh (2)

30-42: Health-check now always succeeds → downstream services may start against a dead code-server.

The function now returns 0 unconditionally (Line 41), so the rest of the stack starts even when dagster api grpc-health-check never became healthy.
Since both the web-server and daemon depend on a live code-server, this can mask boot failures and lead to an endless restart loop later.

Two lightweight mitigations:

-    echo "⚠️ Code server health check timed out, but continuing startup..."
-    return 0
+    echo "❌ Code server health check timed out – giving up after $max_retries retries"
+    return 1        # bubble failure

and keep the early exit 1 already present after the code-server start stanza to abort cleanly.


45-73: start_process_with_retry leaks log lines into the variable that is supposed to hold only the PID.

Because every echo goes to STDOUT, the caller captures all lines:

CODE_SERVER_PID=$(start_process_with_retry …)

CODE_SERVER_PID therefore becomes something like:

🔧 Starting Code Server (attempt 1/3)...
Code Server PID: 1234
✅ Code Server started successfully
1234

Later checks (kill -0 $CODE_SERVER_PID) will fail with “illegal pid” errors.

Minimal, backwards-compatible fix: send diagnostics to STDERR and leave only the raw PID on STDOUT.

@@
-        echo "$name PID: $pid"
+        echo "$name PID: $pid" >&2
@@
-            echo "✅ $name started successfully"
-            echo $pid
+            echo "✅ $name started successfully" >&2
+            printf '%s\n' "$pid"
@@
-            echo "⚠️ $name failed to start, retrying..."
+            echo "⚠️ $name failed to start, retrying..." >&2
@@
-    echo "❌ Failed to start $name after $max_retries attempts"
+    echo "❌ Failed to start $name after $max_retries attempts" >&2

No caller changes needed; they will now receive a clean PID string.
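The reviewer's fix, diagnostics on stderr so that command substitution captures only the PID, can be demonstrated with a minimal version of the function (retry logic omitted; the function name is shortened for the sketch):

```shell
start_process() {
    name=$1; shift
    echo "🔧 Starting $name..." >&2    # diagnostics go to stderr...
    "$@" &
    pid=$!
    echo "$name PID: $pid" >&2         # ...as does the progress line
    printf '%s\n' "$pid"               # stdout carries only the raw PID
}

# $(...) now captures a clean numeric PID, safe to pass to `kill -0`.
captured=$(start_process sleeper sleep 2)
```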

🧹 Nitpick comments (1)
scripts/deployment/start.sh (1)

108-115: Good defensive guard around nginx, but propagate the config error to monitoring.

Nice addition of nginx -t before launch. Consider also writing a sentinel file or emitting a structured log line when the config test fails so external health-checks (Fly checks, Prometheus, etc.) can alert operators that the reverse proxy is missing.

Example:

echo "nginx_disabled" > /tmp/startup_warnings
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c6695f and bdd2dda.

📒 Files selected for processing (2)
  • dagster_fly.yaml (1 hunks)
  • scripts/deployment/start.sh (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dagster_fly.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
scripts/deployment/start.sh (1)

4-5: Re-evaluate the blanket removal of set -e.

set -e aborts the script on the first non-zero exit status, which is often desirable during container boot to surface fatal mis-configurations quickly.
Without it, latent errors (e.g., failed htpasswd, bad export, missing binaries) will be silently ignored and the container may sit in a half-broken state that is harder to detect and debug.

If the intention is merely to tolerate failure of specific subsystems, consider a surgical approach instead:

# keep global safety net
set -euo pipefail

# …later, for commands you really can ignore:
some_maybe_failing_cmd || echo "non-fatal warning"

This keeps startup robust while still letting non-critical steps fail gracefully.

- Add stronger cache busting in deploy_fly.sh with timestamp+random values
- Make Dockerfile actually use CACHEBUST argument to invalidate Docker layers
- Improve Makefile fresh deployment targets to clear builder cache
- Add explicit dockerfile paths to avoid build confusion

These changes ensure --force-rebuild actually forces a complete rebuild,
preventing cached layers from causing deployment issues like we experienced
with the Dagster config fix.
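The stronger cache-bust value can be generated as below, incorporating the review's `tr -d '\n'` fix for the trailing newline from `openssl rand` (with `$$` as a portable fallback where `$RANDOM` is undefined). The build command is echoed rather than executed, since no Docker daemon is assumed here; the real script wires this into `fly deploy`:

```shell
# Timestamp plus random hex: unique per deploy, so the CACHEBUST layer always changes.
CACHEBUST_VALUE="$(date +%s)-$( (openssl rand -hex 4 2>/dev/null || echo "${RANDOM:-$$}") | tr -d '\n')"

# Echoed rather than executed; shows how the value reaches the Dockerfile's ARG.
echo docker build \
    --file docker/Dockerfile.fly \
    --build-arg "CACHEBUST=$CACHEBUST_VALUE" \
    --no-cache .
```

Because the Dockerfile consumes `CACHEBUST` in a `RUN` (or `LABEL`) instruction, any change to this value invalidates that layer and everything after it.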
@github-actions

📊 Test Coverage Report

Coverage: 59% (yellow)

✅ Coverage maintained or improved!

💡 See detailed coverage report in the tests README

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docker/Dockerfile.fly (1)

6-8: Persisting the cache-bust marker adds an unnecessary file to the final image

Writing /tmp/cachebust achieves cache invalidation, but the file remains in every image layer and marginally bloats image size.
Prefer a zero-footprint alternative, e.g. annotate the image with a label:

-# Use CACHEBUST to invalidate cache when needed (this layer changes when CACHEBUST changes)
-RUN echo "Cache bust: $CACHEBUST" > /tmp/cachebust
+# Use CACHEBUST to invalidate cache when needed (layer changes via label)
+LABEL cachebust="$CACHEBUST"

Same cache-breaking effect, no leftover artifact.

scripts/deployment/deploy_fly.sh (2)

238-242: Strip the newline from openssl rand to avoid malformed CACHEBUST value

openssl rand -hex 4 appends \n; the resulting CACHEBUST_VALUE therefore contains an embedded newline, which is passed verbatim to --build-arg.
While Docker tolerates it, some shells or tooling may mis-parse the argument.

-CACHEBUST_VALUE="$(date +%s)-$(openssl rand -hex 4 2>/dev/null || echo $RANDOM)"
+CACHEBUST_VALUE="$(date +%s)-$(openssl rand -hex 4 2>/dev/null | tr -d '\n' || echo $RANDOM)"

(Alternatively, wrap the whole substitution in $( ... | tr -d '\n' ).)


253-255: Consider always passing a CACHEBUST value, even on standard deploys

If the local Docker cache is stale (e.g. after a base-image security update) a normal cached build may silently skip the refresh. Passing a low-entropy, time-based CACHEBUST (e.g. the date truncated to hours) during standard deploys mitigates this while still benefiting from layer caching.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdd2dda and b16e786.

📒 Files selected for processing (3)
  • Makefile (2 hunks)
  • docker/Dockerfile.fly (1 hunks)
  • scripts/deployment/deploy_fly.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
scripts/deployment/deploy_fly.sh (1)

244-251: Good defensive mix of cache-busting flags

Combining --no-cache, an explicit Dockerfile path, and a build-arg gives deterministic force rebuilds. Looks solid.

@andrewm4894 andrewm4894 requested a review from Copilot July 30, 2025 21:12
Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive disk cleanup system for Fly.io instances to address disk space management concerns. The changes include both standalone cleanup capabilities and automated Dagster-based cleanup jobs (currently disabled) to optimize storage usage and prevent disk-full scenarios.

Key changes:

  • New standalone disk cleanup script with normal, aggressive, and dry-run modes
  • Makefile targets for convenient Fly.io disk cleanup operations
  • Dagster cleanup jobs and schedules (temporarily disabled for testing)
  • Enhanced deployment scripts with improved cache busting and error handling
  • More aggressive data retention policies in Fly.io configuration
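The disk-full scenarios the PR aims to prevent can be detected with the standard library alone; a minimal sketch (the function name and the idea of a before/after report are illustrative, not taken from the PR):

```python
import shutil

def disk_usage_pct(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

if __name__ == "__main__":
    before = disk_usage_pct("/")
    # ... cleanup would run here ...
    after = disk_usage_pct("/")
    print(f"disk usage: {before:.1f}% -> {after:.1f}%")
```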

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Summary per file:

  • scripts/maintenance/cleanup_disk_space.py: New standalone script for comprehensive disk cleanup with multiple modes
  • Makefile: Added three new targets for Fly.io disk cleanup operations
  • anomstack/jobs/cleanup.py: New Dagster cleanup job with ops for artifacts, logs, and database cleanup
  • scripts/deployment/start.sh: Improved startup script resilience by removing strict error handling
  • scripts/deployment/deploy_fly.sh: Enhanced cache busting for Docker builds
  • dagster_fly.yaml: More aggressive retention policies for schedules and sensors
  • tests/test_main.py: Updated test expectations to reflect disabled cleanup functionality
  • scripts/maintenance/README.md: Documentation for the new cleanup script
  • Makefile.md: Documentation for new Makefile cleanup targets



if __name__ == "__main__":
    main()
(no newline at end of file)
Copilot AI Jul 30, 2025

Remove trailing whitespace at the end of the file.

Suggested change: end the file with a final newline after main().


# Export for main.py
cleanup_jobs = [cleanup_job]
cleanup_schedules = [cleanup_schedule]
(no newline at end of file)
Copilot AI Jul 30, 2025

Remove trailing whitespace at the end of the file.

Suggested change: end the file with a final newline after the cleanup_schedules assignment.

if file.endswith(('.log', '.out', '.err')) or 'dagster' in file.lower():
    file_path = os.path.join(root, file)
    try:
        file_time = datetime.fromtimestamp(os.path.getmtime(file_path))
Copilot AI Jul 30, 2025

The os.path.getmtime() call is performed twice for the same file - once here and potentially again at line 126. Consider storing the result in a variable to avoid duplicate system calls.

Suggested change

-file_time = datetime.fromtimestamp(os.path.getmtime(file_path))
+file_mtime = os.path.getmtime(file_path)
+file_time = datetime.fromtimestamp(file_mtime)

Comment on lines +75 to +76
# Remove log files older than 24 hours
if os.path.getmtime(file_path) < (datetime.now() - timedelta(hours=24)).timestamp():
Copilot AI Jul 30, 2025

The datetime.now() - timedelta(hours=24) calculation is performed for every file in the loop. Consider calculating this cutoff timestamp once before the loop for better performance.

Suggested change

 # Remove log files older than 24 hours
-if os.path.getmtime(file_path) < (datetime.now() - timedelta(hours=24)).timestamp():
+if os.path.getmtime(file_path) < cutoff_time:

Comment on lines 2 to +4

 # Anomstack Startup Script for Fly.io with improved gRPC connectivity
-set -e
+# Removed 'set -e' to allow script to continue even if some services fail
Copilot AI Jul 30, 2025

Removing set -e changes the script's error handling behavior significantly. Consider using more targeted error handling (like || true for specific commands) rather than disabling strict error checking globally.
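A hedged sketch of that targeted alternative; the function is a stand-in for any flaky, non-critical startup step, not a name from the PR:

```shell
set -e  # keep strict mode for the critical path

status="ok"
start_optional_service() { return 1; }  # stand-in for a flaky, non-critical step

# Tolerate failure of just this step instead of disabling set -e globally.
start_optional_service || status="degraded"

echo "startup complete (status: $status)"
```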

@andrewm4894 andrewm4894 merged commit 5718065 into main Jul 30, 2025
3 checks passed
@andrewm4894 andrewm4894 deleted the disk-cleanup branch July 30, 2025 21:16
@coderabbitai coderabbitai bot mentioned this pull request Aug 3, 2025