config: reduce job timeout from 60 to 15 minutes

andrewm4894 · andrewm4894 · commit 29e371dbccbd · 2025-07-23T15:03:37.000+01:00
- Update DEFAULT_MINUTES from 60 to 15 in timeout sensor
- Change max_runtime_seconds from 3600 to 900 in all dagster.yaml files
- Update environment variables in .example.env and profiles/demo.env
- Update documentation to reflect new 15-minute default timeout
- Most anomaly detection tasks should complete well under 15 minutes
- Prevents resource exhaustion from stuck long-running jobs

Affected configurations:
- anomstack/sensors/timeout.py: DEFAULT_MINUTES = 15
- dagster*.yaml: max_runtime_seconds = 900
- .example.env: ANOMSTACK_KILL_RUN_AFTER_MINUTES = 15
- .example.env: ANOMSTACK_MAX_RUNTIME_SECONDS_TAG = 900
- profiles/demo.env: ANOMSTACK_MAX_RUNTIME_SECONDS_TAG = 900
diff --git a/.example.env b/.example.env
@@ -111,9 +111,9 @@ DAGSTER_CODE_SERVER_HOST=
 
 # max runtime for a job in dagster
 # https://docs.dagster.io/deployment/run-monitoring#general-run-timeouts
-ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=3600
+ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=900
 # kill runs that exceed this many minutes
-ANOMSTACK_KILL_RUN_AFTER_MINUTES=60
+ANOMSTACK_KILL_RUN_AFTER_MINUTES=15
 
 # postgres related env vars
 ANOMSTACK_POSTGRES_USER=postgres_user
diff --git a/README.md b/README.md
@@ -699,7 +699,7 @@ Below you see an example of an LLM alert via email. In this case we add a descri
 
 Sometimes Dagster runs can get stuck. Anomstack ships with a sensor that
 terminates any run exceeding a configurable timeout. By default runs are killed
-after 60 minutes. You can override this in your `dagster.yaml` or via the
+after 15 minutes. You can override this in your `dagster.yaml` or via the
 `ANOMSTACK_KILL_RUN_AFTER_MINUTES` environment variable. You can also invoke the
 cleanup manually with:
 
diff --git a/anomstack/sensors/timeout.py b/anomstack/sensors/timeout.py
@@ -12,7 +12,7 @@
 )
 from dagster._core.errors import DagsterUserCodeUnreachableError
 
-DEFAULT_MINUTES = 60
+DEFAULT_MINUTES = 15
 
 def _load_config_timeout_minutes() -> int:
     env_val = os.getenv("ANOMSTACK_KILL_RUN_AFTER_MINUTES")
diff --git a/dagster.yaml b/dagster.yaml
@@ -11,7 +11,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300   # 5 minutes to start
   cancel_timeout_seconds: 180  # 3 minutes to cancel
-  max_runtime_seconds: 3600    # 1 hour max runtime per run
+  max_runtime_seconds: 900     # 15 minutes max runtime per run
   poll_interval_seconds: 60    # Check every minute
 
 storage:
diff --git a/dagster_docker.yaml b/dagster_docker.yaml
@@ -72,7 +72,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300   # 5 minutes to start
   cancel_timeout_seconds: 180  # 3 minutes to cancel
-  max_runtime_seconds: 3600    # 1 hour max runtime per run
+  max_runtime_seconds: 900     # 15 minutes max runtime per run
   max_resume_run_attempts: 2   # Resume runs after worker crashes (DockerRunLauncher only)
   poll_interval_seconds: 60    # Check every minute
 
diff --git a/dagster_fly.yaml b/dagster_fly.yaml
@@ -44,7 +44,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300   # 5 minutes to start (increased for cold starts)
   cancel_timeout_seconds: 180  # 3 minutes to cancel (increased)
-  max_runtime_seconds: 3600    # 1 hour max runtime per run
+  max_runtime_seconds: 900     # 15 minutes max runtime per run
   poll_interval_seconds: 30    # Check every 30 seconds (more frequent)
 
 # Disable telemetry
diff --git a/docs/docs/configuration/environment-variables.md b/docs/docs/configuration/environment-variables.md
@@ -181,8 +181,8 @@ Lightweight defaults to prevent disk space issues.
 
 | Variable | Required | Description | Default | Example |
 |----------|----------|-------------|---------|---------|
-| `ANOMSTACK_MAX_RUNTIME_SECONDS_TAG` | No | Max job runtime in seconds | `3600` | `7200` |
-| `ANOMSTACK_KILL_RUN_AFTER_MINUTES` | No | Kill long-running jobs after N minutes | `60` | `120` |
+| `ANOMSTACK_MAX_RUNTIME_SECONDS_TAG` | No | Max job runtime in seconds | `900` | `1800` |
+| `ANOMSTACK_KILL_RUN_AFTER_MINUTES` | No | Kill long-running jobs after N minutes | `15` | `30` |
 
 ## 🐳 Docker & Deployment
 
diff --git a/docs/docs/storage-optimization.md b/docs/docs/storage-optimization.md
@@ -60,7 +60,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300
   cancel_timeout_seconds: 180
-  max_runtime_seconds: 3600
+  max_runtime_seconds: 900
   poll_interval_seconds: 60
 
 # Disabled telemetry to reduce disk writes
diff --git a/profiles/demo.env b/profiles/demo.env
@@ -17,7 +17,7 @@ ANOMSTACK_MODEL_PATH=local:///data/models
 
 # max runtime for a job in dagster
 # https://docs.dagster.io/deployment/run-monitoring#general-run-timeouts
-ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=3600
+ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=900
 
 # Enable Netdata
 ANOMSTACK__NETDATA__INGEST_DEFAULT_SCHEDULE_STATUS=RUNNING
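
The diff only shows the first lines of `_load_config_timeout_minutes`, so the following is a minimal sketch of how an environment override on top of the new default might be resolved. It is not the code in `anomstack/sensors/timeout.py`. The helper names `resolve_timeout_minutes` and `timeouts_agree` are hypothetical, and the fallback to the default on invalid input is an assumption. The second helper checks that the sensor value in minutes and the `max_runtime_seconds` value in seconds stay in step, as this commit keeps them (15 × 60 = 900).

```python
import os

DEFAULT_MINUTES = 15  # new default introduced by this commit


def resolve_timeout_minutes(env=None) -> int:
    """Hypothetical helper: use the env override if it is a positive integer.

    Assumption: unset, non-numeric, or non-positive values fall back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get("ANOMSTACK_KILL_RUN_AFTER_MINUTES")
    if raw is None:
        return DEFAULT_MINUTES
    try:
        minutes = int(raw)
    except ValueError:
        return DEFAULT_MINUTES
    return minutes if minutes > 0 else DEFAULT_MINUTES


def timeouts_agree(minutes: int, max_runtime_seconds: int) -> bool:
    """Hypothetical check: sensor timeout (minutes) matches run monitoring (seconds)."""
    return minutes * 60 == max_runtime_seconds
```

Passing a plain dict as `env` lets the lookup be exercised without touching the real process environment, for example `resolve_timeout_minutes({"ANOMSTACK_KILL_RUN_AFTER_MINUTES": "30"})`.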
