Commit 29e371d

config: reduce job timeout from 60 to 15 minutes
- Update DEFAULT_MINUTES from 60 to 15 in timeout sensor
- Change max_runtime_seconds from 3600 to 900 in all dagster.yaml files
- Update environment variables in .example.env and profiles/demo.env
- Update documentation to reflect new 15-minute default timeout
- Most anomaly detection tasks should complete well under 15 minutes
- Prevents resource exhaustion from stuck long-running jobs

Affected configurations:
- anomstack/sensors/timeout.py: DEFAULT_MINUTES = 15
- dagster*.yaml: max_runtime_seconds = 900
- .example.env: ANOMSTACK_KILL_RUN_AFTER_MINUTES = 15
- .example.env: ANOMSTACK_MAX_RUNTIME_SECONDS_TAG = 900

1 parent 4665786 commit 29e371d

9 files changed, 11 additions and 11 deletions
.example.env

Lines changed: 2 additions & 2 deletions
@@ -111,9 +111,9 @@ DAGSTER_CODE_SERVER_HOST=

 # max runtime for a job in dagster
 # https://docs.dagster.io/deployment/run-monitoring#general-run-timeouts
-ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=3600
+ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=900
 # kill runs that exceed this many minutes
-ANOMSTACK_KILL_RUN_AFTER_MINUTES=60
+ANOMSTACK_KILL_RUN_AFTER_MINUTES=15

 # postgres related env vars
 ANOMSTACK_POSTGRES_USER=postgres_user

README.md

Lines changed: 1 addition & 1 deletion
@@ -699,7 +699,7 @@ Below you see an example of an LLM alert via email. In this case we add a descri

 Sometimes Dagster runs can get stuck. Anomstack ships with a sensor that
 terminates any run exceeding a configurable timeout. By default runs are killed
-after 60 minutes. You can override this in your `dagster.yaml` or via the
+after 15 minutes. You can override this in your `dagster.yaml` or via the
 `ANOMSTACK_KILL_RUN_AFTER_MINUTES` environment variable. You can also invoke the
 cleanup manually with:

anomstack/sensors/timeout.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 )
 from dagster._core.errors import DagsterUserCodeUnreachableError

-DEFAULT_MINUTES = 60
+DEFAULT_MINUTES = 15

 def _load_config_timeout_minutes() -> int:
     env_val = os.getenv("ANOMSTACK_KILL_RUN_AFTER_MINUTES")
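
The hunk above shows only the first line of `_load_config_timeout_minutes`, so the rest of its body is not visible in this commit. As a rough illustration of how an environment override with a 15-minute fallback could behave, here is a minimal sketch. The helper name `load_timeout_minutes`, its `env` parameter, and the handling of invalid values are assumptions made for this example, not the project's actual implementation.

```python
import os
from typing import Mapping, Optional

DEFAULT_MINUTES = 15  # default introduced by this commit (previously 60)


def load_timeout_minutes(env: Optional[Mapping[str, str]] = None) -> int:
    """Hypothetical helper: resolve the kill-run timeout in minutes.

    Reads ANOMSTACK_KILL_RUN_AFTER_MINUTES from `env` (os.environ when
    omitted) and falls back to DEFAULT_MINUTES when the variable is
    unset, empty, non-numeric, or not positive. The fallback rules are
    assumptions for illustration only.
    """
    env = os.environ if env is None else env
    raw = env.get("ANOMSTACK_KILL_RUN_AFTER_MINUTES")
    if raw is None or not raw.strip():
        return DEFAULT_MINUTES
    try:
        minutes = int(raw)
    except ValueError:
        return DEFAULT_MINUTES
    return minutes if minutes > 0 else DEFAULT_MINUTES
```

Passing a plain dict instead of relying on `os.environ` keeps the sketch easy to exercise, for example `load_timeout_minutes({"ANOMSTACK_KILL_RUN_AFTER_MINUTES": "30"})`.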

dagster.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300 # 5 minutes to start
   cancel_timeout_seconds: 180 # 3 minutes to cancel
-  max_runtime_seconds: 3600 # 1 hour max runtime per run
+  max_runtime_seconds: 900 # 15 minutes max runtime per run
   poll_interval_seconds: 60 # Check every minute

 storage:

dagster_docker.yaml

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300 # 5 minutes to start
   cancel_timeout_seconds: 180 # 3 minutes to cancel
-  max_runtime_seconds: 3600 # 1 hour max runtime per run
+  max_runtime_seconds: 900 # 15 minutes max runtime per run
   max_resume_run_attempts: 2 # Resume runs after worker crashes (DockerRunLauncher only)
   poll_interval_seconds: 60 # Check every minute

dagster_fly.yaml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300 # 5 minutes to start (increased for cold starts)
   cancel_timeout_seconds: 180 # 3 minutes to cancel (increased)
-  max_runtime_seconds: 3600 # 1 hour max runtime per run
+  max_runtime_seconds: 900 # 15 minutes max runtime per run
   poll_interval_seconds: 30 # Check every 30 seconds (more frequent)

 # Disable telemetry

docs/docs/configuration/environment-variables.md

Lines changed: 2 additions & 2 deletions
@@ -181,8 +181,8 @@ Lightweight defaults to prevent disk space issues.

 | Variable | Required | Description | Default | Example |
 |----------|----------|-------------|---------|---------|
-| `ANOMSTACK_MAX_RUNTIME_SECONDS_TAG` | No | Max job runtime in seconds | `3600` | `7200` |
-| `ANOMSTACK_KILL_RUN_AFTER_MINUTES` | No | Kill long-running jobs after N minutes | `60` | `120` |
+| `ANOMSTACK_MAX_RUNTIME_SECONDS_TAG` | No | Max job runtime in seconds | `900` | `1800` |
+| `ANOMSTACK_KILL_RUN_AFTER_MINUTES` | No | Kill long-running jobs after N minutes | `15` | `30` |

 ## 🐳 Docker & Deployment
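
The two variables in the table above express a runtime limit in different units (seconds and minutes), and this commit moves both defaults together: 900 seconds equals 15 minutes, just as 3600 seconds equalled 60 minutes. The sketch below shows one way to check that a pair of values describes the same limit. The function name `timeouts_agree` is hypothetical, and the commit does not state whether Anomstack requires the two settings to match.

```python
def timeouts_agree(max_runtime_seconds: int, kill_after_minutes: int) -> bool:
    """Hypothetical check: do the two settings describe the same limit?

    max_runtime_seconds corresponds to ANOMSTACK_MAX_RUNTIME_SECONDS_TAG
    and kill_after_minutes to ANOMSTACK_KILL_RUN_AFTER_MINUTES.
    """
    return max_runtime_seconds == kill_after_minutes * 60


# New defaults from this commit and the defaults they replace.
new_defaults_match = timeouts_agree(900, 15)
old_defaults_match = timeouts_agree(3600, 60)
```

A check like this could catch a configuration where only one of the two values was updated, such as 3600 seconds paired with 15 minutes.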

docs/docs/storage-optimization.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ run_monitoring:
   enabled: true
   start_timeout_seconds: 300
   cancel_timeout_seconds: 180
-  max_runtime_seconds: 3600
+  max_runtime_seconds: 900
   poll_interval_seconds: 60

 # Disabled telemetry to reduce disk writes

profiles/demo.env

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ ANOMSTACK_MODEL_PATH=local:///data/models

 # max runtime for a job in dagster
 # https://docs.dagster.io/deployment/run-monitoring#general-run-timeouts
-ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=3600
+ANOMSTACK_MAX_RUNTIME_SECONDS_TAG=900

 # Enable Netdata
 ANOMSTACK__NETDATA__INGEST_DEFAULT_SCHEDULE_STATUS=RUNNING
