@Sahelisaha04
Contributor

CRE-2025-0140: Stable Diffusion WebUI CUDA Out of Memory Detection

Overview

This pull request adds a Common Reliability Enumeration (CRE) detection rule for CUDA out-of-memory (OOM) errors in the AUTOMATIC1111 Stable Diffusion WebUI during batch processing. The rule identifies failures in which GPU memory exhaustion crashes the entire generation pipeline.

Problem Statement

Stable Diffusion WebUI deployments can hit CUDA out-of-memory errors during batch image generation, with the following consequences:

  • GPU memory allocation fails during tensor operations
  • Entire batch processing pipelines crash mid-execution
  • Generated images are lost without proper error handling
  • API endpoints fail with memory-related errors
  • Production workflows are interrupted and require manual intervention
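As a rough illustration of the kind of log signal involved, the sketch below scans log lines for the phrase "CUDA out of memory", which PyTorch includes in its OOM error messages. It is not the rule from this PR: the pattern, the `find_oom_lines` helper, and the sample lines are assumptions made for illustration, and the actual rule in `sd-webui-oom.yaml` may match on different text.

```python
import re
from typing import Iterable, List

# Assumed pattern: PyTorch OOM errors contain the phrase "CUDA out of memory".
# The CRE rule in this PR may use different or additional match conditions.
OOM_PATTERN = re.compile(r"CUDA out of memory", re.IGNORECASE)


def find_oom_lines(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers of lines that mention a CUDA OOM error."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if OOM_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the PR's test.log.
    sample = [
        "INFO: starting batch generation",
        "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB",
        "INFO: shutting down",
    ]
    print(find_oom_lines(sample))
```

A plain substring match like this only flags individual lines; a production rule would typically also correlate the error with the surrounding batch or API request context.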

CRE Playground Links

Playground Link

test.log

  • Rule File: rules/cre-2025-0140/sd-webui-oom.yaml
  • Test Logs: rules/cre-2025-0140/test.log

Demo Environment

https://github.com/Sahelisaha04/n8n-cre-demo (invitation sent)

Screencast.from.2025-08-28.00-56-04.mp4

docker compose -f docker-compose-simple.yml up log-generator

# Test CUDA OOM detection (should detect failure)
cat tests/test.log | preq -r rules/cre-2025-0140/sd-webui-oom.yaml -d

# Test with freshly generated logs
cat tests/generated_failure.log | preq -r rules/cre-2025-0140/sd-webui-oom.yaml -d

References

Fixes #130
/claim #130

Development

Successfully merging this pull request may close these issues.

Stable Diffusion Web UI: Reproduce A High-Severity Failure & Write a CRE Rule [Multiple Winners] [Submit by August 31 11:59 pm ET]