MAVRICK-1 commented Aug 27, 2025

🎯 Overview

This PR introduces a detection rule for CUDA out-of-memory failures in Stable Diffusion WebUI, one of the most critical and widespread issues affecting AUTOMATIC1111 Stable Diffusion deployments. The rule identifies CUDA memory exhaustion that leads to complete WebUI service failure and requires manual intervention to recover.

CRE Playground Links

CRE-2025-0130 Playground: Test Rule

🚨 Problem Statement

High-Severity Issue: Stable Diffusion WebUI CUDA failures cause:

  • Complete service interruption - WebUI becomes unresponsive and requires manual restart
  • Loss of current image generation progress and any queued generation tasks
  • Potential CUDA context corruption requiring process restart to recover
  • User experience degradation with failed image generations and error messages
  • System instability in multi-user deployments where one user's OOM affects others
  • Cascading failures where recovery attempts also fail due to memory constraints

Why This Matters: Stable Diffusion CUDA failures are particularly dangerous because:

  • High-resolution image generation (1024x1024+) requires massive GPU VRAM
  • Failures often occur mid-generation causing complete data loss
  • AUTOMATIC1111 WebUI has millions of users globally
  • Issues manifest as generic crashes making diagnosis difficult
  • Memory fragmentation prevents allocation of required contiguous memory blocks
  • Requires immediate intervention to restore service functionality

Rule Performance

  • Detection Rate: 2 critical hits with sequence matching
  • Processing Speed: 64.52K lines/s
  • Window: 30-second detection window captures failure cascade
  • False Positive Rate: Low (specific PyTorch CUDA error patterns)
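To illustrate what sequence matching inside a 30-second window means, here is a minimal Python sketch. It is not the preq engine or the rule from this PR: the function name `matches_in_window`, the two-step `SEQUENCE`, and the sample timestamps are all hypothetical, and the rule's actual conditions are not shown in this conversation.

```python
import re
from datetime import datetime, timedelta

# Hypothetical ordered patterns, chosen from the error strings listed in this PR.
SEQUENCE = [
    re.compile(r"torch\.cuda\.OutOfMemoryError: CUDA out of memory"),
    re.compile(r"Recovery failed - WebUI requires restart"),
]
WINDOW = timedelta(seconds=30)


def matches_in_window(events, sequence=SEQUENCE, window=WINDOW):
    """Return True if every pattern occurs, in order, within one window.

    `events` is an iterable of (timestamp, message) tuples sorted by time.
    The window starts at the event that matches the first pattern.
    """
    events = list(events)
    for i, (start, message) in enumerate(events):
        if not sequence[0].search(message):
            continue
        step = 1  # index of the next pattern we are waiting for
        for ts, later in events[i + 1:]:
            if step == len(sequence) or ts - start > window:
                break
            if sequence[step].search(later):
                step += 1
        if step == len(sequence):
            return True
    return False


# Made-up sample events for illustration only.
t0 = datetime(2025, 8, 27, 13, 17, 0)
log = [
    (t0, "torch.cuda.OutOfMemoryError: CUDA out of memory"),
    (t0 + timedelta(seconds=5), "CUDA context may be corrupted"),
    (t0 + timedelta(seconds=12), "Recovery failed - WebUI requires restart"),
]
print(matches_in_window(log))  # True
```

If the recovery failure arrived more than 30 seconds after the initial OOM error, this sketch would not report a match, which is the trade-off a fixed window makes between catching the cascade and avoiding unrelated hits.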

📊 Stable Diffusion Issues Covered

| # | Issue Type | Example Error Pattern |
|---|------------|-----------------------|
| 1 | CUDA Memory Exhaustion | torch.cuda.OutOfMemoryError: CUDA out of memory |
| 2 | Model Loading Failures | Failed to allocate tensor on device |
| 3 | Generation Process Crashes | Fatal error during image generation |
| 4 | WebUI Unresponsiveness | Gradio interface becoming unresponsive |
| 5 | Recovery Failures | Recovery failed - WebUI requires restart |
| 6 | CUDA Context Corruption | CUDA context may be corrupted |
| 7 | Complete Service Failure | Complete service failure - manual intervention required |
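
The issue types above can be read as a simple lookup from log text to category. The following Python sketch shows that idea under the assumption that each example pattern is matched as a plain substring; the rule itself may match differently. The names `ISSUE_PATTERNS` and `classify` and the sample log lines are hypothetical.

```python
# Patterns copied from the "Example Error Pattern" column of this PR.
# Treating them as plain substrings is an assumption made for this sketch.
ISSUE_PATTERNS = {
    "CUDA Memory Exhaustion": "torch.cuda.OutOfMemoryError: CUDA out of memory",
    "Model Loading Failures": "Failed to allocate tensor on device",
    "Generation Process Crashes": "Fatal error during image generation",
    "WebUI Unresponsiveness": "Gradio interface becoming unresponsive",
    "Recovery Failures": "Recovery failed - WebUI requires restart",
    "CUDA Context Corruption": "CUDA context may be corrupted",
    "Complete Service Failure": "Complete service failure - manual intervention required",
}


def classify(line):
    """Return the first issue type whose pattern appears in `line`, else None."""
    for issue, pattern in ISSUE_PATTERNS.items():
        if pattern in line:
            return issue
    return None


# Made-up log lines for illustration only.
print(classify("ERROR CUDA context may be corrupted"))  # CUDA Context Corruption
print(classify("Model loaded successfully"))            # None
```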

🧪 Testing & Validation

CRE Rule Testing

cd stable-diffusion-demo
cat logs/sd-webui-cuda-oom.log | preq -r ../rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml -d

Test Results:
Screenshot from 2025-08-27 13-17-40

🎬 Demo Environment

Repo link (private invitation already sent): https://github.com/MAVRICK-1/cuda-oom

Screencast.from.2025-08-27.13-19-35.mp4
./start-demo.sh
cat logs/roop-cuda-oom.log | preq -r stable-diffusion-cuda-oom.yaml -d

Fixes #130
/claim #130

rules:
  - metadata:
      kind: prequel
      id: StableDiffusionCUDAOOMDetector

Please use ./bin/ruler id to generate a valid ID, e.g.:

❯ ./bin/ruler id
D3ZNiWma64wUnDYq6NSqYj

version: "*"
- name: pytorch
version: "*"
impactScore: 9

The only supported list items inside applications are name and version; we can remove the rest.

- name: recursive-analysis
displayName: Recursive Analysis
description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
description: Problems where systems enter recursive self-analysis loops leading to resource exhaustio

There's a typo here: "exhaustio" should be "exhaustion".

Development

Successfully merging this pull request may close these issues.

Stable Diffusion Web UI: Reproduce A High-Severity Failure & Write a CRE Rule [Multiple Winners] [Submit by August 31 11:59 pm ET]
