Stable Diffusion WebUI Critical CUDA Out of Memory Detection Rule #134
Open
MAVRICK-1 wants to merge 5 commits into prequel-dev:main from MAVRICK-1:feat/cre-2025-0130-roop-cuda-oom
Commits
- 0827db1: Add rules for handling CUDA out of memory errors and log generation f… (MAVRICK-1)
- e6a2773: Merge branch 'main' into feat/cre-2025-0130-roop-cuda-oom (MAVRICK-1)
- 88481b5: Merge branch 'main' into feat/cre-2025-0130-roop-cuda-oom (MAVRICK-1)
- 8268f06: Add new rules for handling CUDA out of memory errors in Stable Diffusion (MAVRICK-1)
- 316d38e: Merge branch 'main' into feat/cre-2025-0130-roop-cuda-oom (MAVRICK-1)
New file (84 lines), the CRE detection rule:

```yaml
rules:
  - metadata:
      kind: prequel
      id: StableDiffusionCUDAOOMDetector
      generation: 1
    cre:
      id: CRE-2025-0330
      severity: 0
      title: Stable Diffusion WebUI Critical CUDA Out of Memory Failure
      category: memory-problem
      author: CRE Community
      description: |
        - The Stable Diffusion WebUI (AUTOMATIC1111) is experiencing critical CUDA out of memory errors during image generation.
        - This typically occurs when attempting to generate high-resolution images or large batch sizes that exceed available GPU VRAM.
        - The failure cascades from initial memory allocation errors to complete WebUI unresponsiveness and service failure.
        - This is one of the most common and disruptive failures affecting Stable Diffusion deployments.
      cause: |
        - GPU VRAM exhaustion due to insufficient memory for model loading and tensor operations.
        - High-resolution image generation (e.g., 1024x1024 or larger) requiring more memory than available.
        - Large batch sizes that multiply memory requirements beyond GPU capacity.
        - Memory fragmentation from previous operations that prevents allocation of required contiguous memory blocks.
        - Inefficient model loading or caching that consumes excessive VRAM.
        - Running multiple concurrent generation processes without proper memory management.
      impact: |
        - Complete service interruption - the WebUI becomes unresponsive and requires manual restart.
        - Loss of current generation progress and any queued generation tasks.
        - Potential CUDA context corruption requiring process restart to recover.
        - User experience degradation with failed image generations and error messages.
        - System instability in multi-user deployments where one user's OOM can affect others.
        - Cascading failures where recovery attempts also fail due to memory constraints.
      tags:
        - memory-exhaustion
        - crash
        - errors
        - service
        - python
        - memory
        - oom-kill
        - critical-failure
        - cuda
        - pytorch
      mitigation: |
        - **Immediate Response:**
          - Restart the Stable Diffusion WebUI process to clear CUDA context and reset memory state.
          - Check GPU memory usage with `nvidia-smi` to verify memory is properly released after restart.
        - **Configuration Adjustments:**
          - Add command line arguments: `--medvram` (moderate memory reduction) or `--lowvram` (aggressive memory reduction).
          - Use `--opt-sdp-no-mem-attention` or `--xformers` to enable memory-efficient attention mechanisms.
          - Set environment variable: `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128`.
        - **Generation Parameter Tuning:**
          - Reduce image resolution (e.g., from 1024x1024 to 768x768 or 512x512).
          - Decrease batch size from 4+ to 1-2 images per generation.
          - Enable "Tiled VAE" extension for high-resolution images to reduce VRAM usage during decoding.
        - **System-Level Solutions:**
          - Upgrade to a GPU with more VRAM (12GB+ recommended for high-resolution work).
          - Monitor GPU memory usage proactively and set alerts before reaching 90% capacity.
          - Implement resource limits in multi-user deployments to prevent memory monopolization.
        - **Preventative Measures:**
          - Install memory monitoring extensions like VRAM-ESTIMATOR to track usage in real time.
          - Educate users on appropriate generation parameters for their hardware.
          - Implement automatic parameter adjustment based on available VRAM.
      references:
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/16114"
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/12992"
        - "https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13878"
        - "https://www.aiarty.com/stable-diffusion-guide/fix-cuda-out-of-memory-stable-diffusion.htm"
      applications:
        - name: stable-diffusion-webui
          processName: python
          repoUrl: "https://github.com/AUTOMATIC1111/stable-diffusion-webui"
          version: "*"
        - name: pytorch
          version: "*"
      impactScore: 9
      mitigationScore: 3
      reports: 1
    rule:
      sequence:
        window: 30s
        event:
          source: cre.log.stable-diffusion
        order:
          - regex: "torch\\.cuda\\.OutOfMemoryError: CUDA out of memory"
          - regex: "Fatal error during image generation|Complete service failure"
```

Review comment on the `applications` block: the only supported list items inside `applications` are `name` and `version`; we can remove the rest.
New file (44 lines), the example log used to exercise the rule (each message is written twice, once with and once without the timestamp prefix):

```
[2025-08-27 12:50:17,000] INFO [WebUI] Initializing Stable Diffusion WebUI
Initializing Stable Diffusion WebUI
[2025-08-27 12:50:17,000] INFO [ModelLoader] Loading checkpoints/v1-5-pruned.safetensors
Loading checkpoints/v1-5-pruned.safetensors
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA available: True
CUDA available: True
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device count: 1
CUDA device count: 1
[2025-08-27 12:50:17,000] INFO [TorchCUDA] CUDA device: GeForce RTX 3060 (6GB)
CUDA device: GeForce RTX 3060 (6GB)
[2025-08-27 12:50:19,000] INFO [WebUI] Starting image generation: 1024x1024, batch_size=4
Starting image generation: 1024x1024, batch_size=4
[2025-08-27 12:50:19,000] INFO [ModelLoader] Loading model to CUDA device
Loading model to CUDA device
[2025-08-27 12:50:20,000] WARN [TorchCUDA] GPU memory usage: 5.8GB/6.0GB (97%)
GPU memory usage: 5.8GB/6.0GB (97%)
[2025-08-27 12:50:20,000] WARN [ModelLoader] High memory usage detected during model loading
High memory usage detected during model loading
[2025-08-27 12:50:21,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 6.00 GiB total capacity; 5.63 GiB already allocated; 156.19 MiB free; 5.74 GiB reserved in total by PyTorch)
[2025-08-27 12:50:21,000] CRITICAL [WebUI] Fatal error during image generation
Fatal error during image generation
[2025-08-27 12:50:21,000] ERROR [ModelLoader] Failed to allocate tensor on device
Failed to allocate tensor on device
[2025-08-27 12:50:21,000] ERROR [TorchCUDA] RuntimeError: CUDA out of memory
RuntimeError: CUDA out of memory
[2025-08-27 12:50:22,000] ERROR [WebUI] Generation process crashed
Generation process crashed
[2025-08-27 12:50:22,000] ERROR [WebUI] Gradio interface becoming unresponsive
Gradio interface becoming unresponsive
[2025-08-27 12:50:22,000] WARN [WebUI] Multiple failed generation attempts detected
Multiple failed generation attempts detected
[2025-08-27 12:50:22,000] ERROR [TorchCUDA] CUDA context may be corrupted
CUDA context may be corrupted
[2025-08-27 12:50:23,000] INFO [WebUI] Attempting to recover from OOM error
Attempting to recover from OOM error
[2025-08-27 12:50:23,000] WARN [TorchCUDA] Clearing CUDA cache
Clearing CUDA cache
[2025-08-27 12:50:23,000] ERROR [WebUI] Recovery failed - WebUI requires restart
Recovery failed - WebUI requires restart
[2025-08-27 12:50:24,000] ERROR [TorchCUDA] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB
[2025-08-27 12:50:24,000] CRITICAL [WebUI] Complete service failure - manual intervention required
Complete service failure - manual intervention required
```
The PR also updates the shared tags file in two hunks:

```diff
@@ -848,6 +848,12 @@ tags:
   - name: cluster-scaling
     displayName: Cluster Scaling
     description: Problems related to Kubernetes cluster scaling operations and capacity management
+  - name: cuda
+    displayName: CUDA
+    description: Problems related to NVIDIA CUDA GPU computing platform and memory management
+  - name: pytorch
+    displayName: PyTorch
+    description: Problems related to PyTorch deep learning framework and tensor operations
   - name: maxmemory
     displayName: Max Memory
     description: Problems related to Redis maxmemory configuration and memory limits
@@ -949,7 +955,7 @@ tags:
     description: Problems related to OpenAI API services including GPT models
   - name: recursive-analysis
     displayName: Recursive Analysis
-    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
+    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustio
   - name: n8n
     displayName: N8N
     description: Problems related to n8n workflow automation platform
```

Review comment on the changed `recursive-analysis` line: there's a typo here ("exhaustio" drops the trailing "n").
Review comment: please use `./bin/ruler id` to generate a valid id, e.g.: