
Conversation

@DajanaV (Contributor) commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17080

This PR handles the previously unhandled case where the add following the mul_mat is a repeating (broadcast) add: fusion is now skipped for such adds so they fall back to the regular path. The pattern occurs in diffusion models in stable-diffusion.cpp.
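
For context, a minimal sketch of the kind of shape guard this describes, assuming ggml's graph representation. The function name and surrounding logic are illustrative, not the actual upstream diff; `ggml_are_same_shape`, `GGML_OP_ADD`, and the `ggml_tensor` fields are real ggml API. An add whose operands differ in shape is, by ggml's definition, a repeating add.

```cpp
#include "ggml.h"

// Illustrative guard (hypothetical name): decide whether a MUL_MAT followed
// by an ADD may be fused into a single bias kernel.
static bool can_fuse_mul_mat_bias(const struct ggml_tensor * mul_mat,
                                  const struct ggml_tensor * add) {
    if (add->op != GGML_OP_ADD || add->src[0] != mul_mat) {
        return false; // the add must consume the mul_mat result directly
    }
    // A repeating add is one whose operands do not have the same shape: the
    // second operand is broadcast (repeated) to match the first. The fused
    // bias kernel does not handle that case, so fall back to running
    // MUL_MAT and ADD as separate kernels.
    if (!ggml_are_same_shape(add->src[0], add->src[1])) {
        return false;
    }
    return true;
}
```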

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

A comprehensive analysis of version 23744cc2-5f98-4016-b122-588df101b23a against baseline 0797ab8c-9bfc-4911-8c5b-22da73432e86 shows minimal performance impact and no modifications to core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: _RegexMask constructor (+0.082%, +0.018 ns) in build.bin.libllama.so
  • Highest Throughput change: _Optional_base constructor (+0.171%, +0.040 ns) in build.bin.llama-tts
  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)

Inference Performance Impact:
The performance changes do not affect tokenization or inference functions. Since llama_decode, llama_encode, and llama_tokenize show no measurable changes, tokens per second performance remains unaffected. The modified functions are standard library constructors unrelated to the inference pipeline.

Power Consumption Analysis:
All binaries maintain stable power consumption with variations under 0.001%. The largest changes occur in:

  • build.bin.libllama.so: -0.97 nanojoules (-0.0003%)
  • build.bin.llama-cvector-generator: -0.18 nanojoules (-0.00006%)

Technical Analysis:

  • Flame Graph: The _RegexMask constructor shows a simple, single-frame execution pattern with no nested calls, indicating the performance change stems from micro-architectural effects rather than algorithmic modifications
  • CFG Comparison: Identical control flow graphs and assembly code between versions confirm no functional changes to the constructor implementation
  • Code Review: PR #121 (mirroring upstream PR #17080, "CUDA: skip fusion for repeating adds in bias") adds shape validation checks so the CUDA backend skips fusing repeating adds into the bias path, a pattern hit by diffusion models; since it touches no standard library code, the negligible variations in those constructors are incidental. A hypothetical graph illustrating the pattern follows this list.
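
To make the pattern concrete, here is a hypothetical ggml graph (shapes assumed for illustration; the exact tensors in stable-diffusion.cpp are not given in this thread) in which the bias lacks the batch dimension, so the add after the mul_mat must repeat it:

```cpp
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // build the graph structure only
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * w   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 128);   // weights [64, 128]
    struct ggml_tensor * x   = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 64, 16, 4); // input   [64, 16, 4]
    struct ggml_tensor * out = ggml_mul_mat(ctx, w, x);                           // output  [128, 16, 4]

    // The bias has no batch dimension, so the add repeats it along dim 2;
    // with this PR, the CUDA backend skips MUL_MAT+ADD fusion for this node.
    struct ggml_tensor * bias = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 16); // bias [128, 16]
    struct ggml_tensor * y    = ggml_add(ctx, out, bias); // repeating add

    (void) y;
    ggml_free(ctx);
    return 0;
}
```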

The analysis confirms these are measurement artifacts from compiler optimization variations rather than meaningful performance regressions. The core inference pipeline remains unaffected.

@DajanaV force-pushed the main branch 27 times, most recently from 81cedf2 to 4c7638f on November 10, 2025 at 19:07
@DajanaV force-pushed the main branch 30 times, most recently from f333350 to 9c4623f on November 18, 2025 at 09:10
