
Conversation

@DajanaV (Contributor) commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17080

This PR handles the previously unhandled case where the add following the mul_mat is a repeating (broadcast) add: fusion is now skipped for such adds so they fall back to the regular path. The pattern occurs in diffusion models in stable-diffusion.cpp.
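
For context, a minimal sketch of the kind of shape guard this describes, assuming ggml's graph representation. The function name and surrounding logic are illustrative, not the actual upstream diff; `ggml_are_same_shape`, `GGML_OP_ADD`, and the `ggml_tensor` fields are real ggml API. An add whose operands differ in shape is, by ggml's definition, a repeating add.

```cpp
#include "ggml.h"

// Illustrative guard (hypothetical name): decide whether a MUL_MAT followed
// by an ADD may be fused into a single bias kernel.
static bool can_fuse_mul_mat_bias(const struct ggml_tensor * mul_mat,
                                  const struct ggml_tensor * add) {
    if (add->op != GGML_OP_ADD || add->src[0] != mul_mat) {
        return false; // the add must consume the mul_mat result directly
    }
    // A repeating add is one whose operands do not have the same shape: the
    // second operand is broadcast (repeated) to match the first. The fused
    // bias kernel does not handle that case, so fall back to running
    // MUL_MAT and ADD as separate kernels.
    if (!ggml_are_same_shape(add->src[0], add->src[1])) {
        return false;
    }
    return true;
}
```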

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

A comprehensive analysis of version 23744cc2-5f98-4016-b122-588df101b23a against baseline 0797ab8c-9bfc-4911-8c5b-22da73432e86 shows minimal performance impact and no modifications to core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: _RegexMask constructor (+0.082%, +0.018 ns) in build.bin.libllama.so
  • Highest Throughput change: _Optional_base constructor (+0.171%, +0.040 ns) in build.bin.llama-tts
  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)

Inference Performance Impact:
The performance changes do not affect tokenization or inference functions. Since llama_decode, llama_encode, and llama_tokenize show no measurable changes, tokens per second performance remains unaffected. The modified functions are standard library constructors unrelated to the inference pipeline.

Power Consumption Analysis:
All binaries maintain stable power consumption with variations under 0.001%. The largest changes occur in:

  • build.bin.libllama.so: -0.97 nanojoules (-0.0003%)
  • build.bin.llama-cvector-generator: -0.18 nanojoules (-0.00006%)

Technical Analysis:

  • Flame Graph: The _RegexMask constructor shows a simple, single-frame execution pattern with no nested calls, indicating the performance change stems from micro-architectural effects rather than algorithmic modifications
  • CFG Comparison: Identical control flow graphs and assembly code between versions confirm no functional changes to the constructor implementation
  • Code Review: PR #121 (mirroring upstream PR #17080, "CUDA: skip fusion for repeating adds in bias") adds shape validation checks so the CUDA backend skips fusing repeating adds into the bias path, a pattern hit by diffusion models; since it touches no standard library code, the negligible variations in those constructors are incidental. A hypothetical graph illustrating the pattern follows this list.
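
To make the pattern concrete, here is a hypothetical ggml graph (shapes assumed for illustration; the exact tensors in stable-diffusion.cpp are not given in this thread) in which the bias lacks the batch dimension, so the add after the mul_mat must repeat it:

```cpp
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // build the graph structure only
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * w   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 128);   // weights [64, 128]
    struct ggml_tensor * x   = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 64, 16, 4); // input   [64, 16, 4]
    struct ggml_tensor * out = ggml_mul_mat(ctx, w, x);                           // output  [128, 16, 4]

    // The bias has no batch dimension, so the add repeats it along dim 2;
    // with this PR, the CUDA backend skips MUL_MAT+ADD fusion for this node.
    struct ggml_tensor * bias = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 16); // bias [128, 16]
    struct ggml_tensor * y    = ggml_add(ctx, out, bias); // repeating add

    (void) y;
    ggml_free(ctx);
    return 0;
}
```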

The analysis confirms these are measurement artifacts from compiler optimization variations rather than meaningful performance regressions. The core inference pipeline remains unaffected.

@DajanaV force-pushed the main branch 27 times, most recently from 81cedf2 to 4c7638f on November 10, 2025 at 19:07
@DajanaV force-pushed the main branch 30 times, most recently from f333350 to 9c4623f on November 18, 2025 at 09:10
