
Conversation

hariharans29 (Member) commented Dec 8, 2025

Description

The Silu activation is the same as QuickGelu with the scaling factor (alpha) set to 1. In customer models containing Silu, the graph optimizer suite already fuses the relevant nodes into a QuickGelu node with alpha = 1. This PR optimizes the QuickGelu CPU implementation for the alpha = 1 case by skipping the redundant scaling and vectorizing the final element-wise multiplication.
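
For reference, QuickGelu computes x * sigmoid(alpha * x), so with alpha = 1 it reduces to Silu, x * sigmoid(x). The scalar sketch below is an illustration only, not the kernel code in this PR; it just shows the shape of the fast path: the per-element scaling by alpha disappears, and the remaining x * sigmoid(x) multiply is the part that can be handed to a vectorized element-wise multiply such as MlasEltwiseMul.

```cpp
// Illustrative sketch only (not the actual ONNX Runtime kernel).
// QuickGelu(x) = x * sigmoid(alpha * x); Silu(x) = x * sigmoid(x), i.e. alpha = 1.
#include <cmath>
#include <cstddef>

void QuickGeluReference(const float* input, float* output, size_t n, float alpha) {
  if (alpha == 1.0f) {
    // Silu fast path: no per-element scaling by alpha.
    // x * sigmoid(x) == x / (1 + exp(-x)); the final multiply is what the
    // vectorized element-wise multiply takes over in the optimized kernel.
    for (size_t i = 0; i < n; ++i) {
      output[i] = input[i] / (1.0f + std::exp(-input[i]));
    }
  } else {
    // General QuickGelu path: scale the sigmoid argument by alpha.
    for (size_t i = 0; i < n; ++i) {
      output[i] = input[i] / (1.0f + std::exp(-alpha * input[i]));
    }
  }
}
```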

Tests:
Existing tests already cover QuickGelu with alpha = 1, so no new tests are needed (see the existing test comment: // Silu = x*sigmoid(x), i.e., alpha = 1.0f.).

Performance improvements measured:
About a 2.5% throughput boost was measured on a customer model that contains many Silu activations.

Motivation and Context

A low-hanging-fruit performance improvement that yields an easy, immediate win for models with many Silu activations.

@hariharans29 hariharans29 changed the title WIP: [MLAS] [DO NOT REVIEW] Implement vectorized Silu operation WIP: [MLAS] [DO NOT REVIEW] Implement vectorized fused Silu operation Dec 9, 2025
@hariharans29 hariharans29 changed the title WIP: [MLAS] [DO NOT REVIEW] Implement vectorized fused Silu operation WIP: [MLAS] Improve performance of Silu activation path within the QuickGelu CPU kernel Dec 10, 2025

github-actions bot left a comment


You can commit the suggested changes from lintrunner.

@hariharans29 hariharans29 changed the title WIP: [MLAS] Improve performance of Silu activation path within the QuickGelu CPU kernel WIP: [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel Dec 10, 2025
Add comment for potential future work

Copilot AI left a comment


Pull request overview

This PR optimizes the CPU implementation of the QuickGelu activation function for the special case where alpha=1.0 (equivalent to the Silu activation). The optimization avoids unnecessary scaling operations and adds a vectorized element-wise multiplication function to improve performance.

  • Adds vectorized MlasEltwiseMul function for efficient element-wise multiplication
  • Optimizes QuickGelu computation by skipping scaling when alpha=1.0
  • Replaces scalar multiplication loop with vectorized MlasEltwiseMul call

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Files changed:
  • onnxruntime/core/mlas/lib/eltwise.cpp: Implements the vectorized element-wise multiplication function MlasEltwiseMul<float>, following the pattern of the existing MlasEltwiseAdd (see the illustrative sketch after this list)
  • onnxruntime/core/mlas/inc/mlas.h: Adds the template declaration for the MlasEltwiseMul function
  • onnxruntime/contrib_ops/cpu/activations.h: Modifies the QuickGelu kernel to branch on the alpha value, avoiding the scaling for alpha = 1.0 and using the vectorized multiplication for the final step
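
To make the MlasEltwiseMul row concrete, here is a minimal sketch of the general technique: a SIMD main loop plus a scalar tail. This is not the MLAS code itself (per the table above, the real implementation follows the existing MlasEltwiseAdd pattern and MLAS's portable vector abstraction); the function name EltwiseMulSketch and the use of raw SSE intrinsics are illustrative assumptions.

```cpp
// Hedged sketch of a vectorized element-wise multiply (x86/SSE for illustration).
#include <immintrin.h>
#include <cstddef>

void EltwiseMulSketch(const float* a, const float* b, float* out, size_t n) {
  size_t i = 0;
  // Main loop: multiply four floats per iteration.
  for (; i + 4 <= n; i += 4) {
    const __m128 va = _mm_loadu_ps(a + i);
    const __m128 vb = _mm_loadu_ps(b + i);
    _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
  }
  // Scalar tail for the remaining elements.
  for (; i < n; ++i) {
    out[i] = a[i] * b[i];
  }
}
```

In the Silu case, a and b would both point at the input tensor and its sigmoid, replacing the previous scalar multiplication loop in the kernel's final step.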


@hariharans29 hariharans29 changed the title WIP: [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel Dec 11, 2025
