[MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel #26753
Conversation
You can commit the suggested changes from lintrunner.
Add comment for potential future work
Pull request overview
This PR optimizes the CPU implementation of the QuickGelu activation function for the special case where alpha=1.0 (equivalent to the Silu activation). The optimization avoids unnecessary scaling operations and adds a vectorized element-wise multiplication function to improve performance.
- Adds a vectorized MlasEltwiseMul function for efficient element-wise multiplication
- Optimizes the QuickGelu computation by skipping the scaling when alpha=1.0
- Replaces the scalar multiplication loop with a vectorized MlasEltwiseMul call (see the sketch below)
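As a rough illustration of the change summarized above, here is a minimal sketch of the optimized compute path. The ComputeQuickGeluSketch helper, the scratch-buffer handling, and the MlasEltwiseMul signature are assumptions drawn from the file summaries below, not the PR's literal diff; only MlasComputeLogistic is a pre-existing MLAS routine.

```cpp
// Minimal sketch, not the PR's actual code. The MlasEltwiseMul signature
// (input_a, input_b, output, count) is assumed from the review summary;
// MlasComputeLogistic is an existing MLAS routine.
#include <cstddef>

#include "core/mlas/inc/mlas.h"

// Hypothetical helper illustrating the alpha == 1.0 (Silu) fast path.
void ComputeQuickGeluSketch(const float* input, float* output, float* scratch,
                            size_t count, float alpha) {
  const float* logistic_input = input;
  if (alpha != 1.0f) {
    // General QuickGelu path: scale the input by alpha before the sigmoid.
    for (size_t i = 0; i < count; ++i) {
      scratch[i] = alpha * input[i];
    }
    logistic_input = scratch;
  }
  // output = sigmoid(alpha * x); the alpha == 1 path skips the scaling loop.
  MlasComputeLogistic(logistic_input, output, count);
  // output = x * sigmoid(alpha * x), now one vectorized call instead of a
  // scalar multiplication loop (MlasEltwiseMul is hypothetical here).
  MlasEltwiseMul(input, output, output, count);
}
```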
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/eltwise.cpp | Implements vectorized element-wise multiplication function MlasEltwiseMul<float> following the pattern of existing MlasEltwiseAdd |
| onnxruntime/core/mlas/inc/mlas.h | Adds template declaration for MlasEltwiseMul function |
| onnxruntime/contrib_ops/cpu/activations.h | Modifies QuickGelu kernel to branch on alpha value, avoiding scaling for alpha=1.0 and using vectorized multiplication for final step |
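For context, a hedged sketch of what a vectorized element-wise multiply modeled on the existing MlasEltwiseAdd pattern could look like. The MLAS_FLOAT32X4 wrappers mirror MLAS's internal SIMD helpers (declared in the library's internal mlasi.h header); the exact spellings and the remainder handling are assumptions, not the PR's actual code.

```cpp
// Hypothetical sketch of MlasEltwiseMul<float>'s inner loop; not the PR's code.
// Assumes the MLAS_FLOAT32X4 wrappers from MLAS's internal mlasi.h header.
#include <cstddef>

#include "mlasi.h"

void EltwiseMulSketch(const float* left, const float* right, float* output, size_t n) {
  size_t i = 0;
  // Main loop: process four floats per iteration via the portable SIMD wrappers.
  for (; i + 4 <= n; i += 4) {
    MLAS_FLOAT32X4 l = MlasLoadFloat32x4(left + i);
    MLAS_FLOAT32X4 r = MlasLoadFloat32x4(right + i);
    MlasStoreFloat32x4(output + i, MlasMultiplyFloat32x4(l, r));
  }
  // Scalar tail for any remaining elements.
  for (; i < n; ++i) {
    output[i] = left[i] * right[i];
  }
}
```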
Description
The Silu activation is essentially the same as QuickGelu, but with the scaling factor (alpha) equal to 1. In customer models containing Silu, the graph optimizer suite correctly fuses these nodes into a QuickGelu node with alpha = 1. This PR optimizes the QuickGelu implementation for alpha = 1 by avoiding the scaling step and vectorizing the subsequent element-wise multiplication.
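For reference, using the standard definitions of the two activations (with σ the logistic sigmoid):

```latex
% Silu is QuickGelu with the scaling factor fixed at 1.
\[
  \operatorname{QuickGelu}(x) = x \cdot \sigma(\alpha x),
  \qquad
  \operatorname{Silu}(x) = x \cdot \sigma(x) = \operatorname{QuickGelu}(x)\big|_{\alpha = 1}.
\]
```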
Tests:
QuickGelu with alpha = 1 is already covered by existing tests, so no new tests are necessary (see onnxruntime/onnxruntime/test/contrib_ops/activation_op_test.cc, line 126 at commit f98c756).
Performance improvements measured:
Roughly a 2.5% throughput boost on a customer model with many Silu activations.
Motivation and Context
Low-hanging-fruit performance improvements that deliver easy, immediate wins.