Implement FP32 kleidiai Gemv #26302

JonathanC-ARM · 2025-10-14T15:45:44Z

Description

Implements a special SGEMM path that uses GEMV kernels when M or N is 1.
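To make the dispatch condition concrete, here is a minimal sketch of choosing between a GEMM and a GEMV path, plus a naive reference GEMV for the M == 1 case. All names (`SgemmPath`, `SelectSgemmPath`, `ReferenceGemvM1`) are hypothetical and are not the MLAS or KleidiAI API; the real kernels operate on packed data and are considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not the MLAS or KleidiAI API.
enum class SgemmPath { Gemm, Gemv };

// Pick the GEMV path when the product degenerates to matrix-vector form.
inline SgemmPath SelectSgemmPath(std::size_t M, std::size_t N) {
    return (M == 1 || N == 1) ? SgemmPath::Gemv : SgemmPath::Gemm;
}

// Naive row-major reference for the M == 1 case:
// c[1 x N] = a[1 x K] * B[K x N].
inline std::vector<float> ReferenceGemvM1(const std::vector<float>& a,
                                          const std::vector<float>& B,
                                          std::size_t K, std::size_t N) {
    assert(a.size() == K && B.size() == K * N);
    std::vector<float> c(N, 0.0f);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            c[n] += a[k] * B[k * N + n];
        }
    }
    return c;
}
```

A reference like this is mainly useful for checking an optimized path against known values in a unit test.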

Additionally, this PR introduces a microkernel interface built on the typedefs provided by KleidiAI. This simplifies the code and removes constructs such as the ternary operations that selected between SME1 and SME2 kernels.
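As a rough illustration of the interface idea, the sketch below uses a table of function pointers that is selected once, so call sites no longer branch on SME1 vs SME2. `SketchUKernel`, `GetSketchUKernel`, `TilesAlongM`, and the step values are all invented for this example; the actual KleidiAI typedefs have different members and signatures.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a KleidiAI ukernel typedef: a table of function
// pointers that every kernel variant fills in with the same signatures.
struct SketchUKernel {
    std::size_t (*get_m_step)();
    std::size_t (*get_n_step)();
};

// Placeholder variants; the step values are invented for illustration only.
namespace sketch_sme1 {
inline std::size_t m_step() { return 2; }
inline std::size_t n_step() { return 8; }
}  // namespace sketch_sme1

namespace sketch_sme2 {
inline std::size_t m_step() { return 4; }
inline std::size_t n_step() { return 16; }
}  // namespace sketch_sme2

// The SME1/SME2 decision is made in exactly one place and the result is
// returned by const reference, so callers do not copy the table.
inline const SketchUKernel& GetSketchUKernel(bool has_sme2) {
    static const SketchUKernel sme1{sketch_sme1::m_step, sketch_sme1::n_step};
    static const SketchUKernel sme2{sketch_sme2::m_step, sketch_sme2::n_step};
    return has_sme2 ? sme2 : sme1;
}

// Call sites go through the table and never mention SME1 or SME2.
inline std::size_t TilesAlongM(const SketchUKernel& k, std::size_t M) {
    assert(k.get_m_step != nullptr);
    const std::size_t step = k.get_m_step();
    return (M + step - 1) / step;  // ceiling division
}
```

The design choice here is to trade a branch at every call site for one indirection through the table.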

Indicative Performance

In lieu of a production model where GEMV is a large contributor to the network, I opted to create a mini model for testing. It contains thousands of randomized matmul variants, with GEMV cases distributed throughout.

Using onnxruntime perf test, I was able to halve the total inference time versus MLAS with this model.

More benchmarks to come shortly.

hariharans29 · 2025-10-14T19:50:14Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-14T19:50:34Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-16T17:10:08Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-16T17:10:27Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-22T17:19:31Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-22T17:19:52Z

Azure Pipelines successfully started running 4 pipeline(s).

edgchen1 · 2025-10-23T15:31:45Z

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp

+kai_matmul_clamp_f32_f32p_f32p_ukernel sgemm_gemm = GetKleidiAISGemmUKernel();
+kai_matmul_clamp_f32_f32_f32p_ukernel sgemm_gemv = GetKleidiAISGemvUKernel();


GetKleidiAIXUKernel() returns const&. Do we need to make a copy here?

Suggested change

-kai_matmul_clamp_f32_f32p_f32p_ukernel sgemm_gemm = GetKleidiAISGemmUKernel();
-kai_matmul_clamp_f32_f32_f32p_ukernel sgemm_gemv = GetKleidiAISGemvUKernel();
+const kai_matmul_clamp_f32_f32p_f32p_ukernel& sgemm_gemm = GetKleidiAISGemmUKernel();
+const kai_matmul_clamp_f32_f32_f32p_ukernel& sgemm_gemv = GetKleidiAISGemvUKernel();

Updated to const in the latest push.
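The point of the suggestion can be shown with a small self-contained sketch: binding the result of a getter that returns `const&` to a value makes a copy, while binding it to a `const&` does not. `CountedKernel` and `GetCountedKernel` are hypothetical stand-ins and not part of MLAS or KleidiAI; the copy counter exists only to make the difference observable.

```cpp
#include <cassert>

// Global copy counter, wrapped in a function to avoid a non-inline global.
inline int& CopyCount() {
    static int n = 0;
    return n;
}

// Hypothetical stand-in for a ukernel struct; counts copy constructions.
struct CountedKernel {
    int id;
    explicit CountedKernel(int i) : id(i) {}
    CountedKernel(const CountedKernel& other) : id(other.id) { ++CopyCount(); }
};

// Mirrors the shape of a getter that returns a const reference.
inline const CountedKernel& GetCountedKernel() {
    static const CountedKernel k{7};
    return k;
}

// Binding by value copies the object behind the reference.
inline int CopiesWhenBoundByValue() {
    const int before = CopyCount();
    CountedKernel k = GetCountedKernel();
    assert(k.id == 7);
    return CopyCount() - before;
}

// Binding by const reference reuses the original object.
inline int CopiesWhenBoundByReference() {
    const int before = CopyCount();
    const CountedKernel& k = GetCountedKernel();
    assert(k.id == 7);
    return CopyCount() - before;
}
```

For a small table of function pointers the copy is cheap, so this is mostly about intent and consistency rather than measurable cost.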

edgchen1 · 2025-10-23T15:38:10Z

onnxruntime/core/mlas/lib/qgemm.cpp

    //No fallback and putting in guards
-    if(MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()){
-    ArmKleidiAI::MlasDynamicQGemmBatch(Shape, DataParams, BatchN, ThreadPool);
+    if(ArmKleidiAI::SMEInfo::CanUseSME2){


there are other places that need to be updated, like:

onnxruntime/onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc

Line 218 in b3ba580

if (!CPUIDInfo::GetCPUIDInfo().HasArm_SME()) {

onnxruntime/onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp

Line 24 in b3ba580

if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()) {

I might be missing some.

I think it would be worth making a helper function like MlasIsDynamicQGemmAvailable that has the appropriate checks and using that instead.

Added the updated checks in various places like these in the latest push.

I think it would be worth making a helper function like MlasIsDynamicQGemmAvailable that has the appropriate checks and using that instead.

To clarify, this was the main suggestion.
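A minimal sketch of what such a helper could look like, assuming the availability check reduces to a couple of feature flags. Only the idea of a single `MlasIsDynamicQGemmAvailable`-style function comes from the review comment; `SketchCpuFeatures`, its fields, and the exact condition are assumptions made for illustration and do not reflect the real MLAS checks.

```cpp
#include <cassert>

// Hypothetical feature snapshot; real code would query the CPU ID info.
struct SketchCpuFeatures {
    bool kleidiai_enabled = false;  // assumption: a build or runtime switch
    bool has_sme = false;
    bool has_sme2 = false;
};

// One place that encodes the rule. If the requirement later relaxes from
// SME2 to "SME1 or SME2", only this function changes, not every caller.
inline bool SketchIsDynamicQGemmAvailable(const SketchCpuFeatures& f) {
    return f.kleidiai_enabled && f.has_sme2;
}

// Example caller: the operator and the unit test would both ask the helper
// instead of repeating the raw feature check.
inline const char* DescribeDynamicQGemm(const SketchCpuFeatures& f) {
    return SketchIsDynamicQGemmAvailable(f) ? "available" : "unavailable";
}
```

Centralizing the check this way keeps the operator, the MLAS entry point, and the tests from drifting apart when the hardware requirement changes.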

hariharans29 · 2025-10-28T20:24:58Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-28T20:25:16Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-30T21:14:40Z

onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp

  void Test(size_t M, size_t N, size_t K, size_t BatchSize) {
    // Currently, MlasDynamicQGemmBatch() and associated functions require SME or else they are no-ops.
-    if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()) {
+    if (!ArmKleidiAI::SMEInfo::CanUseSME2) {


Nit: I guess the Gtest skip comment needs corresponding update too.

hariharans29 · 2025-10-30T21:22:30Z

onnxruntime/core/mlas/lib/qgemm.cpp

    //No fallback and putting in guards
-    if(MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()){
-    ArmKleidiAI::MlasDynamicQGemmBatch(Shape, DataParams, BatchN, ThreadPool);
+    if(ArmKleidiAI::SMEInfo::CanUseSME2){


I guess that after #26301 is merged, the checks for SME2 will go away, i.e. this can then run on both SME1 and SME2?

Yes, that's correct.

One change I've made in the latest push is to move this structure out of our KleidiAI code and into mlasi.h, removing the ArmKleidiAI namespacing around it. That seemed like a sensible place for it, given that other similar CPU-feature code exists there.

Signed-off-by: Jonathan Clohessy <[email protected]>

hariharans29 · 2025-11-11T17:33:14Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-11-11T17:33:34Z

Azure Pipelines successfully started running 4 pipeline(s).

edgchen1 reviewed Oct 23, 2025

patryk-kaiser-ARM mentioned this pull request Oct 24, 2025

Fix: Disable KleidiAI on systems with SME1 but not SME2 #26399

Closed

JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from a3f4f5b to e8ab1b1 Compare October 24, 2025 15:46

hariharans29 reviewed Oct 30, 2025

hariharans29 mentioned this pull request Oct 31, 2025

Implement multithreading in qgemm_kleidi #26301

Open

JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from 17d822c to 4afc95c Compare November 11, 2025 14:14

JonathanC-ARM and others added 3 commits November 11, 2025 14:16

Implement FP32 kleidiai Gemv

7560add

Signed-off-by: Jonathan Clohessy <[email protected]>

Update const for kernel interface and sme checks

0504a85

Signed-off-by: Jonathan Clohessy <[email protected]>

Modify SME Detection struct location and logic

1d9b7c8

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from 4afc95c to 1d9b7c8 Compare November 11, 2025 15:53

4 participants