Commit d6a372a

and Raghuveer Devulapalli authored

qgemm: optimize avxvnni QGEMM inner kernel for M=1 (#22952)

Add specialized path for M=1 case that exploits additional available ymm registers for deeper inner kernel loop unrolling.

Performance impact (measured on 13th Gen Intel(R) Core(TM) i9-13900K):

- 30% improvement in single threaded QGEMM kernels with M = 1
- 7% reduction in average inference time on small quantized model where all kernels have M=1

```
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          | Time   | CPU     | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 |     4330 |     3137 |    4330 |    3136 |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 |     9027 |     6385 |    9027 |    6385 |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 |    17867 |    12499 |   17866 |   12498 |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
```

---------

Co-authored-by: Raghuveer Devulapalli <[email protected]>
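To illustrate the general idea behind the M=1 path, here is a minimal scalar sketch in C++. It is not the MLAS kernel: the function names, the plain row-major K x N layout for B, and the unroll factor of four are all assumptions made for illustration, and the real kernel works on packed B with ymm registers in assembly. The sketch only shows why spare registers help. With a single row of A, the accumulators that would otherwise hold extra rows can instead hold independent partial sums, so the inner loop over K can be unrolled more deeply.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference M=1 product: c[col] = sum over p of a[p] * b[p][col].
// Hypothetical layout: B is row-major K x N (not the MLAS packed format).
std::vector<int32_t> GemvU8S8Reference(const std::vector<uint8_t>& a,
                                       const std::vector<int8_t>& b,
                                       std::size_t n, std::size_t k) {
    std::vector<int32_t> c(n, 0);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t p = 0; p < k; ++p) {
            c[col] += int32_t(a[p]) * int32_t(b[p * n + col]);
        }
    }
    return c;
}

// Same product with the K loop unrolled by four (assumed factor).
// The four accumulators are independent, which mirrors using extra
// registers to break the dependency chain on a single accumulator.
std::vector<int32_t> GemvU8S8Unrolled(const std::vector<uint8_t>& a,
                                      const std::vector<int8_t>& b,
                                      std::size_t n, std::size_t k) {
    std::vector<int32_t> c(n, 0);
    for (std::size_t col = 0; col < n; ++col) {
        int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        std::size_t p = 0;
        for (; p + 4 <= k; p += 4) {
            acc0 += int32_t(a[p + 0]) * int32_t(b[(p + 0) * n + col]);
            acc1 += int32_t(a[p + 1]) * int32_t(b[(p + 1) * n + col]);
            acc2 += int32_t(a[p + 2]) * int32_t(b[(p + 2) * n + col]);
            acc3 += int32_t(a[p + 3]) * int32_t(b[(p + 3) * n + col]);
        }
        // Remainder when K is not a multiple of four.
        for (; p < k; ++p) {
            acc0 += int32_t(a[p]) * int32_t(b[p * n + col]);
        }
        c[col] = (acc0 + acc1) + (acc2 + acc3);
    }
    return c;
}
```

Both functions return the same result for any input whose sums fit in `int32_t`, since integer addition is associative. For example, calling each with a 6-element `a` and a 6 x 2 `b` exercises both the unrolled body and the remainder loop.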

1 parent a7df7b1 commit d6a372a

5 files changed: +558 -99 lines changed

onnxruntime
- core/mlas/lib
  - amd64
    - QgemmU8X8KernelAvx2.asm
    - mlasi.inc
  - x86_64
    - QgemmU8X8KernelAvx2.S
    - asmmacro.h
- test/mlas/bench
  - bench_qgemm.cpp

