
Conversation

@chraac
Contributor

@chraac chraac commented Nov 6, 2025

Summary

Fixes test-backend-ops failures in ggml-hexagon by correcting the index calculation for binary operations.

Changes

  • Fixed the index calculation in binary ops to align with the CPU implementation in ggml/src/ggml-cpu/ops.cpp

Testing

Before

[ADD] NMSE = 0.940370531 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): FAIL
[SUB] NMSE = 0.949198773 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): FAIL
[MUL] NMSE = 0.006240991 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): not supported [HTP0]
[ADD] NMSE = 0.713263381 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): FAIL
[SUB] NMSE = 0.699697783 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): FAIL
[MUL] NMSE = 0.004771670 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): not supported [HTP0]
[ADD] NMSE = 0.671834829 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): FAIL
[SUB] NMSE = 0.667890500 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): FAIL
[MUL] NMSE = 0.004894060 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): not supported [HTP0]
[ADD] NMSE = 0.882866389 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): FAIL
[SUB] NMSE = 0.889010400 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): FAIL
[MUL] NMSE = 0.005464608 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): not supported [HTP0]
[ADD] NMSE = 0.822931792 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): FAIL
[SUB] NMSE = 0.861861988 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): FAIL
[MUL] NMSE = 0.005597148 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): FAIL

After

  ADD(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): OK

@chraac chraac marked this pull request as draft November 6, 2025 02:09
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 6, 2025
@max-krasnyansky
Collaborator

max-krasnyansky commented Nov 6, 2025

@chraac Ah. We used to have that code. However, the scalar divisions are very expensive on Hexagon (it's a function call into a library, etc.), so we removed them as part of the optimizations. The challenge is to fix test-backend-ops without introducing divs in the inner loops. Up for it? ;-)
It'd be OK to do a few divs in the outer functions but not in the inner loops.
The change as it is right now adds 5 extra divisions per loop iteration.

@lhez
Collaborator

lhez commented Nov 6, 2025

The ne's are runtime constants. You can take a look at #15872, which converts / and % into bit shifts and multiplies via precalculation on the CPU.

@chraac
Contributor Author

chraac commented Nov 6, 2025

However the scalar divisions are very expensive on Hexagon (its a function call into a library, etc).

Yeah, I noticed division-related symbols in the dynamic library import table. Does this mean Hexagon lacks native hardware div/mod instructions and relies on software library calls instead? If so, that would explain why avoiding these operations is important.

(screenshot: division-related symbols in the import table)

It'd be OK to do a few divs in the outer functions but not in the inner loops.

Interesting, I could have a try then.

@max-krasnyansky
Collaborator

ne's are runtime constants. You can take a look at #15872. It converts / and % to bit shift and mul by precalculation on CPU.

Sweet! I was thinking of getting rid of divs altogether (i.e., have the host precompute the inner-loop indices), but this is much more generic.

@max-krasnyansky
Collaborator

Does this mean Hexagon lacks native hardware div/mod instructions and relies on software library calls instead?

Yep. No divs in the HW.
That fastdiv()/fastmod() stuff that lhez mentioned looks perfect.


static inline uint32_t fastdiv(uint32_t n, const uint32_t mp, const uint32_t l) {
// Compute high 32 bits of n * mp
const uint32_t hi = (uint32_t) (((uint64_t) n * mp) >> 32); // mulhi(n, mp)
Contributor Author

@chraac chraac Nov 6, 2025


I thought we could avoid the hi 32-bit calc here if we can ensure n * mp < std::numeric_limits<uint32_t>::max().

uint32_t mod11[2];
init_fastdiv_values(ne13, &mod13[0], &mod13[1]);
init_fastdiv_values(ne12, &mod12[0], &mod12[1]);
init_fastdiv_values(ne11, &mod11[0], &mod11[1]);
Contributor Author


Like @lhez said, maybe we could move this to the CPU once the tensor is initialized, and save it in htp_tensor?

Collaborator


Yep. The host would be even better.

Collaborator


We're going to need that for the FP16 matmuls that have to deal with broadcasting.
So yeah, the host would be the best place to precompute this. We should be able to take advantage of the graph caching to avoid precomputing it several times.

h->div21 = init_fastdiv_values(h->ne[2] * h->ne[1]);
h->div3 = init_fastdiv_values(h->ne[3]);
h->div2 = init_fastdiv_values(h->ne[2]);
h->div1 = init_fastdiv_values(h->ne[1]);
Contributor Author


We’re computing the fast-div parameters in init_htp_tensor, which runs for every DSP-dispatched op, even though many ops don’t need broadcasting.

Could we initialize them in ggml_backend_hexagon_buffer_init_tensor and store them per tensor to avoid recomputation? The catch is we don’t currently have a per-tensor custom context, so we’d need to add one...


{
auto * ctx = static_cast<ggml_backend_hexagon_buffer_context *>(t->buffer->context);
const auto &tensor_ctx = ctx->get_tensor_ctx(t);
Contributor Author


Lazy init of the ggml_backend_hexagon_buffer_context.
