
Conversation

@chraac
Contributor

@chraac chraac commented Nov 6, 2025

Summary

Fixes test-backend-ops failures in ggml-hexagon by correcting the index calculation for binary operations.

Changes

  • Fixed the index calculation in binary ops to align with the CPU implementation in ggml/src/ggml-cpu/ops.cpp

Testing

Before

[ADD] NMSE = 0.940370531 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): FAIL
[SUB] NMSE = 0.949198773 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): FAIL
[MUL] NMSE = 0.006240991 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): not supported [HTP0]
[ADD] NMSE = 0.713263381 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): FAIL
[SUB] NMSE = 0.699697783 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): FAIL
[MUL] NMSE = 0.004771670 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): not supported [HTP0]
[ADD] NMSE = 0.671834829 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): FAIL
[SUB] NMSE = 0.667890500 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): FAIL
[MUL] NMSE = 0.004894060 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): not supported [HTP0]
[ADD] NMSE = 0.882866389 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): FAIL
[SUB] NMSE = 0.889010400 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): FAIL
[MUL] NMSE = 0.005464608 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): FAIL
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): not supported [HTP0]
[ADD] NMSE = 0.822931792 > 0.000000100   ADD(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): FAIL
[SUB] NMSE = 0.861861988 > 0.000000100   SUB(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): FAIL
[MUL] NMSE = 0.005597148 > 0.000000100   MUL(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): FAIL

After

  ADD(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,1,1],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,1],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,2],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,2],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,2,2],nf=1): not supported [HTP0]
  ADD(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): OK
  SUB(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): OK
  MUL(type=f32,ne=[10,5,4,3],nr=[2,2,2,2],nf=1): OK

@chraac chraac marked this pull request as draft November 6, 2025 02:09
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 6, 2025
@max-krasnyansky
Collaborator

max-krasnyansky commented Nov 6, 2025

@chraac Ah. We used to have that code. However, the scalar divisions are very expensive on Hexagon (it's a function call into a library, etc.), so we removed them as part of the optimizations. The challenge is to fix test-backend-ops without introducing divs in the inner loops. Up for it? ;-)
It'd be OK to do a few divs in the outer functions but not in the inner loops.
The change as it is right now adds 5 extra divisions per loop iteration.

@lhez
Collaborator

lhez commented Nov 6, 2025

The ne's are runtime constants. You can take a look at #15872, which converts / and % into bit shifts and multiplies via precalculation on the CPU.

@chraac
Contributor Author

chraac commented Nov 6, 2025

However the scalar divisions are very expensive on Hexagon (its a function call into a library, etc).

Yeah, I noticed division-related symbols in the dynamic library import table. Does this mean Hexagon lacks native hardware div/mod instructions and relies on software library calls instead? If so, that would explain why avoiding these operations is important.

(screenshot: division-related symbols in the import table)

It'd be OK to do a few divs in the outer functions but not in the inner loops.

Interesting, I could have a try then.

@max-krasnyansky
Collaborator

ne's are runtime constants. You can take a look at #15872. It converts / and % to bit shift and mul by precalculation on CPU.

Sweet! I was thinking of getting rid of divs altogether (i.e., have the host precompute the inner-loop indices), but this is much more generic.

@max-krasnyansky
Collaborator

Does this mean Hexagon lacks native hardware div/mod instructions and relies on software library calls instead?

Yep. No divs in the HW.
That fastdiv()/fastmod() stuff that lhez mentioned looks perfect.


static inline uint32_t fastdiv(uint32_t n, const uint32_t mp, const uint32_t l) {
// Compute high 32 bits of n * mp
const uint32_t hi = (uint32_t) (((uint64_t) n * mp) >> 32); // mulhi(n, mp)
Contributor Author

@chraac chraac Nov 6, 2025


I thought we could avoid the hi 32-bit calc here if we can ensure n * mp < std::numeric_limits<uint32_t>::max().

uint32_t mod11[2];
init_fastdiv_values(ne13, &mod13[0], &mod13[1]);
init_fastdiv_values(ne12, &mod12[0], &mod12[1]);
init_fastdiv_values(ne11, &mod11[0], &mod11[1]);
Contributor Author


Like @lhez said, maybe we could move this to the CPU once the tensor is initialized, and save it in htp_tensor?

Collaborator


Yep. The host would be even better.

Collaborator


We're going to need that for the FP16 matmuls that have to deal with broadcasting.
So yeah, the host would be the best place to precompute this. We should be able to take advantage of the graph caching to avoid precomputing it several times.

h->div21 = init_fastdiv_values(h->ne[2] * h->ne[1]);
h->div3 = init_fastdiv_values(h->ne[3]);
h->div2 = init_fastdiv_values(h->ne[2]);
h->div1 = init_fastdiv_values(h->ne[1]);
Contributor Author


We’re computing the fast-div parameters in init_htp_tensor, which runs for every DSP-dispatched op, even though many ops don’t need broadcasting.

Could we initialize them in ggml_backend_hexagon_buffer_init_tensor and store them per tensor to avoid recomputation? The catch is we don’t currently have a per-tensor custom context, so we’d need to add one...


{
auto * ctx = static_cast<ggml_backend_hexagon_buffer_context *>(t->buffer->context);
const auto &tensor_ctx = ctx->get_tensor_ctx(t);
Contributor Author


Lazy init of the ggml_backend_hexagon_buffer_context.
