ggml-hexagon: fix test-backend-ops failures on specific binary ops
#17042
base: master
Conversation
@chraac Ah. We used to have that code. However the scalar divisions are very expensive on Hexagon (it's a function call into a library, etc). So we removed that as part of the optimizations. The challenge is to fix
Sweet! I was thinking of getting rid of divs altogether (i.e. have the host precompute the inner-loop indices), but this is much more generic.
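For context on where those divs come from: in ggml's binary ops, each src0 row index has to be unflattened and then wrapped into src1's (possibly smaller) extents for broadcasting, which costs several scalar divisions and modulos per row. The loop below is an illustrative sketch modeled on the CPU backend's convention (ne01/ne02 for src0, ne11/ne12/ne13 for src1), not the Hexagon kernel itself.

```cpp
#include <cstdint>

// Illustrative mapping of a flat src0 row index to the matching src1 row for a
// broadcasting binary op (e.g. ADD). In ggml, src1's extents divide src0's.
static inline void src1_broadcast_indices(uint32_t ir,                        // flat row index over src0
                                          uint32_t ne01, uint32_t ne02,       // src0 row extents
                                          uint32_t ne11, uint32_t ne12, uint32_t ne13,  // src1 extents
                                          uint32_t * i11, uint32_t * i12, uint32_t * i13) {
    // Unflatten ir into (i03, i02, i01) ...
    const uint32_t i03 = ir / (ne02 * ne01);
    const uint32_t i02 = (ir - i03 * ne02 * ne01) / ne01;
    const uint32_t i01 = ir - i03 * ne02 * ne01 - i02 * ne01;

    // ... then wrap each coordinate into src1's extents for broadcasting.
    *i13 = i03 % ne13;
    *i12 = i02 % ne12;
    *i11 = i01 % ne11;
    // Every '/' and '%' above is a scalar division on the DSP, which is what
    // host-side precompute or fastdiv is meant to avoid.
}
```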
static inline uint32_t fastdiv(uint32_t n, const uint32_t mp, const uint32_t l) {
    // Compute high 32 bits of n * mp
    const uint32_t hi = (uint32_t) (((uint64_t) n * mp) >> 32); // mulhi(n, mp)
Thought we could avoid the high-32-bit calc here if we can ensure n * mp < std::numeric_limits<uint32_t>::max().
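For reference, a self-contained sketch of the magic-multiplier scheme the snippet appears to follow (Granlund/Montgomery style): precompute a multiplier mp and shift l per divisor, then n / d becomes a multiply-high, an add and a shift. Note that mp is typically on the order of 2^32 / d, so n * mp only stays below 2^32 for very small n, which is why the high-32-bit multiply is hard to drop in general. The names (init_fastdiv_sketch, fastdiv_sketch) and the main() check are illustrative, not the PR's code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Precompute l = ceil(log2(d)) and mp = floor(2^32 * (2^l - d) / d) + 1 for divisor d > 0.
static inline void init_fastdiv_sketch(uint32_t d, uint32_t * mp, uint32_t * l) {
    uint32_t shift = 0;
    while (shift < 32 && (uint32_t{1} << shift) < d) {
        shift++;
    }
    *l  = shift;
    *mp = (uint32_t) ((((uint64_t{1} << 32) * ((uint64_t{1} << shift) - d)) / d) + 1);
}

// q = (mulhi(n, mp) + n) >> l. The 64-bit multiply is needed because n * mp generally
// exceeds 32 bits; it could only be dropped if n * mp is guaranteed to fit in uint32_t.
static inline uint32_t fastdiv_sketch(uint32_t n, uint32_t mp, uint32_t l) {
    const uint32_t hi = (uint32_t) (((uint64_t) n * mp) >> 32);
    return (hi + n) >> l;  // assumes n < 2^31 so hi + n cannot overflow (true for tensor indices)
}

int main() {
    // Exhaustive check over small divisors and dividends.
    for (uint32_t d = 1; d < 100; d++) {
        uint32_t mp, l;
        init_fastdiv_sketch(d, &mp, &l);
        for (uint32_t n = 0; n < 100000; n++) {
            assert(fastdiv_sketch(n, mp, l) == n / d);
        }
    }
    printf("fastdiv sketch matches integer division\n");
    return 0;
}
```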
uint32_t mod11[2];
init_fastdiv_values(ne13, &mod13[0], &mod13[1]);
init_fastdiv_values(ne12, &mod12[0], &mod12[1]);
init_fastdiv_values(ne11, &mod11[0], &mod11[1]);
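Since these are stored as mod* parameters, the remainder presumably comes from the same precomputed values; a hedged sketch of how a fastmodulo can be derived from fastdiv (reusing fastdiv_sketch from the sketch above, names illustrative):

```cpp
#include <cstdint>

// From the earlier sketch: q = (mulhi(n, mp) + n) >> l for a divisor with params (mp, l).
uint32_t fastdiv_sketch(uint32_t n, uint32_t mp, uint32_t l);

// n % d can then be recovered with one multiply and one subtract, no scalar division.
static inline uint32_t fastmodulo_sketch(uint32_t n, uint32_t d, uint32_t mp, uint32_t l) {
    return n - d * fastdiv_sketch(n, mp, l);
}
```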
Like @lhez said, maybe we could move this to the CPU side at tensor init and save it in htp_tensor?
Yep. The host would be even better.
We're going to need that for FP16 MatMuls that need to deal with the broadcasting.
So yeah, the host would be best to precompute this. We should be able to take advantage of the graph caching to avoid precomputing it several times.
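A rough sketch of what host-side precompute might look like, assuming the fast-div parameters travel with whatever per-tensor descriptor is already shipped to the DSP; htp_tensor_meta and its fields are placeholders, and init_fastdiv_sketch refers to the earlier sketch. With graph caching this would run once per cached graph rather than once per dispatch.

```cpp
#include <cstdint>

// From the earlier sketch: computes the magic multiplier (mp) and shift (l) for divisor d.
void init_fastdiv_sketch(uint32_t d, uint32_t * mp, uint32_t * l);

// Placeholder for the per-tensor data shipped to the DSP alongside ne[]/nb[].
struct htp_tensor_meta {
    uint32_t div1[2];   // fast-div params for ne[1]
    uint32_t div2[2];   // fast-div params for ne[2]
    uint32_t div3[2];   // fast-div params for ne[3]
    uint32_t div21[2];  // fast-div params for ne[2] * ne[1]
};

// Host side: fill the metadata once when the (cached) graph is built, so the DSP
// kernels only read precomputed values and never derive them per dispatch.
static void htp_tensor_meta_init(htp_tensor_meta * m, const int64_t ne[4]) {
    init_fastdiv_sketch((uint32_t) ne[1], &m->div1[0], &m->div1[1]);
    init_fastdiv_sketch((uint32_t) ne[2], &m->div2[0], &m->div2[1]);
    init_fastdiv_sketch((uint32_t) ne[3], &m->div3[0], &m->div3[1]);
    init_fastdiv_sketch((uint32_t) (ne[2] * ne[1]), &m->div21[0], &m->div21[1]);
}
```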
h->div21 = init_fastdiv_values(h->ne[2] * h->ne[1]);
h->div3 = init_fastdiv_values(h->ne[3]);
h->div2 = init_fastdiv_values(h->ne[2]);
h->div1 = init_fastdiv_values(h->ne[1]);
We’re computing the fast-div parameters in init_htp_tensor, which runs for every DSP-dispatched op—even though many ops don’t need broadcasting.
Could we initialize them in ggml_backend_hexagon_buffer_init_tensor and store them per tensor to avoid recomputation? The catch is we don’t currently have a per-tensor custom context, so we’d need to add one...
{
    auto * ctx = static_cast<ggml_backend_hexagon_buffer_context *>(t->buffer->context);
    const auto &tensor_ctx = ctx->get_tensor_ctx(t);
lazy init the ggml_backend_hexagon_buffer_context.
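One way the lazy per-tensor lookup could be structured, assuming the buffer context keeps a map keyed by the tensor pointer that is filled on first access; the struct and method names mirror the snippet above, but the implementation below is a guess, not the PR's code.

```cpp
#include <cstdint>
#include <unordered_map>

struct ggml_tensor;  // opaque here; the real code has the full definition from ggml.h

// Hypothetical per-tensor metadata, e.g. precomputed fast-div parameters per dimension.
struct hexagon_tensor_ctx {
    uint32_t div1[2]  = {};
    uint32_t div2[2]  = {};
    uint32_t div3[2]  = {};
    uint32_t div21[2] = {};
};

struct ggml_backend_hexagon_buffer_context_sketch {
    // Entries are created on first lookup, so tensors that never reach a
    // broadcasting op never pay for the fast-div precompute.
    const hexagon_tensor_ctx & get_tensor_ctx(const ggml_tensor * t) {
        auto it = tensor_ctxs.find(t);
        if (it == tensor_ctxs.end()) {
            hexagon_tensor_ctx ctx;
            // the real code would fill ctx from t->ne[*] here (see init_fastdiv_values above)
            it = tensor_ctxs.emplace(t, ctx).first;  // lazy init on first access
        }
        return it->second;
    }

  private:
    std::unordered_map<const ggml_tensor *, hexagon_tensor_ctx> tensor_ctxs;
};
```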
Summary
Fixes test-backend-ops failures in ggml-hexagon by correcting the index calculation for binary operations.

Changes
ggml/src/ggml-cpu/ops.cpp

Testing
Before
After