Conversation

@alexowens90
Collaborator

Reference Issues/PRs

18254756429

What does this implement or fix?

Poor-quality hash implementations for integral types, including at least some implementations of std::hash, are essentially a static cast, e.g. std::hash<int64_t>{}(100) == 100. This is fast, but leads to poor distributions in our bucketing, where we mod the hash by the number of buckets. In particular, performing a grouping hash on a timeseries whose time points are dates results in all of the rows being partitioned into bucket zero, which in turn means no parallelism in the aggregation clause.
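
As an illustration, here is a minimal standalone sketch (not ArcticDB code) of how an identity-style hash collapses daily timestamps into a single bucket; the bucket count of 16 is an arbitrary choice:

#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    // Daily timestamps in nanoseconds are all multiples of this stride
    constexpr int64_t ns_per_day = 86'400'000'000'000;
    constexpr size_t num_buckets = 16; // arbitrary bucket count for the sketch

    std::vector<size_t> counts(num_buckets, 0);
    for (int64_t day = 0; day < 1000; ++day) {
        const int64_t ts = day * ns_per_day;
        // With an identity-style std::hash this is just ts % num_buckets,
        // which is 0 for every row because num_buckets divides the stride
        const size_t bucket = std::hash<int64_t>{}(ts) % num_buckets;
        ++counts[bucket];
    }
    for (size_t b = 0; b < num_buckets; ++b) {
        std::cout << "bucket " << b << ": " << counts[b] << '\n';
    }
}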

Swap to a hash function with improved uniformity that is consistent across all supported platforms.

@alexowens90 alexowens90 self-assigned this Oct 23, 2025
@alexowens90 alexowens90 added the patch (Small change, should increase patch version) and performance labels Oct 23, 2025
Comment on lines 86 to 88
for (auto idx = 0; idx < num_rows; ++idx) {
col->set_scalar<int64_t>(idx, idx * 1'000'000'000);
}
Collaborator

Memory allocation grows linearly here until the process is SIGKILLed on OOM.

Most of the time seems to be spent here:
[image]

Collaborator

@jjerphan jjerphan Nov 20, 2025

The hard-coded constant influences memory allocation; if we decrease it (as shown below), for instance, the test passes.

diff --git i/cpp/arcticdb/processing/test/test_clause.cpp w/cpp/arcticdb/processing/test/test_clause.cpp
index 3bbe66a00..f9f5add0a 100644
--- i/cpp/arcticdb/processing/test/test_clause.cpp
+++ w/cpp/arcticdb/processing/test/test_clause.cpp
@@ -84,7 +84,7 @@ TEST(Clause, PartitionHashQuality) {
             make_scalar_type(DataType::INT64), 0, AllocationType::DYNAMIC, Sparsity::NOT_PERMITTED
     );
     for (auto idx = 0; idx < num_rows; ++idx) {
-        col->set_scalar<int64_t>(idx, idx * 1'000'000'000);
+        col->set_scalar<int64_t>(idx, idx * 100'000);
     }
     seg.add_column("grouping", col);
     seg.set_row_id(num_rows - 1);

Collaborator

I can reproduce with gcc 11.2 from conda-forge (also used for the PyPI builds).

Collaborator Author

Yeah, I just came to the same conclusion. The loop isn't exiting when idx hits 100.
auto resolves to int, which is equivalent to int32_t here, and changing the auto to int64_t fixes the issue. The multiplication idx * 1'000'000'000 overflowing the maximum representable value of int32_t is somehow breaking the loop, which feels like a compiler bug, since both arguments to set_scalar are passed by value.
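
For reference, a minimal sketch of the fix described above, using the loop from the test: a 64-bit index keeps idx * 1'000'000'000 in range. One likely explanation for the looping behaviour is that the int32_t overflow is undefined behaviour, which permits the optimiser to assume the loop condition can never become false.

// 64-bit loop index so idx * 1'000'000'000 never overflows
for (int64_t idx = 0; idx < num_rows; ++idx) {
    col->set_scalar<int64_t>(idx, idx * 1'000'000'000);
}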

Collaborator

Yes, time to refresh an old but gold quiz.

@alexowens90 alexowens90 force-pushed the perf/18254756429/improve-hash-grouping-aggregation-parallelism branch from e018f74 to f65cb5d Compare November 20, 2025 17:51