Conversation

@alexowens90
Collaborator

Reference Issues/PRs

18254756429

What does this implement or fix?

Poor-quality hash implementations for integral types, including at least some implementations of std::hash, are essentially a static cast, e.g. std::hash<int64_t>{}(100) == 100. This is fast, but leads to poor distributions in our bucketing, where we mod the hash by the number of buckets. In particular, performing a grouping hash on a timeseries whose time points are dates results in all of the rows being partitioned into bucket zero, which in turn means no parallelism in the aggregation clause.
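
As an illustration, here is a minimal standalone sketch (not ArcticDB code) of how an identity-style hash collapses daily timestamps into a single bucket; the bucket count of 16 is an arbitrary choice:

#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    // Daily timestamps in nanoseconds are all multiples of this stride
    constexpr int64_t ns_per_day = 86'400'000'000'000;
    constexpr size_t num_buckets = 16; // arbitrary bucket count for the sketch

    std::vector<size_t> counts(num_buckets, 0);
    for (int64_t day = 0; day < 1000; ++day) {
        const int64_t ts = day * ns_per_day;
        // With an identity-style std::hash this is just ts % num_buckets,
        // which is 0 for every row because num_buckets divides the stride
        const size_t bucket = std::hash<int64_t>{}(ts) % num_buckets;
        ++counts[bucket];
    }
    for (size_t b = 0; b < num_buckets; ++b) {
        std::cout << "bucket " << b << ": " << counts[b] << '\n';
    }
}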

Swap to a hash function with improved uniformity that is consistent across all supported platforms.

@alexowens90 alexowens90 self-assigned this Oct 23, 2025
@alexowens90 alexowens90 added the patch (Small change, should increase patch version) and performance labels Oct 23, 2025
Comment on lines 86 to 88
for (auto idx = 0; idx < num_rows; ++idx) {
col->set_scalar<int64_t>(idx, idx * 1'000'000'000);
}
Collaborator

Memory allocation grows linearly here until the process is SIGKILLed on OOM.

Most of the time seems to be spent here:
[image]

Collaborator

@jjerphan jjerphan Nov 20, 2025

The hard-coded constant influences memory allocation; if we decrease it (as shown below), for instance, the test passes.

diff --git i/cpp/arcticdb/processing/test/test_clause.cpp w/cpp/arcticdb/processing/test/test_clause.cpp
index 3bbe66a00..f9f5add0a 100644
--- i/cpp/arcticdb/processing/test/test_clause.cpp
+++ w/cpp/arcticdb/processing/test/test_clause.cpp
@@ -84,7 +84,7 @@ TEST(Clause, PartitionHashQuality) {
             make_scalar_type(DataType::INT64), 0, AllocationType::DYNAMIC, Sparsity::NOT_PERMITTED
     );
     for (auto idx = 0; idx < num_rows; ++idx) {
-        col->set_scalar<int64_t>(idx, idx * 1'000'000'000);
+        col->set_scalar<int64_t>(idx, idx * 100'000);
     }
     seg.add_column("grouping", col);
     seg.set_row_id(num_rows - 1);

Collaborator

I can reproduce with gcc 11.2 from conda-forge (also used for the PyPI builds).

Collaborator Author

Yeah, I just came to the same conclusion. The loop isn't exiting when idx hits 100.
auto resolves to int, which is equivalent to int32_t here, and changing the auto to int64_t fixes the issue. The multiplication idx * 1'000'000'000 overflowing the maximum representable value of int32_t is somehow breaking the loop, which feels like a compiler bug, since both arguments to set_scalar are passed by value.
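
For reference, a minimal sketch of the fix described above, using the loop from the test: a 64-bit index keeps idx * 1'000'000'000 in range. One likely explanation for the looping behaviour is that the int32_t overflow is undefined behaviour, which permits the optimiser to assume the loop condition can never become false.

// 64-bit loop index so idx * 1'000'000'000 never overflows
for (int64_t idx = 0; idx < num_rows; ++idx) {
    col->set_scalar<int64_t>(idx, idx * 1'000'000'000);
}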

Collaborator

Yes, time to refresh an old but gold quiz.

@alexowens90 alexowens90 force-pushed the perf/18254756429/improve-hash-grouping-aggregation-parallelism branch from e018f74 to f65cb5d Compare November 20, 2025 17:51