@cboumalh cboumalh commented Nov 5, 2025

What changes were proposed in this pull request?

This PR adds support for Tuple sketches to Spark SQL, based on the Apache DataSketches Tuple library. Tuple sketches extend Theta sketches by associating a summary value with each unique key, so a single sketch can estimate distinct counts and aggregate per-key metadata at the same time. They are compact, probabilistic data structures with bounded memory usage and strong accuracy guarantees. The PR introduces 11 new SQL functions: 3 aggregates and 8 scalar functions.

Jira: https://issues.apache.org/jira/browse/SPARK-54179

Aggregate Functions

tuple_sketch_agg(struct(key, summary), lgNomEntries, summaryType, mode) - Creates a tuple sketch from key-summary pairs. Parameters:

  • key: The key field for distinct counting
    • Supports: INT, LONG, FLOAT, DOUBLE, STRING, BINARY, ARRAY[INT], ARRAY[LONG]
  • summary: The associated value
    • Types: DOUBLE, INT, or ARRAY[STRING]
  • lgNomEntries (optional, default = 12):
    • Log-base-2 of nominal entries (4–26)
  • summaryType (optional, default = 'double'):
    • Type of summary ('double', 'integer', 'string')
  • mode (optional, default = 'sum'):
    • Aggregation mode ('sum', 'min', 'max', 'alwaysone')
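
To make the parameters above concrete, here is a minimal toy model of a tuple sketch in Python. It is a sketch under stated assumptions, not the DataSketches or Spark implementation: the class name `ToyTupleSketch`, the SHA-256 hashing, the eviction rule, and the exact behavior of 'alwaysone' (every summary pinned to 1) are all hypothetical choices made for illustration. It shows how lgNomEntries bounds the number of retained entries, how the mode decides what happens when a key repeats, and why NULL inputs do not affect the result.

```python
import hashlib
from functools import reduce

# Combine rules for the four modes named in the PR description.
# The behavior of 'alwaysone' (summary pinned to 1) is an assumption.
MODES = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "alwaysone": lambda a, b: 1,
}


class ToyTupleSketch:
    """Hypothetical teaching model; not the DataSketches or Spark code."""

    def __init__(self, lg_nom_entries=12, mode="sum"):
        self.k = 1 << lg_nom_entries  # nominal entries = 2^lgNomEntries
        self.mode = mode
        self.combine = MODES[mode]
        self.theta = 1.0              # sampling threshold; 1.0 = nothing dropped
        self.entries = {}             # key hash in [0, 1) -> summary

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        # Keep 53 bits so the quotient is an exact float strictly below 1.0.
        return (int.from_bytes(digest[:8], "big") >> 11) / float(1 << 53)

    def update(self, key, summary):
        if key is None or summary is None:
            return                    # NULL inputs are ignored
        h = self._hash(key)
        if h >= self.theta:
            return                    # outside the retained sample
        if self.mode == "alwaysone":
            summary = 1
        if h in self.entries:
            # Repeated key: merge the summaries according to the mode.
            self.entries[h] = self.combine(self.entries[h], summary)
            return
        self.entries[h] = summary
        if len(self.entries) > self.k:
            # Over capacity: lower theta to the largest hash and evict it.
            self.theta = max(self.entries)
            del self.entries[self.theta]

    def estimate(self):
        # Retained count scaled by the sampled fraction of the hash space.
        return len(self.entries) / self.theta

    def summary(self):
        # Fold the retained summaries with the sketch's mode.
        if not self.entries:
            return None
        return reduce(self.combine, self.entries.values())
```

While fewer than 2^lgNomEntries distinct keys have been seen, theta stays at 1.0 and the estimate is exact; beyond that the toy keeps only the smallest hashes and scales the count up. Note that `summary()` here folds only the retained entries and does not rescale them.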

tuple_union_agg(sketch, lgNomEntries, summaryType, mode) - Unions multiple tuple sketch binary representations
tuple_intersection_agg(sketch, summaryType, mode) - Intersects multiple tuple sketch binary representations

Sketch Inspection Functions

tuple_sketch_estimate(sketch, summaryType) - Returns the estimated number of unique keys in the sketch
tuple_sketch_summary(sketch, summaryType, mode) - Aggregates all summary values from the sketch according to the specified mode

Set Operation Functions

tuple_union(sketch1, sketch2, lgNomEntries, summaryType, mode) - Unions two tuple sketches
tuple_union_theta(tupleSketch, thetaSketch, lgNomEntries, summaryType, mode) - Unions a tuple sketch with a theta sketch (theta entries get default summary values)
tuple_intersection(sketch1, sketch2, summaryType, mode) - Intersects two tuple sketches
tuple_intersection_theta(tupleSketch, thetaSketch, summaryType, mode) - Intersects a tuple sketch with a theta sketch
tuple_difference(sketch1, sketch2, summaryType) - Computes A-NOT-B (elements in sketch1 but not in sketch2)
tuple_difference_theta(tupleSketch, thetaSketch, summaryType) - Computes A-NOT-B where B is a theta sketch
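
The set-operation semantics can be illustrated on exact key-to-summary dictionaries. The Python below is a simplified model, not the PR's code: it ignores sampling entirely, the `toy_*` function names are hypothetical, and `default_summary` stands in for whatever default value the real functions assign to theta entries, which the description does not specify.

```python
from operator import add


def toy_union(a, b, combine):
    """Keys from either side; summaries merged where a key is in both."""
    out = dict(a)
    for key, summary in b.items():
        out[key] = combine(out[key], summary) if key in out else summary
    return out


def toy_intersection(a, b, combine):
    """Only keys present on both sides, with their summaries merged."""
    return {key: combine(a[key], b[key]) for key in a.keys() & b.keys()}


def toy_difference(a, b_keys):
    """A-NOT-B: entries of `a` whose key is absent from `b_keys`.

    `b_keys` may be a dict (tuple sketch) or a set (theta sketch), since
    only key membership matters for the difference.
    """
    return {key: summary for key, summary in a.items() if key not in b_keys}


def toy_union_theta(a, theta_keys, combine, default_summary):
    """Union with a key-only sketch; theta keys get `default_summary`.

    The actual default used by the PR is not stated in the description,
    so it is a parameter here.
    """
    return toy_union(a, {key: default_summary for key in theta_keys}, combine)


a = {"u1": 1.0, "u2": 2.0}
b = {"u2": 5.0, "u3": 7.0}
merged = toy_union(a, b, add)          # u2 appears on both sides, so 2.0 + 5.0
common = toy_intersection(a, b, add)   # only u2 survives
only_a = toy_difference(a, b)          # only u1 survives
```

The difference functions take no mode argument in the PR's signatures, which matches this model: A-NOT-B never has to merge two summaries for the same key.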

Why are the changes needed?

Spark currently lacks support for tuple sketches, which enable approximate computations on key-value data. Tuple sketches provide:

  • O(k) space complexity - Bounded memory usage based on sketch size parameter, not data size
  • High accuracy - Configurable error bounds with proven theoretical guarantees
  • Fast queries - Efficient cardinality and summary estimation
  • Mergeable - Sketches can be combined for distributed aggregation across partitions
  • Multi-dimensional analysis - Track both distinct counts and associated metadata in one structure

Does this PR introduce any user-facing change?

Yes, it introduces 11 new SQL functions.

How was this patch tested?

SQL Golden File Tests: Added tuplesketches.sql with test queries covering:

  • All three summary types (double, integer, string)
  • Multiple key types (INT, LONG, FLOAT, DOUBLE, STRING, BINARY, arrays)
  • All aggregation modes (sum, min, max, alwaysone)
  • NULL value handling (verified NULLs are ignored)
  • Sketch aggregation, union, and intersection operations
  • Set operations (union, intersection, difference) including tuple-theta interoperability
  • Sketch size configuration (lgNomEntries parameter)
  • Approximate result validation using tolerance-based comparisons
  • Negative tests for error conditions (invalid parameters, type mismatches, invalid binary data, incompatible summary types)
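
As an illustration of what a tolerance-based comparison can look like, the Python helper below accepts an estimate if it falls within a few standard errors of the true count. This is only a sketch: the helper names are hypothetical, the relative standard error of roughly 1/sqrt(k - 1) is the commonly quoted approximation for theta-style sketches with k nominal entries, and the golden-file tests in this PR may well use different tolerances.

```python
import math


def relative_standard_error(lg_nom_entries=12):
    """Approximate RSE for a sketch with k = 2^lgNomEntries nominal entries."""
    k = 1 << lg_nom_entries
    return 1.0 / math.sqrt(k - 1)


def within_tolerance(estimate, truth, lg_nom_entries=12, num_std_devs=3):
    """True if `estimate` is within `num_std_devs` standard errors of `truth`."""
    allowed = num_std_devs * relative_standard_error(lg_nom_entries) * truth
    return abs(estimate - truth) <= allowed
```

With the default lgNomEntries of 12 the approximate RSE is about 1.6%, so a three-sigma check on a true count of 10,000 accepts estimates within roughly 470 of the truth.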

Was this patch authored or co-authored using generative AI tooling?

Yes, claude-sonnet-4.5 was used, mainly for testing.

@cboumalh cboumalh marked this pull request as draft November 5, 2025 00:42
@github-actions github-actions bot added the SQL label Nov 5, 2025
cboumalh commented Nov 5, 2025

cc @dtenedor @mkaravel @gengliangwang (still WIP)

@cboumalh cboumalh force-pushed the cboumalh-tuple-sketches branch from 3d4c6a2 to de5cbee Compare November 11, 2025 17:43
@cboumalh cboumalh changed the title [WIP][SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches [SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches Nov 12, 2025
@cboumalh cboumalh marked this pull request as ready for review November 12, 2025 15:32
@cboumalh

@dtenedor @gengliangwang @cloud-fan, this PR is ready for review if any of you has time to take a look! Please let me know if anything is missing before review.

@holdenk holdenk left a comment

This looks interesting. A quick comment on the moved expressions, which made that part of the review a bit tricky. Broad question: why not codegen (or is there a follow-up for codegen)? I'll leave the rest of the review to folks working on aggregates :)

Comment on lines 541 to -556
expression[KllSketchAggDouble]("kll_sketch_agg_double"),
expression[KllSketchToStringBigint]("kll_sketch_to_string_bigint"),
expression[KllSketchToStringFloat]("kll_sketch_to_string_float"),
expression[KllSketchToStringDouble]("kll_sketch_to_string_double"),
expression[KllSketchGetNBigint]("kll_sketch_get_n_bigint"),
expression[KllSketchGetNFloat]("kll_sketch_get_n_float"),
expression[KllSketchGetNDouble]("kll_sketch_get_n_double"),
expression[KllSketchMergeBigint]("kll_sketch_merge_bigint"),
expression[KllSketchMergeFloat]("kll_sketch_merge_float"),
expression[KllSketchMergeDouble]("kll_sketch_merge_double"),
expression[KllSketchGetQuantileBigint]("kll_sketch_get_quantile_bigint"),
expression[KllSketchGetQuantileFloat]("kll_sketch_get_quantile_float"),
expression[KllSketchGetQuantileDouble]("kll_sketch_get_quantile_double"),
expression[KllSketchGetRankBigint]("kll_sketch_get_rank_bigint"),
expression[KllSketchGetRankFloat]("kll_sketch_get_rank_float"),
expression[KllSketchGetRankDouble]("kll_sketch_get_rank_double"),

Why were these moved? It makes it challenging to verify that no APIs were removed.

@cboumalh (Contributor Author)

The function registry file tries to group the expressions by type. The expressions I moved are not aggregate functions (but were grouped with the aggregates), so I moved them to the scalar sketch function section. I forgot to leave a comment about this when I reviewed this PR.

cboumalh commented Nov 15, 2025

Hi @holdenk, thanks for the comment! I'm not sure whether codegen works for the TypedImperativeAggregate class. It can probably be done for the scalar expressions, either in a follow-up or even in this PR if the ROI is high enough.
