[SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches #52883

cboumalh · 2025-11-05T00:42:45Z

What changes were proposed in this pull request?

This PR adds support for Tuple sketches to Spark SQL, based on the Apache DataSketches Tuple library. Tuple sketches extend Theta sketches by associating a summary value with each unique key, so a single sketch can estimate the number of distinct keys and also aggregate metadata about those keys. They are compact, probabilistic data structures with bounded memory usage and strong accuracy guarantees. The PR introduces 11 new SQL functions: 3 aggregates and 8 scalar functions.

Jira: https://issues.apache.org/jira/browse/SPARK-54179

tuple_sketch_agg(struct(key, summary), lgNomEntries, summaryType, mode) - Creates a tuple sketch from key-summary pairs. Parameters:

key: The key field for distinct counting
- Supports: INT, LONG, FLOAT, DOUBLE, STRING, BINARY, ARRAY[INT], ARRAY[LONG]
summary: The associated value
- Types: DOUBLE, INT, or ARRAY[STRING]
lgNomEntries (optional, default = 12):
- Log-base-2 of nominal entries (4–26)
summaryType (optional, default = 'double'):
- Type of summary ('double', 'integer', 'string')
mode (optional, default = 'sum'):
- Aggregation mode ('sum', 'min', 'max', 'alwaysone')

tuple_union_agg(sketch, lgNomEntries, summaryType, mode) - Unions multiple tuple sketch binary representations
tuple_intersection_agg(sketch, summaryType, mode) - Intersects multiple tuple sketch binary representations
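
To make the parameters above concrete, here is a toy Python model of a tuple sketch. It is a simplified KMV-style sketch (keep the k smallest key hashes, each with a summary), not the DataSketches algorithm and not the Spark implementation in this PR. `ToyTupleSketch`, `hash01`, and `MODES` are hypothetical names, and skipping `None` inputs is an assumption modelled on the NULL handling this PR reports testing.

```python
import hashlib

# How a new summary value is folded into the one already stored for a key.
MODES = {
    "sum": lambda old, new: old + new,
    "min": min,
    "max": max,
    "alwaysone": lambda old, new: 1,
}


def hash01(key):
    """Map a key to a deterministic pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyTupleSketch:
    def __init__(self, lg_nom_entries=12, mode="sum"):
        self.k = 1 << lg_nom_entries   # nominal entries = 2^lgNomEntries
        self.mode = mode
        self.theta = 1.0               # only hashes below theta are retained
        self.entries = {}              # key hash -> summary

    def update(self, key, summary):
        if key is None or summary is None:
            return                     # assumption: NULL inputs are ignored
        h = hash01(key)
        if h >= self.theta:
            return                     # key falls outside the retained sample
        if h in self.entries:
            self.entries[h] = MODES[self.mode](self.entries[h], summary)
        else:
            self.entries[h] = 1 if self.mode == "alwaysone" else summary
        if len(self.entries) > self.k:
            # Evict the largest hash and shrink theta to it.
            self.theta = max(self.entries)
            del self.entries[self.theta]

    def estimate(self):
        """Estimated number of distinct keys seen."""
        return len(self.entries) / self.theta


small = ToyTupleSketch(lg_nom_entries=4, mode="sum")
for key, value in [("a", 1.0), ("b", 2.0), ("a", 3.0), (None, 9.0)]:
    small.update(key, value)
print(small.estimate())                # 2.0 (exact while fewer than k keys are seen)
print(sorted(small.entries.values()))  # [2.0, 4.0]
```

In this model `lg_nom_entries` only caps how many entries are retained, while `mode` decides what the summary of a repeated key becomes.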

Sketch Inspection Functions

tuple_sketch_estimate(sketch, summaryType) - Returns the estimated number of unique keys in the sketch
tuple_sketch_summary(sketch, summaryType, mode) - Aggregates all summary values from the sketch according to the specified mode

Set Operation Functions

tuple_union(sketch1, sketch2, lgNomEntries, summaryType, mode) - Unions two tuple sketches
tuple_union_theta(tupleSketch, thetaSketch, lgNomEntries, summaryType, mode) - Unions a tuple sketch with a theta sketch (theta entries get default summary values)
tuple_intersection(sketch1, sketch2, summaryType, mode) - Intersects two tuple sketches
tuple_intersection_theta(tupleSketch, thetaSketch, summaryType, mode) - Intersects a tuple sketch with a theta sketch
tuple_difference(sketch1, sketch2, summaryType) - Computes A-NOT-B (elements in sketch1 but not in sketch2)
tuple_difference_theta(tupleSketch, thetaSketch, summaryType) - Computes A-NOT-B where B is a theta sketch
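
The set operations listed above can be sketched on a toy representation in which a sketch is a threshold `theta` plus a map from key hash to summary. This is an illustrative model only: `ToySketch`, `combine`, `from_theta`, and the `default_summary` argument are hypothetical, and the hash values are hand-picked rather than computed. Note that `difference` takes no mode, which matches the `tuple_difference` signature above: summaries from the second sketch are never combined.

```python
from dataclasses import dataclass


@dataclass
class ToySketch:
    theta: float    # only hashes below theta are part of the sample
    entries: dict   # key hash -> summary


def combine(mode, a, b):
    """Merge two summaries for the same key according to the mode."""
    if mode == "sum":
        return a + b
    if mode == "min":
        return min(a, b)
    if mode == "max":
        return max(a, b)
    if mode == "alwaysone":
        return 1
    raise ValueError(f"unknown mode: {mode}")


def union(a, b, mode="sum"):
    theta = min(a.theta, b.theta)
    out = {}
    for source in (a.entries, b.entries):
        for h, s in source.items():
            if h < theta:
                out[h] = combine(mode, out[h], s) if h in out else s
    return ToySketch(theta, out)


def intersection(a, b, mode="sum"):
    theta = min(a.theta, b.theta)
    out = {h: combine(mode, s, b.entries[h])
           for h, s in a.entries.items()
           if h < theta and h in b.entries}
    return ToySketch(theta, out)


def difference(a, b):
    """A-NOT-B: keys of a that are absent from b; summaries come from a."""
    theta = min(a.theta, b.theta)
    out = {h: s for h, s in a.entries.items()
           if h < theta and h not in b.entries}
    return ToySketch(theta, out)


def from_theta(theta, hashes, default_summary):
    """Lift a theta sketch (hashes only) by giving every entry a default summary."""
    return ToySketch(theta, {h: default_summary for h in hashes})


a = ToySketch(1.0, {0.1: 5.0, 0.2: 1.0, 0.7: 2.0})
b = ToySketch(0.5, {0.2: 4.0, 0.3: 3.0})
print(union(a, b, "sum").entries)         # {0.1: 5.0, 0.2: 5.0, 0.3: 3.0}
print(intersection(a, b, "min").entries)  # {0.2: 1.0}
print(difference(a, b).entries)           # {0.1: 5.0}
```

The entry with hash 0.7 disappears from every result because the combined threshold is the smaller of the two inputs (0.5 here).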

Why are the changes needed?

Spark currently lacks support for tuple sketches, which enable approximate computations on key-value data. Tuple sketches provide:

O(k) space complexity - Bounded memory usage based on sketch size parameter, not data size
High accuracy - Configurable error bounds with proven theoretical guarantees
Fast queries - Efficient cardinality and summary estimation
Mergeable - Sketches can be combined for distributed aggregation across partitions
Multi-dimensional analysis - Track both distinct counts and associated metadata in one structure
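
The "bounded memory" and "mergeable" properties can be illustrated with a small Python sketch. It uses a simplified KMV-style scheme with sum-mode summaries, which is an assumption made for illustration and not the algorithm used by DataSketches or by this PR; `hash01`, `trim`, `build`, and `merge` are hypothetical helpers.

```python
import hashlib


def hash01(key):
    """Map a key to a deterministic pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def trim(theta, entries, k):
    """Evict the largest hashes until at most k remain, lowering theta."""
    while len(entries) > k:
        theta = max(entries)
        del entries[theta]
    return theta, entries


def build(pairs, k):
    """Stream (key, value) pairs into a sketch holding at most k entries."""
    theta, entries = 1.0, {}
    for key, value in pairs:
        h = hash01(key)
        if h < theta:
            entries[h] = entries.get(h, 0) + value
            theta, entries = trim(theta, entries, k)
    return theta, entries


def merge(sketches, k):
    """Combine per-partition sketches into one sketch of the same size."""
    theta = min(t for t, _ in sketches)
    entries = {}
    for _, part in sketches:
        for h, value in part.items():
            if h < theta:
                entries[h] = entries.get(h, 0) + value
    return trim(theta, entries, k)


pairs = [(f"user{i % 500}", 1) for i in range(2000)]
whole = build(pairs, k=64)
merged = merge([build(pairs[:1000], k=64), build(pairs[1000:], k=64)], k=64)
print(merged == whole)   # True
print(len(whole[1]))     # 64
```

In this toy scheme, merging two partition sketches gives exactly the sketch a single pass over all the data would have produced, and no sketch ever holds more than `k` entries regardless of input size.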

Does this PR introduce any user-facing change?

Yes, it introduces 11 new SQL functions.

How was this patch tested?

SQL Golden File Tests: Added tuplesketches.sql with test queries covering:

All three summary types (double, integer, string)
Multiple key types (INT, LONG, FLOAT, DOUBLE, STRING, BINARY, arrays)
All aggregation modes (sum, min, max, alwaysone)
NULL value handling (verified NULLs are ignored)
Sketch aggregation, union, and intersection operations
Set operations (union, intersection, difference) including tuple-theta interoperability
Sketch size configuration (lgNomEntries parameter)
Approximate result validation using tolerance-based comparisons
Negative tests for error conditions (invalid parameters, type mismatches, invalid binary data, incompatible summary types)
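
As an illustration of what a tolerance-based comparison can look like, the Python sketch below accepts an estimate if it lies within a few standard errors of the expected value. It assumes the common rule of thumb that a theta-style sketch with k = 2^lgNomEntries nominal entries has a relative standard error of roughly 1/sqrt(k). The helper names and the three-standard-deviation default are hypothetical and are not taken from the PR's golden files.

```python
import math


def relative_standard_error(lg_nom_entries):
    """Rule-of-thumb relative standard error for k = 2^lg_nom_entries."""
    return 1.0 / math.sqrt(2 ** lg_nom_entries)


def within_tolerance(estimate, expected, lg_nom_entries=12, num_std_dev=3):
    """True if the estimate is within num_std_dev standard errors of expected."""
    allowed = num_std_dev * relative_standard_error(lg_nom_entries) * expected
    return abs(estimate - expected) <= allowed


print(relative_standard_error(12))     # 0.015625
print(within_tolerance(1010, 1000))    # True
print(within_tolerance(1100, 1000))    # False
```

Comparing against a tolerance rather than an exact value keeps such tests stable even though the estimates are approximate by design.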

Was this patch authored or co-authored using generative AI tooling?

Yes, claude-sonnet-4.5 was used, mainly for testing.

cboumalh · 2025-11-05T00:43:24Z

cc @dtenedor @mkaravel @gengliangwang (still WIP)

cboumalh · 2025-11-12T15:51:36Z

@dtenedor @gengliangwang @cloud-fan, this PR is ready for review if any of you has time to take a look! Please let me know if anything is missing before review.

holdenk

This looks interesting. Quick comment on the moved expressions, which made that part of the review a bit tricky. Broad question: why not codegen (or is there a follow-up for codegen)? I'll leave the rest of the review to folks working on aggregates :)

holdenk · 2025-11-14T23:42:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

    expression[KllSketchAggDouble]("kll_sketch_agg_double"),
-    expression[KllSketchToStringBigint]("kll_sketch_to_string_bigint"),
-    expression[KllSketchToStringFloat]("kll_sketch_to_string_float"),
-    expression[KllSketchToStringDouble]("kll_sketch_to_string_double"),
-    expression[KllSketchGetNBigint]("kll_sketch_get_n_bigint"),
-    expression[KllSketchGetNFloat]("kll_sketch_get_n_float"),
-    expression[KllSketchGetNDouble]("kll_sketch_get_n_double"),
-    expression[KllSketchMergeBigint]("kll_sketch_merge_bigint"),
-    expression[KllSketchMergeFloat]("kll_sketch_merge_float"),
-    expression[KllSketchMergeDouble]("kll_sketch_merge_double"),
-    expression[KllSketchGetQuantileBigint]("kll_sketch_get_quantile_bigint"),
-    expression[KllSketchGetQuantileFloat]("kll_sketch_get_quantile_float"),
-    expression[KllSketchGetQuantileDouble]("kll_sketch_get_quantile_double"),
-    expression[KllSketchGetRankBigint]("kll_sketch_get_rank_bigint"),
-    expression[KllSketchGetRankFloat]("kll_sketch_get_rank_float"),
-    expression[KllSketchGetRankDouble]("kll_sketch_get_rank_double"),


Why were these moved? Makes it challenging to make sure no APIs were removed.

cboumalh · 2025-11-15

The function registry file tries to group expressions by type. The expressions I moved are not aggregate functions, although they were grouped with the aggregates, so I moved them to the scalar sketch function section. I forgot to comment on this when I reviewed this PR.

cboumalh · 2025-11-15T00:09:11Z

Hi @holdenk, thanks for the comment! I'm not sure whether codegen works for the TypedImperativeAggregate class. It can probably be done for the scalar expressions, either in a follow-up or in this PR if the ROI is high enough.

[WIP][SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches

47e1470

cboumalh marked this pull request as draft November 5, 2025 00:42

github-actions bot added the SQL label Nov 5, 2025

Chris Boumalhab added 4 commits November 5, 2025 02:59

format

d2d1300

aggregate functions

38e7c45

fix

c7067a0

added expressions

de5cbee

cboumalh force-pushed the cboumalh-tuple-sketches branch from 3d4c6a2 to de5cbee Compare November 11, 2025 17:43

Chris Boumalhab and others added 7 commits November 11, 2025 20:01

add expression schema

cf043a9

added sql

d233200

Merge branch 'apache:master' into cboumalh-tuple-sketches

3ef0250

small fix

8b614a3

comments

de71f24

fix

6036d56

fix

f3be380

cboumalh changed the title ~~[WIP][SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches~~ [SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches Nov 12, 2025

cboumalh marked this pull request as ready for review November 12, 2025 15:32

Chris Boumalhab added 4 commits November 12, 2025 16:17

comments fix

52a6a27

fix

5fd73f9

fix

8f68ade

fix

94f1d8f

holdenk reviewed Nov 14, 2025
