[SPARK-54179][SQL] Add Native Support for Apache Tuple Sketches #52883
base: master
cc @dtenedor @mkaravel @gengliangwang (still WIP)
Force-pushed: 3d4c6a2 to de5cbee
@dtenedor @gengliangwang @cloud-fan, this PR is ready for review if any of you has the time to take a look! Please let me know if anything is missing before review.
holdenk
left a comment
This looks interesting. Quick comment on the moved entries, which made the review there a bit tricky. Broad question: why not codegen (or is there a follow-up for codegen)? I'll leave the rest of the review to folks working on aggregates :)
expression[KllSketchAggDouble]("kll_sketch_agg_double"),
expression[KllSketchToStringBigint]("kll_sketch_to_string_bigint"),
expression[KllSketchToStringFloat]("kll_sketch_to_string_float"),
expression[KllSketchToStringDouble]("kll_sketch_to_string_double"),
expression[KllSketchGetNBigint]("kll_sketch_get_n_bigint"),
expression[KllSketchGetNFloat]("kll_sketch_get_n_float"),
expression[KllSketchGetNDouble]("kll_sketch_get_n_double"),
expression[KllSketchMergeBigint]("kll_sketch_merge_bigint"),
expression[KllSketchMergeFloat]("kll_sketch_merge_float"),
expression[KllSketchMergeDouble]("kll_sketch_merge_double"),
expression[KllSketchGetQuantileBigint]("kll_sketch_get_quantile_bigint"),
expression[KllSketchGetQuantileFloat]("kll_sketch_get_quantile_float"),
expression[KllSketchGetQuantileDouble]("kll_sketch_get_quantile_double"),
expression[KllSketchGetRankBigint]("kll_sketch_get_rank_bigint"),
expression[KllSketchGetRankFloat]("kll_sketch_get_rank_float"),
expression[KllSketchGetRankDouble]("kll_sketch_get_rank_double"),
Why were these moved? Makes it challenging to make sure no APIs were removed.
The function registry file tries to group the expressions by type. The expressions I moved are not aggregate functions (but were grouped with the aggregates), so I moved them to the scalar sketch function section. I forgot to leave a comment about this when I reviewed this PR.
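One way to address the "no APIs were removed" concern when entries are only reordered is to compare the set of registered function names before and after the change. The snippet below is a rough sketch, not part of this PR: it assumes registrations follow the `expression[Class]("name")` pattern shown in the diff above, and the `TupleSketchEstimate` class name in the sample text is hypothetical.

```python
import re

# Matches registry entries such as:
#   expression[KllSketchGetNBigint]("kll_sketch_get_n_bigint")
ENTRY = re.compile(r'expression\w*\[\w+\]\(\s*"([^"]+)"')


def registered_names(source: str) -> set:
    """Return the SQL function names registered in a registry-style source text."""
    return set(ENTRY.findall(source))


def diff_registrations(before: str, after: str):
    """Return (removed, added) function names between two versions of the file."""
    old, new = registered_names(before), registered_names(after)
    return old - new, new - old


before = '''
    expression[KllSketchAggDouble]("kll_sketch_agg_double"),
    expression[KllSketchGetNBigint]("kll_sketch_get_n_bigint"),
'''
# Same entries in a different order, plus one new entry
# (the class name TupleSketchEstimate is a hypothetical placeholder).
after = '''
    expression[KllSketchGetNBigint]("kll_sketch_get_n_bigint"),
    expression[KllSketchAggDouble]("kll_sketch_agg_double"),
    expression[TupleSketchEstimate]("tuple_sketch_estimate"),
'''

removed, added = diff_registrations(before, after)
print("removed:", sorted(removed))
print("added:", sorted(added))
```

An empty `removed` set means the move was purely a regrouping. Entries registered through other helper patterns would need their own regular expression.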
Hi @holdenk, thanks for the comment! I'm not sure if codegen works for the |
What changes were proposed in this pull request?
This PR adds support for Tuple sketches to Spark SQL, based on the Apache DataSketches Tuple library. Tuple sketches extend Theta sketches by associating a summary value with each unique key, so a single sketch can estimate the number of distinct keys and also carry aggregated per-key metadata. They are compact, probabilistic data structures with bounded memory usage and strong accuracy guarantees. The PR introduces 11 new SQL functions: 3 aggregates and 8 scalar functions.
Jira: https://issues.apache.org/jira/browse/SPARK-54179
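To make the idea in the description concrete, here is a toy model of a tuple sketch in Python. It is only an illustration under stated assumptions, not the DataSketches implementation and not the code in this PR: the class name `ToyTupleSketch`, the blake2b hash, the default of 12 for `lg_nom_entries`, and the way `summary()` folds values are all choices made for this sketch. Only the parameter range (4 to 26) and the mode names come from the PR description.

```python
import hashlib
from functools import reduce

# Summary-combining rules, named after the PR's `mode` values.
MODES = {
    "sum": lambda old, new: old + new,
    "min": min,
    "max": max,
    "alwaysone": lambda old, new: 1,
}
HASH_SPACE = 1 << 64  # hashes are integers in [0, 2^64)


class ToyTupleSketch:
    """Keeps the k smallest key hashes, each with a combined summary value."""

    def __init__(self, lg_nom_entries=12, mode="sum"):
        if not 4 <= lg_nom_entries <= 26:
            raise ValueError("lg_nom_entries must be in [4, 26]")
        self.k = 1 << lg_nom_entries
        self.mode = mode
        self.combine = MODES[mode]
        self.theta = HASH_SPACE  # sampling threshold; starts at "keep everything"
        self.entries = {}        # key hash -> summary

    @staticmethod
    def _hash(key):
        digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def update(self, key, summary):
        h = self._hash(key)
        if h >= self.theta:
            return  # key falls outside the retained sample
        if h in self.entries:
            self.entries[h] = self.combine(self.entries[h], summary)
        else:
            self.entries[h] = 1 if self.mode == "alwaysone" else summary
        if len(self.entries) > self.k:
            # Evict the largest hash and lower theta to it (O(k) scan: fine for a toy).
            largest = max(self.entries)
            del self.entries[largest]
            self.theta = largest

    def estimate(self):
        """Estimated distinct keys: retained count scaled by the sampling rate."""
        return len(self.entries) * HASH_SPACE / self.theta

    def summary(self):
        """Fold all retained summaries with the sketch's mode (None if empty)."""
        values = list(self.entries.values())
        return reduce(self.combine, values) if values else None


# Exact mode: fewer distinct keys than k, so nothing is sampled away.
sketch = ToyTupleSketch(lg_nom_entries=4, mode="sum")
for user, amount in [("alice", 2.0), ("bob", 1.0), ("alice", 3.0), ("carol", 4.0)]:
    sketch.update(user, amount)
print(sketch.estimate(), sketch.summary())

# Estimation mode: 10,000 distinct keys squeezed into 256 retained entries.
big = ToyTupleSketch(lg_nom_entries=8, mode="alwaysone")
for i in range(10_000):
    big.update(i, 1)
print(len(big.entries), round(big.estimate()))
```

The second sketch never holds more than 2^8 entries regardless of input size, which is the bounded-memory property the description refers to; a larger `lg_nom_entries` trades memory for a tighter estimate.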
tuple_sketch_agg(struct(key, summary), lgNomEntries, summaryType, mode) - Creates a tuple sketch from key-summary pairs. Parameters:
  key: INT, LONG, FLOAT, DOUBLE, STRING, BINARY, ARRAY[INT], ARRAY[LONG]
  summary: DOUBLE, INT, or ARRAY[STRING]
  lgNomEntries: 4–26
  summaryType: 'double', 'integer', 'string'
  mode: 'sum', 'min', 'max', 'alwaysone'
tuple_union_agg(sketch, lgNomEntries, summaryType, mode) - Unions multiple tuple sketch binary representations
tuple_intersection_agg(sketch, summaryType, mode) - Intersects multiple tuple sketch binary representations
Sketch Inspection Functions
tuple_sketch_estimate(sketch, summaryType) - Returns the estimated number of unique keys in the sketch
tuple_sketch_summary(sketch, summaryType, mode) - Aggregates all summary values from the sketch according to the specified mode
Set Operation Functions
tuple_union(sketch1, sketch2, lgNomEntries, summaryType, mode) - Unions two tuple sketches
tuple_union_theta(tupleSketch, thetaSketch, lgNomEntries, summaryType, mode) - Unions a tuple sketch with a theta sketch (theta entries get default summary values)
tuple_intersection(sketch1, sketch2, summaryType, mode) - Intersects two tuple sketches
tuple_intersection_theta(tupleSketch, thetaSketch, summaryType, mode) - Intersects a tuple sketch with a theta sketch
tuple_difference(sketch1, sketch2, summaryType) - Computes A-NOT-B (elements in sketch1 but not in sketch2)
tuple_difference_theta(tupleSketch, thetaSketch, summaryType) - Computes A-NOT-B where B is a theta sketch
Why are the changes needed?
Spark currently lacks support for tuple sketches, which enable approximate computations on key-value data. Tuple sketches provide:
Does this PR introduce any user-facing change?
Yes, it introduces 11 new SQL functions.
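For readers new to these functions, the set-operation semantics described in the PR (union, intersection, A-NOT-B, and mixing in a theta sketch) can be modelled on a simplified in-memory form. This is a hedged sketch, not the PR's code: `Sketch`, `from_theta`, and the `default_summary` argument are hypothetical names, hashes are written as plain fractions in [0, 1) for readability, and the union skips the trimming to `lgNomEntries` that a real implementation would do.

```python
from dataclasses import dataclass


@dataclass
class Sketch:
    theta: float   # sampling threshold in (0, 1]
    entries: dict  # key hash in [0, 1) -> summary value


def estimate(s):
    """Retained entry count scaled by the sampling rate."""
    return len(s.entries) / s.theta


def union(a, b, combine):
    """Keys in either input; summaries of shared keys are combined."""
    theta = min(a.theta, b.theta)
    merged = {}
    for s in (a, b):
        for h, v in s.entries.items():
            if h < theta:
                merged[h] = combine(merged[h], v) if h in merged else v
    return Sketch(theta, merged)


def intersection(a, b, combine):
    """Keys present in both inputs, with their summaries combined."""
    theta = min(a.theta, b.theta)
    both = {h: combine(v, b.entries[h])
            for h, v in a.entries.items() if h in b.entries and h < theta}
    return Sketch(theta, both)


def a_not_b(a, b):
    """Keys in `a` but not in `b`; summaries come from `a` only, so no mode is needed."""
    theta = min(a.theta, b.theta)
    only_a = {h: v for h, v in a.entries.items() if h not in b.entries and h < theta}
    return Sketch(theta, only_a)


def from_theta(theta, hashes, default_summary):
    """Lift a theta sketch (hashes only) by attaching an assumed default summary."""
    return Sketch(theta, {h: default_summary for h in hashes})


add = lambda x, y: x + y
a = Sketch(0.5, {0.1: 2.0, 0.2: 3.0, 0.4: 1.0})
b = Sketch(0.25, {0.1: 5.0, 0.15: 1.0})

u = union(a, b, add)
i = intersection(a, b, add)
d = a_not_b(a, b)
print(u.entries, estimate(u))
print(i.entries, estimate(i))
print(d.entries, estimate(d))

# Mixing in a theta sketch: its entries carry only the assumed default summary.
t = from_theta(0.25, [0.1, 0.2], default_summary=1.0)
print(intersection(a, t, add).entries)
```

Every result uses the smaller of the two thetas, which is why the entry at hash 0.4 in `a` drops out of all three results: it lies outside the range that both inputs sampled.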
How was this patch tested?
SQL Golden File Tests: Added tuplesketches.sql with test queries covering:
Was this patch authored or co-authored using generative AI tooling?
Yes, claude-sonnet-4.5 was used, mainly for testing.