-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-54199][SQL] Add DataFrame API support for new KLL quantiles sketch functions #52900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * | ||
| * @group agg_funcs | ||
| * @since 4.1.0 | ||
| * @since 4.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add something in the PR description about these versioning changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake on this version change, it looks like 4.1.0 is the correct version after all. Reverted that.
| kll_sketch_get_quantile_bigint.__doc__ = pysparkfuncs.kll_sketch_get_quantile_bigint.__doc__ | ||
|
|
||
|
|
||
| def kll_sketch_get_quantile_float(sketch: "ColumnOrName", rank: "ColumnOrName") -> Column: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noticed that in the line below, we don't check if the ranks column type is not supported, not sure what would happen in that case. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/kllExpressions.scala#L488
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, I think the class mixing in ImplicitCastInputTypes should make sure that both arguments have either the same types or are coercible to the required types. I'll add a couple negative tests just to make sure.
abstract class KllSketchGetQuantileBase
extends BinaryExpression
with CodegenFallback
with ImplicitCastInputTypes {
...
override def inputTypes: Seq[AbstractDataType] =
Seq(
BinaryType,
TypeCollection(
DoubleType,
ArrayType(DoubleType, containsNull = false)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah then the checkInputDataTypes function probably does the check when the expression is resolved. I missed that.
cboumalh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just have the comment above, but all looks good. Thank you!
|
Thanks for your review! |
|
Hi, @dtenedor . I don't see any committers' approval yet.
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @cloud-fan . It would be great if someone from Databricks gave a post-commit approval for this because the code is targeting Apache Spark 4.1.0 like the following. @dtenedor seems to be unable to get a proper chance of community review yet.
/**
* Aggregate function: returns the compact binary representation of the Datasketches
* KllLongsSketch built with the values in the input column. The optional k parameter controls
* the size and accuracy of the sketch (default 200, range 8-65535).
*
* @group agg_funcs
* @since 4.1.0
*/
def kll_sketch_agg_bigint(e: Column, k: Column): Column =
Column.fn("kll_sketch_agg_bigint", e, k)
|
In addition, if this is not targeting for Apache Spark 4.1.0 technically, please update the wrong |
|
The change LGTM, it's just adding dataframe functions to match SQL. Given the related SQL functions are in 4.1, I think it makes sense to backport this to 4.1 as well. @dtenedor what do you think? |
|
Thank you, @cloud-fan ! |

What changes were proposed in this pull request?
This PR adds DataFrame API support for the KLL quantile sketch functions that were previously added to Spark SQL in #52800. This lets users leverage KLL sketches through both Scala and Python DataFrame APIs in addition to the existing SQL interface.
Key additions:
Scala DataFrame API (
sql/api/src/main/scala/org/apache/spark/sql/functions.scala):Columnparameters for computed valuesStringparameters for column namesIntparameters for literal k valuesPython DataFrame API (
python/pyspark/sql/functions/builtin.py):ColumnOrName,Optional[Union[int, Column]])Python Spark Connect Support (
python/pyspark/sql/connect/functions/builtin.py):Why are the changes needed?
While the SQL API for KLL sketches was previously added, DataFrame API support is essential for full usability. Without DataFrame API support, users would be forced to use SQL expressions via
expr()orselectExpr(), which is less ergonomic and type-safe.Does this PR introduce any user-facing change?
Yes, this PR adds DataFrame API support for the 18 KLL sketch functions:
Scala DataFrame API Example:
Python DataFrame API Example:
How was this patch tested?
Scala Unit Tests (
DataFrameAggregateSuite):kll_sketch_agg_{bigint,float,double}with default and explicit k valueskll_sketch_to_stringfunctions for all data typeskll_sketch_get_nfunctions for all data typeskll_sketch_mergeoperationskll_sketch_get_quantilewith single rank and array of rankskll_sketch_get_rankoperationsPython Unit Tests (
test_functions.py):Was this patch authored or co-authored using generative AI tooling?
Yes, IDE assistance used
claude-4.5-sonnetwith manual validation and integration.