
Conversation

@szehon-ho (Member) commented Nov 4, 2025:

What changes were proposed in this pull request?

Change the MERGE INTO schema evolution scope: limit schema evolution to adding only those columns/nested fields that exist in the source and are directly assigned from the corresponding source column without transformation.

i.e., only assignments of these forms:

UPDATE SET new_col = source.new_col 
UPDATE SET struct.new_field = source.struct.new_field
INSERT (old_col, new_col) VALUES (s.old_col, s.new_col)
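
By contrast, an assignment that transforms the source value rather than referencing it directly would no longer trigger schema evolution. An illustrative counter-example (not from the PR; upper() stands in for any transformation):

UPDATE SET new_col = upper(source.new_col)
INSERT (old_col, new_col) VALUES (s.old_col, s.new_col + 1)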

Why are the changes needed?

#51698 added schema evolution support for MERGE INTO statements. However, its scope is too broad: in some cases the source table has many more fields than the target table, but the user only needs a few of the new ones added to the target for the MERGE INTO statement.

Does this PR introduce any user-facing change?

No, MERGE INTO schema evolution is not yet released in Spark 4.1.

How was this patch tested?

Added unit tests in MergeIntoTableSuiteBase.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label on Nov 4, 2025
@szehon-ho force-pushed the merge_schema_evolution_limit_cols branch 3 times, most recently from 41731d2 to 6c6de51 on November 4, 2025 20:02
@szehon-ho (Member, Author) commented:

@cloud-fan @aokolnychyi can you take a look? I think this is an important improvement to get in before we release the MERGE INTO WITH SCHEMA EVOLUTION feature in Spark 4.1, thanks!

@szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 6c6de51 to 24b1a51 on November 4, 2025 20:06
|USING source s
|ON t.pk = s.pk
|WHEN MATCHED THEN
| UPDATE SET dep='software'
Contributor:

This test is weird: dep is an existing column in the target table, so we certainly do not need schema evolution. What was the behavior before this PR?

Member (Author):

Oh, it's because the source table has more columns, but they are not used.
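
A hedged sketch of the scenario; the schemas and the exact MERGE line (elided in the quoted test) are assumptions, not from the PR:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Assumed schemas for illustration:
//   target: (pk INT, salary INT, dep STRING)
//   source: (pk INT, salary INT, dep STRING, extra_col STRING)
spark.sql(
  """MERGE WITH SCHEMA EVOLUTION INTO target t
    |USING source s
    |ON t.pk = s.pk
    |WHEN MATCHED THEN
    |  UPDATE SET dep = 'software'
    |""".stripMargin)
// Before this PR, extra_col was added to the target simply because the source
// schema contained it; after this PR nothing is added, since no assignment
// references extra_col.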

|ON t.pk = s.pk
|WHEN NOT MATCHED THEN
| INSERT (pk, info, dep) VALUES (s.pk,
| named_struct('salary', s.info.salary, 'status', 'active'), 'marketing')
Contributor:

Why do we trigger schema evolution for this case?

Member (Author):

Discussed offline; refined the logic to be more selective (direct assignment from a source column).

@szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 448bfdf to 8ecc4ad on November 6, 2025 22:59
@szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 8ecc4ad to abbeb1e on November 6, 2025 23:02
private lazy val sourceSchemaForEvolution: StructType =
MergeIntoTable.sourceSchemaForSchemaEvolution(this)

lazy val needSchemaEvolution: Boolean = {
@cloud-fan (Contributor) commented Nov 7, 2025:

I think the rule ResolveMergeIntoSchemaEvolution should be triggered as long as MergeIntoTable#schemaEvolutionEnabled is true. This complicated logic should be moved into ResolveMergeIntoSchemaEvolution, and the rule should return the merge command unchanged if schema evolution is not needed.

To make ResolveMergeIntoSchemaEvolution robust to rule ordering, we should wait for the merge assignment values to be resolved before entering the rule. At the beginning of the rule, resolve the merge assignment keys again so that rule order does not matter. We can stop early if an assignment value is not a pure field reference and there is no star.
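
A minimal sketch of the suggested rule shape, assuming hypothetical helper names for each step (this is not the actual Spark implementation):

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MergeIntoTable}
import org.apache.spark.sql.catalyst.rules.Rule

object ResolveMergeIntoSchemaEvolution extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
    // Fire purely on the WITH SCHEMA EVOLUTION flag; all other checks live
    // inside, and we wait until the assignment values are resolved.
    case m: MergeIntoTable if m.schemaEvolutionEnabled && assignmentValuesResolved(m) =>
      // Re-resolve the assignment keys first so the decision is independent of rule order.
      val resolvedKeys = reResolveAssignmentKeys(m)
      if (needsSchemaEvolution(resolvedKeys)) evolveTargetSchema(resolvedKeys)
      else m // return the merge command unchanged when no evolution is needed
  }

  // Hypothetical helpers, named after the steps described above:
  private def assignmentValuesResolved(m: MergeIntoTable): Boolean = ???
  private def reResolveAssignmentKeys(m: MergeIntoTable): MergeIntoTable = ???
  private def needsSchemaEvolution(m: MergeIntoTable): Boolean = ???
  private def evolveTargetSchema(m: MergeIntoTable): LogicalPlan = ???
}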

val assignmentValueExpr = extractFieldPath(assignment.value)
// Valid assignments are: col = s.col or col.nestedField = s.col.nestedField
assignmentKeyExpr.length == path.length && isPrefix(assignmentKeyExpr, path) &&
isSuffix(path, assignmentValueExpr)
Contributor:

Is this only to skip the source table qualifier? It seems wrong to trigger schema evolution for col = wrong_table.col, which should fail analysis without schema evolution.
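
A self-contained sketch of what that suffix check accepts; the helper bodies here are assumptions inferred from the snippet, only the call shape comes from the diff:

object PathCheckDemo extends App {
  // Assumed semantics: isPrefix(p, q) holds when p is a prefix of q, and
  // isSuffix(p, q) holds when p is a suffix of q (argument order as in the diff).
  def isPrefix(prefix: Seq[String], path: Seq[String]): Boolean = path.startsWith(prefix)
  def isSuffix(suffix: Seq[String], path: Seq[String]): Boolean = path.endsWith(suffix)

  // UPDATE SET col = s.col: key path ["col"], value path ["s", "col"].
  val keyPath = Seq("col")
  println(isSuffix(keyPath, Seq("s", "col")))           // true: the qualifier is skipped
  // The concern above: a wrong qualifier passes the same check, even though
  // col = wrong_table.col should fail analysis instead of evolving the schema.
  println(isSuffix(keyPath, Seq("wrong_table", "col"))) // also true
}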
