Spark: Support UPDATE statements without subqueries #2193
Conversation
Force-pushed from 40684b5 to 6c173d1.
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignUpdateTable.scala (outdated conversation, resolved)
import ExtendedDataSourceV2Implicits._
import RewriteRowLevelOperationHelper._

override def resolver: Resolver = spark.sessionState.conf.resolver
Moved this to parent as I need both conf and resolver now.
!(actions.size == 1 && hasUnconditionalDelete(actions.headOption))
}

private def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = {
Moved to parent.
import ExtendedDataSourceV2Implicits._

// TODO: can we do any better for no-op updates? when conditions evaluate to false/true?
Right now, UPDATE t SET ... WHERE false will result in a job and a commit with no changes. We may want to handle such cases differently, but I am not sure it is a big deal.
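As a minimal sketch only (not part of this PR), one way to short-circuit such statements would be an optimizer rule that drops updates whose condition is literally false; the rule name and the empty LocalRelation replacement are assumptions:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, UpdateTable}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: WHERE false can never match a row, so replace the whole
// UPDATE with an empty relation instead of running a job and an empty commit.
object EliminateNoopUpdates extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UpdateTable(_, _, Some(Literal.FalseLiteral)) => LocalRelation(Nil)
  }
}
```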
// this method relies on the fact that the assignments have been aligned before
require(relation.output.size == assignments.size, "assignments must be aligned")

// Spark is going to execute the condition for each column but it seems we cannot avoid this
Any ideas on how we can avoid this are welcome. I checked the generated code, and the condition was evaluated for every column that is subject to change.
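To make the repeated evaluation concrete, here is a simplified sketch of the kind of per-column IF projection under discussion (the helper name is illustrative; it assumes assignments are already aligned, one expression per output column):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, If, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Because `cond` is embedded in every column's IF, codegen evaluates it once
// per updated column for each row, which is the repeated evaluation noted above.
def buildUpdatedProjection(
    relation: LogicalPlan,
    assignments: Seq[Expression], // aligned: one expression per output column
    cond: Expression): LogicalPlan = {
  val updatedColumns: Seq[NamedExpression] = relation.output.zip(assignments).map {
    case (attr, newValue) => Alias(If(cond, newValue, attr), attr.name)()
  }
  Project(updatedColumns, relation)
}
```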
@aokolnychyi I guess we can try to look at this in the future. I think there is no good way to know the cost of the condition expression at the time of the rewrite, so I am not sure an alternate plan would do better in all conditions. Wdyt?
You mean using a MergeInto node? Let me add a TODO item here. I think we would need to test this out at some reasonable scale.
@dilipbiswal @mehtaashish23 Do you have existing infra for benchmarks? Would it be easy for you to try this?
Yeah, we could use a custom node like MergeIntoExec that runs the expression and then performs the right projection. I think if we tried to use two projections, the optimizer would collapse them and rewrite everything to be equivalent to this, where the condition runs for each column.
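For illustration only (function and alias names hypothetical), the two-projection alternative being ruled out might look like the sketch below; Spark's CollapseProject rule merges adjacent projections by inlining aliases, reproducing the per-column condition:

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, If, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Attempt to evaluate the condition once per row in a lower projection.
def buildTwoStepProjection(
    relation: LogicalPlan,
    assignments: Seq[Expression],
    cond: Expression): LogicalPlan = {
  val matched = Alias(cond, "_matched_")()
  val projectList: Seq[NamedExpression] = relation.output :+ matched
  val withCond = Project(projectList, relation)
  val updatedColumns: Seq[NamedExpression] = relation.output.zip(assignments).map {
    case (attr, newValue) => Alias(If(matched.toAttribute, newValue, attr), attr.name)()
  }
  // CollapseProject substitutes `_matched_` back with `cond` inside each IF,
  // reducing this to the single-projection plan where `cond` runs per column.
  Project(updatedColumns, withCond)
}
```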
@rdblue, yes, I can confirm your thinking, as I tried that.
@aokolnychyi You mean running the same set of data with
- All updates merged via "Merge Into"
- All updates applied via "Update table"
and comparing the benchmarks? My concern is that MergeInto will do a join irrespective of the input, so I am not sure the If expression can be fairly compared against a shuffle of the data. I hope I am getting the expectation right.
I meant: if we have an alternate implementation to what is currently proposed, would it be possible to perform tests at some reasonable scale to see if that alternative works better?
Roger, we can help with that, but I would recommend merging whichever approach is functionally ready. As we get bandwidth, we can run benchmarks comparing the two approaches.
Sounds good. I'll cc once done.
  }
}

protected def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = {
Copied as it was.
For a follow-up, should we add a rule that does this for Append and Overwrite plans as well? That would be nice so that we don't have to wait until 3.2.0 to get write distribution and ordering for normal writes.
I think we may consider doing that too.
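As a rough sketch of that follow-up idea (the rule name is hypothetical; buildWritePlan is the PR's helper, stubbed here so the snippet stands alone, and the real rule would also check for Iceberg relations):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{AppendData, LogicalPlan, OverwriteByExpression}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object SortBeforeWrites extends Rule[LogicalPlan] {
  // Stub: the PR's buildWritePlan injects the table's write distribution/ordering.
  private def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = childPlan

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case a: AppendData => a.table match {
      case r: DataSourceV2Relation => a.copy(query = buildWritePlan(a.query, r.table))
      case _ => a
    }
    case o: OverwriteByExpression => o.table match {
      case r: DataSourceV2Relation => o.copy(query = buildWritePlan(o.query, r.table))
      case _ => o
    }
  }
}
```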
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignUpdateTable.scala (outdated conversation, resolved)
...n/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeConditionsInRowLevelOperations.scala (outdated conversation, resolved)
    if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
  val optimizedCond = optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)
  d.copy(condition = Some(optimizedCond))
case u @ UpdateTable(table, _, cond)
@aokolnychyi Both of the cases have the same actions. Is there a way we can keep it in one place?
Took another look. Not sure, actually. We need both branches to get cond; the main shared logic is in a separate method, but I'd welcome any ideas I could have overlooked.
This is short enough that I think it's okay.
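Reconstructing the full shape from the excerpt above (the PR's helpers are stubbed here for illustration), the duplication under discussion is just the two guarded branches, each delegating to the shared optimizeCondition:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, SubqueryExpression}
import org.apache.spark.sql.catalyst.plans.logical.{DeleteFromTable, LogicalPlan, UpdateTable}

object OptimizeConditionsSketch {
  // Stubs standing in for the PR's helpers.
  def isIcebergRelation(table: LogicalPlan): Boolean = true
  def optimizeCondition(cond: Expression, table: LogicalPlan): Expression = cond

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case d @ DeleteFromTable(table, cond)
        if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
      d.copy(condition = Some(optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)))
    case u @ UpdateTable(table, _, cond)
        if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
      u.copy(condition = Some(optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)))
  }
}
```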
override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case UpdateTable(r: DataSourceV2Relation, assignments, Some(cond))
      if isIcebergRelation(r) && SubqueryExpression.hasSubquery(cond) =>
    throw new AnalysisException("UPDATE statements with subqueries are not currently supported")
@aokolnychyi We are keeping it here and not in checkAnalysis because we are going to open it up in a follow-up?
Yeah, I am already looking into this. I did not want to add a condition to the branch below so that it can stay as is.
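For illustration (table and column names hypothetical, assuming a spark-shell session), this is the kind of statement the check currently rejects:

```scala
// Throws org.apache.spark.sql.AnalysisException:
//   "UPDATE statements with subqueries are not currently supported"
spark.sql("UPDATE db.tbl SET status = 'archived' WHERE id IN (SELECT id FROM db.expired)")
```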
@aokolnychyi Looks good to me.
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignUpdateTable.scala (outdated conversation, resolved)
The PR is ready for another round.
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteUpdate.scala (outdated conversation, resolved)
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteUpdate.scala (resolved)
I see no major blockers with this. There are a few things to consider, but overall I think this can go in and we can improve it from there.
Force-pushed from bcc0d8e to 393b4b9.
Thanks for reviewing, @dilipbiswal @mehtaashish23 @rdblue!
This PR introduces support for UPDATE statements in Spark. Support for subqueries will come in a follow-up PR.