@aokolnychyi
Contributor

This PR introduces support for UPDATE statements in Spark. Support for subqueries will come in a follow-up PR.

@aokolnychyi
Contributor Author

aokolnychyi commented Feb 2, 2021

import ExtendedDataSourceV2Implicits._
import RewriteRowLevelOperationHelper._

override def resolver: Resolver = spark.sessionState.conf.resolver
Contributor Author

Moved this to parent as I need both conf and resolver now.

!(actions.size == 1 && hasUnconditionalDelete(actions.headOption))
}

private def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = {
Contributor Author

Moved to parent.


import ExtendedDataSourceV2Implicits._

// TODO: can we do any better for no-op updates? when conditions evaluate to false/true?
Contributor Author

Right now, UPDATE t SET ... WHERE false will result in a job and a commit with no changes. We may want to handle such cases differently, but I am not sure it is a big deal.
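One possible way to handle the no-op case is sketched below on a toy model. This is not what the PR does and it does not use Spark's Catalyst classes; `Cond`, `Plan`, `SkipWrite`, `RewriteFiles`, and `planUpdate` are hypothetical names introduced only for illustration. The idea is to skip the job and the commit when the condition is statically known to be false.

```scala
// Toy model only: every name here is hypothetical, not Spark or Iceberg API.
object NoOpUpdateSketch {
  sealed trait Cond
  case object AlwaysTrue extends Cond
  case object AlwaysFalse extends Cond
  // Any condition whose value is not known at planning time.
  final case class Opaque(sql: String) extends Cond

  sealed trait Plan
  // No job and no commit are produced.
  case object SkipWrite extends Plan
  // The regular rewrite path for matching rows.
  final case class RewriteFiles(cond: Cond) extends Plan

  // Decide how to plan UPDATE ... WHERE cond.
  def planUpdate(cond: Cond): Plan = cond match {
    case AlwaysFalse => SkipWrite           // nothing can match, so do nothing
    case other       => RewriteFiles(other) // everything else is rewritten as usual
  }
}
```

Whether an empty commit is worth avoiding is a separate question; the sketch only shows where such a check could sit.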

// this method relies on the fact that the assignments have been aligned before
require(relation.output.size == assignments.size, "assignments must be aligned")

// Spark is going to execute the condition for each column but it seems we cannot avoid this
Contributor Author

@aokolnychyi aokolnychyi Feb 2, 2021

Any ideas whether we can avoid this are welcome. I did check the generated code and the condition was evaluated for every column that is subject to change.
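To make the concern concrete, here is a minimal sketch of the per-column rewrite on plain Scala maps, assuming each assigned column is projected as "if (cond) newValue else oldValue". It is a toy model, not the Catalyst plan from this PR; `Row`, `updateRow`, and its parameters are hypothetical names.

```scala
// Toy model only: Row and updateRow are hypothetical, not Spark API.
object PerColumnUpdateSketch {
  type Row = Map[String, Any]

  // Rebuilds a row column by column. Every assigned column gets its own
  // conditional, so cond is invoked once per assigned column.
  def updateRow(
      row: Row,
      columns: Seq[String],
      assignments: Map[String, Row => Any],
      cond: Row => Boolean): Row = {
    columns.map { col =>
      val value = assignments.get(col) match {
        // Assigned column: if (cond) newValue else oldValue.
        case Some(newValue) => if (cond(row)) newValue(row) else row(col)
        // Unassigned column: passed through unchanged.
        case None => row(col)
      }
      col -> value
    }.toMap
  }
}
```

Passing a `cond` that increments a counter shows the cost directly: with two assigned columns the condition runs twice per row.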

Contributor

@aokolnychyi I guess we can look at this in the future. I don't think there is a good way to know the cost of the condition expression at rewrite time, so I am not sure an alternate plan would do better in all cases. WDYT?

Contributor Author

You mean using MergeInto node? Let me add a TODO item here. I think we would need to test this out at some reasonable scale.

Contributor Author

@dilipbiswal @mehtaashish23 Do you have an existing infra for benchmarks? Will it be easy for you to try this?

Contributor

Yeah, we could use a custom node like MergeIntoExec that runs the expression and then performs the right projection. I think if we tried to use two projections, the optimizer would collapse them and rewrite everything equivalent to this, where the condition runs for each column.
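A rough sketch of the "run the expression, then perform the right projection" idea, again on plain Scala maps rather than a physical Spark node. `Row` and `updateRow` are hypothetical names, and this is only an illustration of the evaluation order, not an implementation of MergeIntoExec.

```scala
// Toy model only: Row and updateRow are hypothetical, not Spark API.
object EvaluateOnceSketch {
  type Row = Map[String, Any]

  // Evaluates cond exactly once per row, then picks one of two projections:
  // the "updated" projection or the identity projection.
  def updateRow(
      row: Row,
      assignments: Map[String, Row => Any],
      cond: Row => Boolean): Row = {
    if (cond(row)) {
      // New values are computed against the original row.
      row ++ assignments.map { case (col, newValue) => col -> newValue(row) }
    } else {
      row
    }
  }
}
```

With a counting `cond`, the condition runs once per row regardless of how many columns are assigned. Whether that wins in practice would still need the benchmarks discussed in this thread.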

Contributor Author

@rdblue, yes, I confirm your thoughts as I tried that.

Contributor

@mehtaashish23 mehtaashish23 Feb 2, 2021

@aokolnychyi You mean running the same set of data with

  1. All updates merged via "Merge Into"
  2. All updates updated via "Update table"

And compare the benchmarks? MergeInto will do a join irrespective of the input, so I am not sure the cost of the If expression can be compared with a shuffle of data. I hope I am understanding the expectation correctly.

Contributor Author

I meant if we have an alternate implementation to what is currently proposed, will it be possible to perform tests at some reasonable scale to see if that alternative solution works better?

Contributor

Roger, we can help with that, but I would recommend merging either approach first so that the feature is functionally ready. As we get bandwidth, we can run the benchmark between the two approaches.

Contributor Author

Sounds good. I'll cc once done.

}
}

protected def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = {
Contributor Author

Copied as it was.

Contributor

For a follow up, should we add a rule that does this for Append and Overwrite plans as well? That would be nice so that we don't have to wait until 3.2.0 to get write distribution and ordering for normal writes.

Contributor Author

I think we may consider doing that too.

if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
val optimizedCond = optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)
d.copy(condition = Some(optimizedCond))
case u @ UpdateTable(table, _, cond)
Contributor

@aokolnychyi Both cases perform the same actions. Is there a way to keep this in one place?

Contributor Author

Took another look. Not sure, actually. We need both branches to get cond. The main shared logic is already in a separate method, but I'd welcome any ideas I may have overlooked.

Contributor

This is short enough that I think it's okay.

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
case UpdateTable(r: DataSourceV2Relation, assignments, Some(cond))
if isIcebergRelation(r) && SubqueryExpression.hasSubquery(cond) =>
throw new AnalysisException("UPDATE statements with subqueries are not currently supported")
Contributor

@aokolnychyi Are we keeping this here rather than in checkAnalysis because we are going to open it up in a follow-up?

Contributor Author

Yeah, I am already looking into this. I did not want to add a condition to the branch below so that it can stay as is.
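For readers unfamiliar with this kind of guard, the check can be sketched on a toy expression tree. This is an illustration only: `Expr`, `InSubquery`, `hasSubquery`, and `validateUpdate` below are hypothetical stand-ins, not Spark's `SubqueryExpression` or `AnalysisException`, and the sketch returns an `Either` instead of throwing.

```scala
// Toy model only: a hypothetical expression tree, not Catalyst.
object SubqueryCheckSketch {
  sealed trait Expr { def children: Seq[Expr] }
  final case class Column(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class Lit(value: Any) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  // Stand-in for any subquery expression, e.g. "value IN (SELECT ...)".
  final case class InSubquery(value: Expr, sql: String) extends Expr {
    val children: Seq[Expr] = Seq(value)
  }

  // True if a subquery appears anywhere in the expression tree.
  def hasSubquery(e: Expr): Boolean = e match {
    case _: InSubquery => true
    case other         => other.children.exists(hasSubquery)
  }

  // Rejects UPDATE conditions that contain a subquery; a missing
  // condition (no WHERE clause) is always accepted.
  def validateUpdate(cond: Option[Expr]): Either[String, Unit] =
    if (cond.exists(hasSubquery)) {
      Left("UPDATE statements with subqueries are not currently supported")
    } else {
      Right(())
    }
}
```

Keeping the rejection in its own case, as in the rule above, means the supported branch does not need an extra guard and can stay unchanged once subqueries are allowed.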

@dilipbiswal
Contributor

@aokolnychyi Looks good to me.

@aokolnychyi
Contributor Author

The PR is ready for another round.

@rdblue
Contributor

rdblue commented Feb 2, 2021

I see no major blockers with this. There are a few things to consider, but overall I think this can go in and we can improve it from there.

@aokolnychyi aokolnychyi merged commit f229f1a into apache:master Feb 3, 2021
@aokolnychyi
Contributor Author

Thanks for reviewing, @dilipbiswal @mehtaashish23 @rdblue!

coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021