Spark: Support UPDATE statements without subqueries #2193
Conversation
Force-pushed from 40684b5 to 6c173d1.
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignUpdateTable.scala (outdated conversation, resolved)
import ExtendedDataSourceV2Implicits._
import RewriteRowLevelOperationHelper._

override def resolver: Resolver = spark.sessionState.conf.resolver
Moved this to parent as I need both conf and resolver now.
!(actions.size == 1 && hasUnconditionalDelete(actions.headOption))
}

private def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = {
Moved to parent.
import ExtendedDataSourceV2Implicits._

// TODO: can we do any better for no-op updates? when conditions evaluate to false/true?
Right now, UPDATE t SET ... WHERE false will result in a job and a commit with no changes. We may want to handle such cases differently, but I am not sure it is a big deal.
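As a minimal sketch only (not part of this PR), one way to short-circuit such statements would be an optimizer rule that drops updates whose condition is literally false; the rule name and the empty LocalRelation replacement are assumptions:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, UpdateTable}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: WHERE false can never match a row, so replace the whole
// UPDATE with an empty relation instead of running a job and an empty commit.
object EliminateNoopUpdates extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UpdateTable(_, _, Some(Literal.FalseLiteral)) => LocalRelation(Nil)
  }
}
```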
// this method relies on the fact that the assignments have been aligned before
require(relation.output.size == assignments.size, "assignments must be aligned")

// Spark is going to execute the condition for each column but it seems we cannot avoid this
Any ideas on how we can avoid this are welcome. I checked the generated code, and the condition was evaluated for every column that is subject to change.
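To make the repeated evaluation concrete, here is a simplified sketch of the kind of per-column IF projection under discussion (the helper name is illustrative; it assumes assignments are already aligned, one expression per output column):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, If, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Because `cond` is embedded in every column's IF, codegen evaluates it once
// per updated column for each row, which is the repeated evaluation noted above.
def buildUpdatedProjection(
    relation: LogicalPlan,
    assignments: Seq[Expression], // aligned: one expression per output column
    cond: Expression): LogicalPlan = {
  val updatedColumns: Seq[NamedExpression] = relation.output.zip(assignments).map {
    case (attr, newValue) => Alias(If(cond, newValue, attr), attr.name)()
  }
  Project(updatedColumns, relation)
}
```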
@aokolnychyi I guess we can try to look at this in the future. I think there is no good way to know the cost of the condition expression at the time of the rewrite, so I am not sure an alternate plan would do better in all conditions. Wdyt?
You mean using a MergeInto node? Let me add a TODO item here. I think we would need to test this out at some reasonable scale.
@dilipbiswal @mehtaashish23 Do you have existing infra for benchmarks? Would it be easy for you to try this?
Yeah, we could use a custom node like MergeIntoExec that runs the expression and then performs the right projection. I think if we tried to use two projections, the optimizer would collapse them and rewrite everything to be equivalent to this, where the condition runs for each column.
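For illustration only (function and alias names hypothetical), the two-projection alternative being ruled out might look like the sketch below; Spark's CollapseProject rule merges adjacent projections by inlining aliases, reproducing the per-column condition:

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, If, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Attempt to evaluate the condition once per row in a lower projection.
def buildTwoStepProjection(
    relation: LogicalPlan,
    assignments: Seq[Expression],
    cond: Expression): LogicalPlan = {
  val matched = Alias(cond, "_matched_")()
  val projectList: Seq[NamedExpression] = relation.output :+ matched
  val withCond = Project(projectList, relation)
  val updatedColumns: Seq[NamedExpression] = relation.output.zip(assignments).map {
    case (attr, newValue) => Alias(If(matched.toAttribute, newValue, attr), attr.name)()
  }
  // CollapseProject substitutes `_matched_` back with `cond` inside each IF,
  // reducing this to the single-projection plan where `cond` runs per column.
  Project(updatedColumns, withCond)
}
```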
@rdblue, yes, I can confirm your thinking, as I tried that.
@aokolnychyi You mean running the same set of data with
- All updates merged via "Merge Into"
- All updates applied via "Update table"
and comparing the benchmarks? My concern is that MergeInto will do a join irrespective of the input, so I am not sure the If expression can be fairly compared against a shuffle of the data. I hope I am getting the expectation right.
I meant: if we have an alternate implementation to what is currently proposed, would it be possible to perform tests at some reasonable scale to see if that alternative works better?
Roger, we can help with that, but I would recommend merging whichever approach is functionally ready. As we get bandwidth, we can run benchmarks comparing the two approaches.
Sounds good. I'll cc once done.
  }
}

protected def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = {
Copied as it was.
For a follow-up, should we add a rule that does this for Append and Overwrite plans as well? That would be nice so that we don't have to wait until 3.2.0 to get write distribution and ordering for normal writes.
I think we may consider doing that too.
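As a rough sketch of that follow-up idea (the rule name is hypothetical; buildWritePlan is the PR's helper, stubbed here so the snippet stands alone, and the real rule would also check for Iceberg relations):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{AppendData, LogicalPlan, OverwriteByExpression}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object SortBeforeWrites extends Rule[LogicalPlan] {
  // Stub: the PR's buildWritePlan injects the table's write distribution/ordering.
  private def buildWritePlan(childPlan: LogicalPlan, table: Table): LogicalPlan = childPlan

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case a: AppendData => a.table match {
      case r: DataSourceV2Relation => a.copy(query = buildWritePlan(a.query, r.table))
      case _ => a
    }
    case o: OverwriteByExpression => o.table match {
      case r: DataSourceV2Relation => o.copy(query = buildWritePlan(o.query, r.table))
      case _ => o
    }
  }
}
```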
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignUpdateTable.scala (outdated conversation, resolved)
...n/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeConditionsInRowLevelOperations.scala (outdated conversation, resolved)
    if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
  val optimizedCond = optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)
  d.copy(condition = Some(optimizedCond))
case u @ UpdateTable(table, _, cond)
@aokolnychyi Both of the cases have the same actions. Is there a way we can keep it in one place?
Took another look. Not sure, actually. We need both branches to get cond; the main shared logic is in a separate method, but I'd welcome any ideas I could have overlooked.
This is short enough that I think it's okay.
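Reconstructing the full shape from the excerpt above (the PR's helpers are stubbed here for illustration), the duplication under discussion is just the two guarded branches, each delegating to the shared optimizeCondition:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, SubqueryExpression}
import org.apache.spark.sql.catalyst.plans.logical.{DeleteFromTable, LogicalPlan, UpdateTable}

object OptimizeConditionsSketch {
  // Stubs standing in for the PR's helpers.
  def isIcebergRelation(table: LogicalPlan): Boolean = true
  def optimizeCondition(cond: Expression, table: LogicalPlan): Expression = cond

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case d @ DeleteFromTable(table, cond)
        if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
      d.copy(condition = Some(optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)))
    case u @ UpdateTable(table, _, cond)
        if !SubqueryExpression.hasSubquery(cond.getOrElse(Literal.TrueLiteral)) && isIcebergRelation(table) =>
      u.copy(condition = Some(optimizeCondition(cond.getOrElse(Literal.TrueLiteral), table)))
  }
}
```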
override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case UpdateTable(r: DataSourceV2Relation, assignments, Some(cond))
      if isIcebergRelation(r) && SubqueryExpression.hasSubquery(cond) =>
    throw new AnalysisException("UPDATE statements with subqueries are not currently supported")
@aokolnychyi We are keeping it here and not in checkAnalysis because we are going to open it up in a follow-up?
Yeah, I am already looking into this. I did not want to add a condition to the branch below so that it can stay as is.
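For illustration (table and column names hypothetical, assuming a spark-shell session), this is the kind of statement the check currently rejects:

```scala
// Throws org.apache.spark.sql.AnalysisException:
//   "UPDATE statements with subqueries are not currently supported"
spark.sql("UPDATE db.tbl SET status = 'archived' WHERE id IN (SELECT id FROM db.expired)")
```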
@aokolnychyi Looks good to me.
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignUpdateTable.scala (outdated conversation, resolved)
The PR is ready for another round.
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteUpdate.scala (outdated conversation, resolved)
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteUpdate.scala (resolved)
I see no major blockers with this. There are a few things to consider, but overall I think this can go in and we can improve it from there.
Force-pushed from bcc0d8e to 393b4b9.
Thanks for reviewing, @dilipbiswal @mehtaashish23 @rdblue!
This PR introduces support for UPDATE statements in Spark. Support for subqueries will come in a follow-up PR.