@aokolnychyi
Contributor

This PR fixes system function predicate pushdown in copy-on-write (CoW) MERGE operations and adds tests for other use cases. Prior to this PR, we did not cover the ReplaceData and Join nodes that participate in CoW operations, nor did we cover IN expressions.

See GroupBasedRowLevelOperationScanPlanning in Spark for more context.

@github-actions github-actions bot added the spark label Mar 5, 2024
} else {
filter.copy(condition = newCondition)
}
plan.transformWithPruning (_.containsAnyPattern(COMMAND, FILTER, JOIN)) {
Contributor Author

Note that containsAllPatterns became containsAnyPattern. I don't anticipate this being a performance problem, however.
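To illustrate the difference between the two pruning checks, here is a minimal sketch in plain Java. It is a toy model, not Spark's tree-pattern API: the `PatternPruningSketch` class, its `Pattern` enum, and the example plan are all hypothetical. It only shows why an "all" check can skip a plan that an "any" check would visit.

```java
import java.util.EnumSet;

// Toy model of pattern-based pruning; this is not Spark's tree-pattern API.
final class PatternPruningSketch {
  enum Pattern { COMMAND, FILTER, JOIN }

  // True only when the subtree contains every requested pattern.
  static boolean containsAllPatterns(EnumSet<Pattern> subtree, Pattern... wanted) {
    for (Pattern p : wanted) {
      if (!subtree.contains(p)) {
        return false;
      }
    }
    return true;
  }

  // True when the subtree contains at least one requested pattern.
  static boolean containsAnyPattern(EnumSet<Pattern> subtree, Pattern... wanted) {
    for (Pattern p : wanted) {
      if (subtree.contains(p)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical CoW plan: a command and a join, but no standalone filter.
    EnumSet<Pattern> plan = EnumSet.of(Pattern.COMMAND, Pattern.JOIN);
    // An "all" check skips this plan because FILTER is missing.
    System.out.println(
        containsAllPatterns(plan, Pattern.COMMAND, Pattern.FILTER, Pattern.JOIN)); // false
    // An "any" check still visits it.
    System.out.println(
        containsAnyPattern(plan, Pattern.COMMAND, Pattern.FILTER, Pattern.JOIN)); // true
  }
}
```

The trade-off is that an "any" check visits more subtrees, which is the potential performance cost mentioned above.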

case replace @ ReplaceData(_, cond, _, _, _, _) =>
replaceStaticInvoke(replace, cond, newCond => replace.copy(condition = newCond))

case join @ Join(_, _, _, Some(cond), _) =>
Contributor Author

We have to cover Join because GroupBasedRowLevelOperationScanPlanning in Spark must be able to simplify the join condition by discarding filters evaluated on the Iceberg side.

}

private def replaceStaticInvoke(condition: Expression): Expression = {
condition.transformWithPruning(_.containsAnyPattern(BINARY_COMPARISON, IN, INSET)) {
Contributor Author

Cover not only BINARY_COMPARISON but also IN and INSET.
Otherwise, IN expressions are not pushed down.
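As a rough illustration of why the pattern list matters, the sketch below models predicates as plain Java objects and rewrites only those whose kind is enabled. Everything here is hypothetical: the `InPushdownSketch` class, the `Kind` enum, and the `static_invoke` / `apply_function` strings are stand-ins for the real expressions, not the rule's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model: a rewrite gated on predicate kinds, loosely mirroring pattern pruning.
final class InPushdownSketch {
  enum Kind { BINARY_COMPARISON, IN, OTHER }

  // Minimal predicate: its kind plus a textual form of the expression.
  static final class Pred {
    final Kind kind;
    final String text;

    Pred(Kind kind, String text) {
      this.kind = kind;
      this.text = text;
    }
  }

  // Rewrites "static_invoke" to "apply_function" only in predicates of an enabled kind.
  static List<String> rewrite(List<Pred> preds, Kind... enabled) {
    List<Kind> enabledKinds = Arrays.asList(enabled);
    List<String> out = new ArrayList<>();
    for (Pred p : preds) {
      boolean visited = enabledKinds.contains(p.kind);
      out.add(visited ? p.text.replace("static_invoke", "apply_function") : p.text);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Pred> preds = Arrays.asList(
        new Pred(Kind.BINARY_COMPARISON, "static_invoke(bucket, id) = 1"),
        new Pred(Kind.IN, "static_invoke(bucket, id) IN (1, 2)"));

    // Gated on comparisons only: the IN predicate is left untouched.
    System.out.println(rewrite(preds, Kind.BINARY_COMPARISON));
    // Gated on comparisons and IN: both predicates are rewritten.
    System.out.println(rewrite(preds, Kind.BINARY_COMPARISON, Kind.IN));
  }
}
```

In this model, a predicate that is never visited keeps its original form and so cannot benefit from any later pushdown step that expects the rewritten form.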

public static List<Expression> collectExprs(
SparkPlan sparkPlan, Predicate<Expression> predicate) {
Seq<List<Expression>> seq =
SPARK_HELPER.collect(
Contributor Author

Similar to the logic in PlanUtils, but it accounts for AQE plans by relying on SPARK_HELPER.
This class also existed before PlanUtils, so we might want to converge the two in the future.

Contributor

If these are only used in tests, should we add them to test utils?

Contributor Author

This is already a test util, just the one that existed before PlanUtils. I would probably merge PlanUtils with this class in the future because this one handles AQE plans.


private void checkDelete(RowLevelOperationMode mode, String cond) {
withUnavailableLocations(
findIrrelevantFileLocations(cond),
Contributor Author

@aokolnychyi aokolnychyi Mar 5, 2024

Making irrelevant files unavailable verifies that predicate pushdown works correctly: the operation fails if it tries to read a file that should have been pruned.
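The idea behind this test technique can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not the actual test helper: the `UnavailableFilesSketch` class, its `DataFile` type, and the `scan` method are hypothetical, and pruning is reduced to a single integer comparison.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Toy model: a scan that fails loudly when it opens a file marked unavailable.
final class UnavailableFilesSketch {
  // Hypothetical data file: a location plus the single bucket value it holds.
  static final class DataFile {
    final String location;
    final int bucket;

    DataFile(String location, int bucket) {
      this.location = location;
      this.bucket = bucket;
    }
  }

  // Returns the number of files opened; throws if an unavailable file is opened.
  static int scan(
      List<DataFile> files, Set<String> unavailable, int wantedBucket, boolean pushdown) {
    int opened = 0;
    for (DataFile file : files) {
      if (pushdown && file.bucket != wantedBucket) {
        continue; // pruned during planning, so the file is never opened
      }
      if (unavailable.contains(file.location)) {
        throw new IllegalStateException("Opened unavailable file: " + file.location);
      }
      opened++;
    }
    return opened;
  }

  public static void main(String[] args) {
    List<DataFile> files =
        Arrays.asList(new DataFile("file-a", 0), new DataFile("file-b", 1));
    // Only file-a matches bucket 0, so file-b is irrelevant and made unavailable.
    Set<String> unavailable = Collections.singleton("file-b");
    System.out.println(scan(files, unavailable, 0, true)); // 1
  }
}
```

If pushdown silently stopped working, the same scan would open `file-b` and throw, so the test cannot pass by accident.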


import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;

abstract class BaseScalarFunction<R> implements ScalarFunction<R> {
Contributor Author

@aokolnychyi aokolnychyi Mar 5, 2024

This must be added for proper equality on the Spark side. Otherwise, post-scan filters cannot be simplified. It is debatable which side should be responsible for this, but we can't fix Spark right now.
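A minimal sketch of what value-based equality in a shared base class could look like is shown below. It does not depend on Spark: `NamedFunction` is a hypothetical stand-in for the function interface, and keying equality on `canonicalName()` is an assumption made for illustration, not necessarily what this PR implements.

```java
// Toy model of value-based equality for function objects; not the PR's actual code.
final class FunctionEqualitySketch {
  // Hypothetical stand-in for the function interface; only the method needed here.
  interface NamedFunction {
    String canonicalName();
  }

  // Base class that compares functions by canonical name (an assumption).
  abstract static class BaseFunction implements NamedFunction {
    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof BaseFunction)) {
        return false;
      }
      return canonicalName().equals(((BaseFunction) other).canonicalName());
    }

    @Override
    public int hashCode() {
      return canonicalName().hashCode();
    }
  }

  // Two separately created instances of this class compare equal.
  static final class BucketInt extends BaseFunction {
    @Override
    public String canonicalName() {
      return "example.bucket(int)";
    }
  }

  // Without the base class, equality falls back to object identity.
  static final class PlainBucketInt implements NamedFunction {
    @Override
    public String canonicalName() {
      return "example.bucket(int)";
    }
  }

  public static void main(String[] args) {
    System.out.println(new BucketInt().equals(new BucketInt())); // true
    System.out.println(new PlainBucketInt().equals(new PlainBucketInt())); // false
  }
}
```

With identity-based equality, two expressions that wrap logically identical functions never compare equal, which is why an optimizer could fail to recognize that a filter has already been applied.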

case class ApplyFunctionExpression(
    function: ScalarFunction[_],
    children: Seq[Expression])
  extends Expression with UserDefinedExpression with CodegenFallback with ImplicitCastInputTypes {

@aokolnychyi aokolnychyi merged commit 0432048 into apache:main Mar 11, 2024
@aokolnychyi
Contributor Author

Thanks, @rdblue!

tomtongue pushed a commit to tomtongue/iceberg that referenced this pull request Mar 12, 2024
@aokolnychyi aokolnychyi added this to the Iceberg 1.5.1 milestone Apr 17, 2024
amogh-jahagirdar pushed a commit to amogh-jahagirdar/iceberg that referenced this pull request Apr 18, 2024
nastra pushed a commit that referenced this pull request Apr 18, 2024
) (#10170)

Co-authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request Sep 16, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024