Spark: Sort retained rows in DELETE FROM by file and position #1955
Conversation
  } else {
    Preconditions.checkArgument(
-       field.isOptional() || field.fieldId() == MetadataColumns.ROW_POSITION.fieldId(),
+       field.isOptional() || MetadataColumns.metadataFieldIds().contains(field.fieldId()),
Needed to allow projecting _file even though it isn't in the data file.
This reminds me that we still need to fix that projection/selection bug.
    // add _file
    idToConstant.put(
        MetadataColumns.FILE_PATH.fieldId(),
        convertConstant.apply(Types.StringType.get(), task.file().path()));
This adds _file to the constants map so it is set in records like a partition value.
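A minimal, self-contained sketch of the idea (the class and method names below are illustrative, not Iceberg's actual reader code): a constants map keyed by field id lets the value readers return a fixed per-task value for columns such as _file instead of decoding them from the data file, exactly like identity partition values.

```scala
// Sketch only: how an idToConstant map can short-circuit column reads for
// columns that are not stored in the data file (e.g. _file, identity partitions).
object ConstantsMapSketch {
  sealed trait ValueReader { def read(row: Long): Any }

  // Returns the same value for every row in the task.
  final case class ConstantReader(value: Any) extends ValueReader {
    def read(row: Long): Any = value
  }

  // Stand-in for a reader that actually decodes values from the file.
  final case class ColumnReader(fieldId: Int) extends ValueReader {
    def read(row: Long): Any = s"decoded(field=$fieldId, row=$row)"
  }

  def readerFor(fieldId: Int, idToConstant: Map[Int, Any]): ValueReader =
    idToConstant.get(fieldId)
      .map(value => ConstantReader(value))
      .getOrElse(ColumnReader(fieldId))

  def main(args: Array[String]): Unit = {
    val fileFieldId = 1000 // stand-in for MetadataColumns.FILE_PATH.fieldId()
    val idToConstant = Map(fileFieldId -> "s3://bucket/db/table/data/00000-0.parquet")
    println(readerFor(fileFieldId, idToConstant).read(0)) // constant _file value for every row
    println(readerFor(1, idToConstant).read(0))           // read from the data file as usual
  }
}
```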
    val scan = scanBuilder.build()
-   val scanRelation = DataSourceV2ScanRelation(table, scan, output)
+   val scanRelation = DataSourceV2ScanRelation(table, scan, toOutputAttrs(scan.readSchema(), output))
Spark's contract is that the scan's schema is the one that should be used, not the original table schema. This allows the merge scan to return the extra _file and _pos columns and matches the behavior of normal scans that are configured with PushDownUtils.pruneColumns.
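For illustration, a rough sketch of what a toOutputAttrs helper could look like (the helper name comes from the diff above; this body is an assumption, not the PR's code): reuse the original attributes where the names match so existing references in the plan still resolve, and create fresh attributes for the extra columns the scan adds, such as _file and _pos.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.StructType

// Sketch only: build the relation's output from the scan's readSchema rather than
// the original table schema. Columns that already existed keep their attributes;
// new metadata columns get fresh AttributeReferences.
def toOutputAttrs(
    readSchema: StructType,
    original: Seq[AttributeReference]): Seq[AttributeReference] = {
  val nameToAttr = original.map(a => a.name -> a).toMap
  readSchema.fields.toSeq.map { field =>
    nameToAttr.getOrElse(
      field.name,
      AttributeReference(field.name, field.dataType, field.nullable, field.metadata)())
  }
}
```

Spark's own helper may instead rebuild each attribute and copy over the original exprId; either way the point is that the scan's readSchema, not the table schema, drives the output.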
It looks like vectorized Parquet is failing, @rdblue.
    val fileNameCol = findOutputAttr(remainingRowsPlan, FILE_NAME_COL)
    val rowPosCol = findOutputAttr(remainingRowsPlan, ROW_POS_COL)
    val order = Seq(SortOrder(fileNameCol, Ascending), SortOrder(rowPosCol, Ascending))
+1
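For context, a sketch of how an ordering like the one built above is typically applied (the PR's actual rewrite rule may differ in details): wrap the remaining-rows plan in a catalyst Sort node before the rows are written back.

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}

// Sketch only: sort the retained rows by (file, pos) so the rewritten files keep
// the original clustering. A local sort (global = false) is enough because the
// ordering only needs to hold within each write task, not across the whole job.
def sortByFileAndPos(
    remainingRowsPlan: LogicalPlan,
    fileNameCol: Attribute,
    rowPosCol: Attribute): LogicalPlan = {
  val order = Seq(SortOrder(fileNameCol, Ascending), SortOrder(rowPosCol, Ascending))
  Sort(order, global = false, remainingRowsPlan)
}
```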
Resolved review thread: spark3/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java
Resolved review thread (outdated): ...nsions/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/ExtendedScanRelation.scala
I confirm that tests for the non-vectorized Parquet path are working fine. It seems we only fail to populate the position in the vectorized path.
You may want this PR: #1356.
Merged #1356. Thanks for fixing that, @chenjunjiedada! Sorry I didn't get back to review it sooner.
Force-pushed 934a375 to 7a87a43.
Rebased and fixed a couple of bugs with
    }
    for (int i = 0; i < numValsToRead; i += 1) {
      BitVectorHelper.setValidityBitToOne(vec.getValidityBuffer(), i);
    }
This is needed for cases where Arrow checks the validity buffer.
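A hedged sketch of the failure mode, assuming an Arrow version that provides the same calls used in the diff above (ArrowBuf.setLong, BitVectorHelper.setValidityBitToOne): values written straight into a vector's data buffer are invisible to any caller that consults the validity buffer until the validity bits are set explicitly.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{BigIntVector, BitVectorHelper}

// Sketch only: writing directly into the data buffer (as a vectorized reader
// might for _pos) does not mark slots valid, so validity-checking callers see null.
object ValidityBufferSketch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vec = new BigIntVector("_pos", allocator)
    vec.allocateNew(4)

    // write positions straight into the data buffer (8 bytes per BIGINT value)
    for (i <- 0 until 4) {
      vec.getDataBuffer.setLong(i * 8L, i.toLong)
    }
    vec.setValueCount(4)

    println(vec.isNull(0)) // true: the validity bit was never set, so the value looks null

    // mark every slot valid, mirroring the loop added in the diff above
    for (i <- 0 until 4) {
      BitVectorHelper.setValidityBitToOne(vec.getValidityBuffer, i)
    }
    println(vec.isNull(0)) // false: the value is now visible to validity checks

    vec.close()
    allocator.close()
  }
}
```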
Resolved review thread: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
-   } else {
-     idToConstant = ImmutableMap.of();
-   }
+   Map<Integer, ?> idToConstant = PartitionUtil.constantsMap(task, BatchDataReader::convertConstant);
It isn't necessary to check whether any identity partition columns are projected. The code is shorter if the values are available by default, even if they aren't used. This fixes the problem where there are constants to add (like _file) but no identity partition values are projected.
Resolved review thread (outdated): spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDelete.scala
Resolved review thread (outdated): ...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java
I cloned this PR locally and tests seem to work. Thanks, @chenjunjiedada and @rdblue. I left a few questions but the rest looks great!
Force-pushed ffee55c to 5b3b8d8.
Thanks @rdblue and everyone who reviewed.
This updates Spark's DELETE FROM command to sort the retained rows by original file and position, ensuring that the original data clustering is preserved by the command.

Because Spark does not yet support metadata columns, this exposes _file and _pos by adding them automatically to all merge scans. Projecting both columns was mostly supported already; only minor changes were needed to project _file using the constants map supported by Avro, Parquet, and ORC.

This also required refactoring DynamicFileFilter. When projecting _file and _pos but only using _file, the optimizer would throw an exception that the node could not be copied, because it was attempting to rewrite the node with a projection that removed the unused _pos. The fix is to update DynamicFileFilter so that the SupportsFileFilter is passed separately. Then the scan can be passed as a logical plan that the planner can rewrite. This also required updating conversion to physical plan, because the scan plan may be more complicated than a single scan node; a new logical plan wrapper ensures that the scan is converted to an extended scan so that planLater can be used in conversion as normal.
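At the usage level the change is transparent. A small illustration (the catalog, table, and predicate below are made up, and it assumes the Iceberg runtime and SQL extensions are configured on the session): the command simply rewrites the affected files, and with this PR the retained rows are written back in their original file/position order.

```scala
import org.apache.spark.sql.SparkSession

// Illustration only: DELETE FROM rewrites any data file that contains matching
// rows; the retained rows are now sorted by _file and _pos before being written,
// so the rewritten files keep the table's existing clustering.
object DeleteFromExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delete-from-example")
      .getOrCreate()

    // Assumes an Iceberg table registered as catalog.db.events with an event_date column.
    spark.sql("DELETE FROM catalog.db.events WHERE event_date < DATE '2020-01-01'")

    spark.stop()
  }
}
```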