
Conversation

@szehon-ho szehon-ho (Member) commented Feb 1, 2023

These are the Spark-side changes for #6365 (the second part of the initial end-to-end PR: #4812).

Some explanations:

  • Because the RowReader is instantiated with PositionDeletesTable but we need the base table's schema, some APIs that expose the base table had to be added (see the sketch after this list).
  • The ORC and Parquet readers are fixed to handle constant column pushdown. (Previously this code path was not exercised because metadata columns are not pushed down by Spark.)
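
A minimal sketch of the kind of base-table access this refers to; the helper name and the BaseMetadataTable#table() accessor are assumptions for illustration, not necessarily the exact API this PR exposes:

import org.apache.iceberg.BaseMetadataTable;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;

final class BaseTableSchemas {
  private BaseTableSchemas() {}

  // Resolve the schema of the underlying base table when the scanned table is a
  // metadata table such as PositionDeletesTable; otherwise use the table itself.
  static Schema baseSchemaOf(Table table) {
    if (table instanceof BaseMetadataTable) {
      return ((BaseMetadataTable) table).table().schema();  // assumed accessor to the base table
    }
    return table.schema();
  }
}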

addCloseable(orcFileReader);

TypeDescription fileSchema = orcFileReader.getSchema();
Schema schemaWithoutConstantFields =
@szehon-ho szehon-ho (Member Author) Feb 2, 2023

Previously, a schema already pruned of constant columns was passed into ORCIterable for its filter logic.

However, that fails here when trying to bind the constant-column filters to it (those columns are no longer in the schema). With this change we prune out constant columns only where needed, and keep the original schema for filter binding.
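
A rough illustration of the idea described above, with illustrative helper names rather than this PR's exact code: bind the row filter against the full expected schema, and drop constant fields only when deriving the projection actually read from the file.

import java.util.Set;

import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Binder;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.types.TypeUtil;

final class OrcReadSchemas {
  private OrcReadSchemas() {}

  // Read only non-constant columns from the file: prune the constant field ids here.
  static Schema fileReadSchema(Schema expectedSchema, Set<Integer> constantFieldIds) {
    return TypeUtil.selectNot(expectedSchema, constantFieldIds);
  }

  // Bind filters against the unpruned schema so constant-column predicates still resolve.
  static Expression bindFilter(Schema expectedSchema, Expression filter, boolean caseSensitive) {
    return Binder.bind(expectedSchema.asStruct(), filter, caseSensitive);
  }
}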

@aokolnychyi aokolnychyi (Contributor) left a comment

This looks almost ready to me. I had some comments in the reader. Let me also think a bit about the Parquet and ORC changes.

@aokolnychyi (Contributor)

Will take another look today.

@aokolnychyi aokolnychyi (Contributor) left a comment

Apart from one comment in PositionDeletesTable, this looks OK to me. I am still thinking about the best way to deal with residuals.

@szehon-ho szehon-ho force-pushed the position_delete_spark_only_2 branch from a5f00ca to 9eba5b5 Compare February 14, 2023 23:40
@github-actions github-actions bot added API and removed ORC labels Feb 14, 2023
@szehon-ho szehon-ho (Member Author) left a comment

@aokolnychyi as discussed offline, I reverted the constant column fix from the ORC and Parquet readers, and instead use @rdblue's new method from #6599.

Set<Integer> fields = schema.idToName().keySet();
Set<Integer> nonConstantFields =
    fields.stream()
        .filter(id -> schema.findField(id).type().isPrimitiveType())
@szehon-ho szehon-ho (Member Author)

We need to pass in only primitive field ids, because internally ExpressionUtil.extractByIdInclusive uses them to build a dummy identity partition spec, and identity partitions only accept primitive fields.

Alternatively we could push this filtering down into the method itself, but leaving it here keeps the method the same as in #6599.
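
For context, a completed version of the snippet above (the terminal collector is filled in as an illustration; the surrounding call to ExpressionUtil.extractByIdInclusive lives in the reader code):

import java.util.Set;
import java.util.stream.Collectors;

import org.apache.iceberg.Schema;

final class PrimitiveFieldIds {
  private PrimitiveFieldIds() {}

  // Keep only primitive field ids: extractByIdInclusive builds a dummy identity
  // partition spec from these ids, and identity partitions require primitive source fields.
  static Set<Integer> nonConstantFieldIds(Schema schema) {
    return schema.idToName().keySet().stream()
        .filter(id -> schema.findField(id).type().isPrimitiveType())
        .collect(Collectors.toSet());
  }
}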

* A {@link Table} implementation whose {@link Scan} provides {@link PositionDeletesScanTask}, for
* reading of position delete files.
*/
public class PositionDeletesTable extends BaseMetadataTable {
Contributor

This looks good to me.

* @return specs used to rewrite the metadata table filters to partition filters using an
* inclusive projection
*/
static Map<Integer, PartitionSpec> transformSpecs(
Contributor

Looks good too.
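
For context, the inclusive projection mentioned in the javadoc rewrites a row filter into a partition filter roughly like this (an illustrative helper, not this PR's exact call site):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;

final class PartitionFilters {
  private PartitionFilters() {}

  // An inclusive projection may match more partitions than the row filter, never fewer,
  // so it is safe to use for pruning which position delete files to scan.
  static Expression toPartitionFilter(PartitionSpec spec, Expression rowFilter, boolean caseSensitive) {
    return Projections.inclusive(spec, caseSensitive).project(rowFilter);
  }
}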

@aokolnychyi aokolnychyi (Contributor) left a comment

Final nits, and this should be good to go! We should add @rdblue as a co-author while merging.

@aokolnychyi aokolnychyi merged commit bd58ca5 into apache:master Feb 17, 2023
@aokolnychyi (Contributor)

Thanks, @szehon-ho! It is nice to have this done. I also added @rdblue as a co-author since this PR includes some logic from PR #6599.

@szehon-ho (Member Author)

Thanks, I filed a follow-up issue, #6925, to implement pushdown optimization for queries with filters on constant columns (i.e., spec_id, delete_file_path).
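
As an illustration of the kind of query that follow-up targets (the table name and literal path below are examples only):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PositionDeletesFilterExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Scan the position_deletes metadata table with a filter on a constant column.
    // #6925 tracks pushing such filters down so matching delete files are pruned at planning time.
    Dataset<Row> deletes =
        spark.read()
            .table("db.tbl.position_deletes")
            .filter("delete_file_path = 'data/delete-0001.parquet'");

    deletes.show();
  }
}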

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023