Spark: Support vectorized reads with equality deletes #3557

flyrain · 2021-11-15T20:24:58Z

Why do we need this PR?

Vectorized reads with position deletes have been supported since #3287. This PR adds support for equality deletes.

Tests added

No new unit tests are added; the following existing test cases cover the change:
testEqualityDeletes, testEqualityDateDeletes, testEqualityDeletesWithRequiredEqColumn, testMixedPositionAndEqualityDeletes, testEqualityDeletesSpanningMultipleDataFiles, testMultipleEqualityDeleteSchemas, testEqualityDeleteByNull, testEqualityDeleteWithFilter, testReadEqualityDeleteRows

Benchmark

Will add benchmark soon.

cc @rdblue @aokolnychyi, @RussellSpitzer, @karuppayya, @szehon-ho

RussellSpitzer · 2021-11-15T20:32:41Z

data/src/main/java/org/apache/iceberg/data/DeleteFilter.java

    return remainingRowsFilter.filter(records);
  }

+  public Predicate<T> eqDeletedRows() {


I'm a little confused by this naming, since I would assume "Rows" would return rows, and this basically returns a filter.

RussellSpitzer · 2021-11-16T22:39:55Z

Should we have added tests here for mixed equality and positional deletes?

RussellSpitzer · 2021-11-16T22:41:07Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

+   * Reuse the row Id mapping array to filter out equality deleted rows.
+   */
+  private void applyEqDelete(ColumnarBatch batch, int[] posDeleteRowIdMapping, int[] eqDeleteRowIdMapping) {
+    int[] rowIdMapping = posDeleteRowIdMapping == null ? eqDeleteRowIdMapping : posDeleteRowIdMapping;


It seems like this will only respect one of the two mappings? Or am I wrong there? In general I would try to avoid functions which only use some arguments if possible.

Correct. I will move it outside of the method.

RussellSpitzer · 2021-11-16T22:43:10Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

    }
+
+    if (deletes != null && deletes.hasEqDeletes()) {
+      applyEqDelete(batch, rowIdMapping == null ? null : rowIdMapping.first(), eqDeleteRowIdMapping);


I would probably try to extract out the logic for choosing which row mapping to use from "applyEqDelete" and do it here instead so it's simpler to read.
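To make the suggestion concrete, here is a minimal sketch of choosing the row id mapping at the call site so that the filtering method uses every argument it receives. It is not the code from this PR: the class `EqDeleteMappingSketch`, the helpers `identityMapping` and `liveRows`, and the use of an `IntPredicate` in place of the real delete filter are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical sketch; names and signatures are illustrative, not Iceberg's.
final class EqDeleteMappingSketch {

  // Identity mapping [0, 1, ..., numRows - 1], used when no position deletes apply.
  static int[] identityMapping(int numRows) {
    int[] mapping = new int[numRows];
    for (int i = 0; i < numRows; i++) {
      mapping[i] = i;
    }
    return mapping;
  }

  // Compacts rowIdMapping in place, dropping rows matched by isEqDeleted.
  // Returns the number of rows that are still live.
  static int applyEqDelete(int[] rowIdMapping, int numLiveRows, IntPredicate isEqDeleted) {
    int live = 0;
    for (int i = 0; i < numLiveRows; i++) {
      int rowId = rowIdMapping[i];
      if (!isEqDeleted.test(rowId)) {
        rowIdMapping[live++] = rowId;
      }
    }
    return live;
  }

  // Caller side: the choice between the two mappings happens here,
  // outside applyEqDelete.
  static int[] liveRows(int numRows, int[] posDeleteRowIdMapping, IntPredicate isEqDeleted) {
    int[] rowIdMapping = posDeleteRowIdMapping != null
        ? posDeleteRowIdMapping
        : identityMapping(numRows);
    int live = applyEqDelete(rowIdMapping, rowIdMapping.length, isEqDeleted);
    return Arrays.copyOf(rowIdMapping, live);
  }
}
```

With this shape, a batch that already has a position-delete mapping reuses it, and a batch without one starts from the identity mapping; either way `applyEqDelete` sees a single mapping.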

flyrain · 2021-11-17T22:53:16Z

Should we have added tests here for mixed equality and positional deletes?

It is there. Check test testMixedPositionAndEqualityDeletes.

Thanks for the review, @RussellSpitzer. Still working on the benchmark. Will post the new commit along with the fixes for your comments.

flyrain · 2021-11-19T21:56:17Z

Resolved the comments and added the benchmark. Here are the benchmark results for 10M rows with different percentages of equality-deleted rows. The perf gain is about 40% on average; the fewer equality-deleted rows, the larger the gain.

Benchmark                                                    (percentDeleteRow)  Mode  Cnt   Score   Error  Units
IcebergSourceParquetEqDeleteBenchmark.readIceberg                             0    ss    5   3.261 ± 0.210   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                      0.000001    ss    5   3.940 ± 0.253   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                          0.05    ss    5   4.888 ± 0.540   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                          0.25    ss    5   7.369 ± 3.521   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                           0.5    ss    5   9.433 ± 6.369   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                             1    ss    5  20.206 ± 4.298   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                   0    ss    5   1.721 ± 0.120   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized            0.000001    ss    5   2.172 ± 0.112   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                0.05    ss    5   3.305 ± 0.301   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                0.25    ss    5   5.201 ± 3.418   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                 0.5    ss    5   8.869 ± 6.236   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                   1    ss    5  15.589 ± 7.211   s/op
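One way to read the table is to compute, for each delete percentage, the relative reduction in time per operation of the vectorized read against the non-vectorized read. The sketch below does that arithmetic on the Score column only. It ignores the Error column, which is large for several rows, and the class and method names are made up for this example. The averaged figure depends on which metric is used (time reduction versus speedup ratio), so treat it as a rough aid rather than a restatement of the summary above.

```java
// Hypothetical helper for reading the JMH scores above; not part of the PR.
final class EqDeleteBenchmarkMath {

  // Scores in s/op, ordered by percentDeleteRow: 0, 0.000001, 0.05, 0.25, 0.5, 1.
  static final double[] NON_VECTORIZED = {3.261, 3.940, 4.888, 7.369, 9.433, 20.206};
  static final double[] VECTORIZED = {1.721, 2.172, 3.305, 5.201, 8.869, 15.589};

  // Fraction of time saved by the candidate relative to the baseline.
  static double timeReduction(double baseline, double candidate) {
    return 1.0 - candidate / baseline;
  }

  // Unweighted mean of the per-row time reductions.
  static double meanTimeReduction() {
    double sum = 0.0;
    for (int i = 0; i < NON_VECTORIZED.length; i++) {
      sum += timeReduction(NON_VECTORIZED[i], VECTORIZED[i]);
    }
    return sum / NON_VECTORIZED.length;
  }

  public static void main(String[] args) {
    for (int i = 0; i < NON_VECTORIZED.length; i++) {
      System.out.printf("row %d: %.1f%% less time%n",
          i, 100 * timeReduction(NON_VECTORIZED[i], VECTORIZED[i]));
    }
    System.out.printf("mean: %.1f%% less time%n", 100 * meanTimeReduction());
  }
}
```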

RussellSpitzer · 2021-12-07T19:49:19Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.vectorized.ColumnVector;
 import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.jetbrains.annotations.Nullable;


This should probably be javax.annotation.Nullable?

RussellSpitzer · 2021-12-09T02:41:12Z

.../spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnBatchWithRowMapping.java

+  }
+
+  /**
+   * Build a row id mapping inside a batch, which skips delete rows. For example, if the 1st and 3rd rows are deleted in


"skips delete" -> "skips deleted"
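For readers unfamiliar with the idea, a row id mapping lists the in-batch indexes of the rows that survive deletion, so downstream readers can skip deleted rows without copying vectors. The following is a minimal sketch under assumptions, not the method from this PR: the class name, the `Set<Long>` of deleted file positions, and the returned trimmed array are all hypothetical choices for the example.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical sketch of building a row id mapping for one batch.
final class PosDeleteMappingSketch {

  // rowStartPosInBatch: file position of the first row in the batch.
  // deletedPositions: file positions of deleted rows (assumed representation).
  // Returns the in-batch indexes of rows that are not deleted.
  static int[] buildRowIdMapping(long rowStartPosInBatch, int numRows, Set<Long> deletedPositions) {
    int[] mapping = new int[numRows];
    int live = 0;
    for (int i = 0; i < numRows; i++) {
      if (!deletedPositions.contains(rowStartPosInBatch + i)) {
        mapping[live++] = i;
      }
    }
    // Trim to the number of live rows.
    return Arrays.copyOf(mapping, live);
  }
}
```

In this sketch, a five-row batch starting at file position 0 with positions 0 and 2 deleted yields the mapping `[1, 3, 4]`.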

RussellSpitzer · 2021-12-09T17:01:10Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

-        Integer numRows = rowIdMapping.second();
-        arrowColumnVectors[i] = ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMap, numRows);
-      }
+      arrowColumnVectors[i] = batch.hasDeletes() ?


I think I would pass the vectors directly into the BatchWithFilter so everything like this check can be done inside that class as well. Then this class just creates the BatchWithFilter with the proper vectors and then calls .batch, which would return a final ColumnarBatch from inside the class?

I experimented with that approach. It turned out that almost nothing was left in the class ColumnarBatchReader, so I asked myself why I should have another class.

From what I can tell, the main reason we want to divide this is that ColumnarBatchReader can't build this entire object in its constructor. So it's probably fine to keep this as an inner class of ColumnarBatchReader. I just want to keep as much state in a final constructor as possible. Maybe that's a bit weird, so if you have other feelings let me know. But basically I was hoping all of the more complicated state gets created inside an immutable class.

I mean ... I think even the readers could be passed as an arg. Then the only thing that ColumnarBatchReader is responsible for is getting the readers and the number of rows to read?

Changed it to an inner class and moved vector read logic into the inner class.

RussellSpitzer · 2021-12-13T22:57:23Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

+      loadDataToColumnBatch(numRowsToRead);
+    }
+
+    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {


private right?

RussellSpitzer · 2021-12-13T22:57:46Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java


-    rowStartPosInBatch += numRowsToRead;
-    ColumnarBatch batch = new ColumnarBatch(arrowColumnVectors);
+    ColumnVector[] readDataToColumnVectors(int numRowsToRead) {


This could be private too

RussellSpitzer

I think this is good to go. There are some fields that could be marked as private, but it's an inner class so ... it probably doesn't matter. If you are good I can merge.

flyrain · 2021-12-14T00:27:50Z

I think this is good to go. There are some fields that could be marked as private, but it's an inner class so ... it probably doesn't matter. If you are good I can merge.

Thanks @RussellSpitzer for the review. It is a private inner class. Any field or method of the inner class is accessible to its outer class and inaccessible to any other class, so the scope is the same with or without the "private" modifier. It should be OK not to mark them as private.
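The Java access rule behind this point can be shown with a small sketch. The names below (`OuterReaderSketch`, `BatchLoader`, `rowsLoaded`, `batchesLoaded`) are hypothetical and unrelated to the actual classes in the PR, and the nested class is static here only to keep the example short; the access rule is the same for a non-static inner class.

```java
// Hypothetical sketch: members of a private nested class are reachable from
// the enclosing class whether or not they are declared private.
final class OuterReaderSketch {

  private static final class BatchLoader {
    private int rowsLoaded;   // explicitly private
    int batchesLoaded;        // no modifier

    private void load(int numRows) {
      rowsLoaded += numRows;
      batchesLoaded++;
    }
  }

  // The outer class can call the private method and read both fields.
  // Code outside OuterReaderSketch cannot even name BatchLoader,
  // because the nested class itself is private.
  static int demo() {
    BatchLoader loader = new BatchLoader();
    loader.load(3);
    return loader.rowsLoaded + loader.batchesLoaded;
  }
}
```

Marking the members private would therefore only document intent; it would not change which code can reach them.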

RussellSpitzer · 2021-12-14T04:35:42Z

Thanks @flyrain ! You are really taking the vectorization support to a new level here

flyrain · 2021-12-14T18:01:51Z

Thanks a lot for the review, @RussellSpitzer!

Spark: Support vectorized reads with equality deletes

8615c09

github-actions bot added data spark labels Nov 15, 2021

RussellSpitzer reviewed Nov 15, 2021

flyrain added 2 commits November 15, 2021 12:33

Simplify the method applyEqDelete

b680dc1

Fix style issue

bded78a

RussellSpitzer reviewed Nov 16, 2021

flyrain added 3 commits November 19, 2021 13:19

Add benchmark for eq deletes

0b59ce0

Add benchmark for eq deletes

538e919

Resolve comments.

800c4b2

RussellSpitzer reviewed Dec 7, 2021

Refactor

b1e9c8c

RussellSpitzer reviewed Dec 9, 2021

Resolve comments.

5eca20d

RussellSpitzer reviewed Dec 13, 2021

RussellSpitzer approved these changes Dec 13, 2021

RussellSpitzer merged commit 5d37fa3 into apache:master Dec 14, 2021

flyrain mentioned this pull request Jun 27, 2022

Support row-level delete in vectorized reader #3141

Closed

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

e5df2d0

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

314d727

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

a0dfbd7

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

72e6a73

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

ee650af

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 10, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

27999aa

hililiwei mentioned this pull request Aug 10, 2022

Spark 3.1: Port some PRs to Spark 3.1 #5479

Closed

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 15, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

f360528

hililiwei mentioned this pull request Aug 15, 2022

Spark 3.1:Port #3557 to Spark 3.1 #5525

Closed

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 15, 2022

Spark 3.1:Port apache#3557 to Spark 3.1

79d1008
