
Conversation

@flyrain flyrain commented Nov 15, 2021

Why do we need this PR?

Vectorized reads with position deletes are supported after #3287. This PR adds support for equality deletes.

Tests added

No new unit tests are added, but the following test cases cover the change:
testEqualityDeletes, testEqualityDateDeletes, testEqualityDeletesWithRequiredEqColumn, testMixedPositionAndEqualityDeletes, testEqualityDeletesSpanningMultipleDataFiles, testMultipleEqualityDeleteSchemas, testEqualityDeleteByNull, testEqualityDeleteWithFilter, testReadEqualityDeleteRows

Benchmark

Will add a benchmark soon.

cc @rdblue, @aokolnychyi, @RussellSpitzer, @karuppayya, @szehon-ho

return remainingRowsFilter.filter(records);
}

public Predicate<T> eqDeletedRows() {
Member

I'm a little confused by this naming, since I would assume "Rows" means it returns rows, but this basically returns a filter.
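
For illustration, a minimal hypothetical sketch of the kind of API under discussion (EqDeleteFilterSketch, isEqDeleted, and remainingRows are made-up names, not the PR's code): a method whose "Rows" suffix suggests it returns rows, while it actually returns the filter that callers apply.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for the delete-filter class under review; not the Iceberg API.
class EqDeleteFilterSketch<T> {
  private final Predicate<T> isEqDeleted;

  EqDeleteFilterSketch(Predicate<T> isEqDeleted) {
    this.isEqDeleted = isEqDeleted;
  }

  // The name suggests it returns rows, but it actually returns a predicate (the filter).
  public Predicate<T> eqDeletedRows() {
    return isEqDeleted;
  }

  // Callers apply the predicate themselves, e.g. to drop equality-deleted records.
  public List<T> remainingRows(List<T> records) {
    return records.stream().filter(eqDeletedRows().negate()).collect(Collectors.toList());
  }
}
```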

@RussellSpitzer (Member)

Should we have added tests here for mixed equality and positional deletes?

* Reuse the row Id mapping array to filter out equality deleted rows.
*/
private void applyEqDelete(ColumnarBatch batch, int[] posDeleteRowIdMapping, int[] eqDeleteRowIdMapping) {
int[] rowIdMapping = posDeleteRowIdMapping == null ? eqDeleteRowIdMapping : posDeleteRowIdMapping;
Member

It seems like this will only respect one of the two mappings? Or am I wrong there? In general I would try to avoid functions that only use some of their arguments, if possible.

Contributor Author

Correct. I will move it outside of the method.

}

if (deletes != null && deletes.hasEqDeletes()) {
applyEqDelete(batch, rowIdMapping == null ? null : rowIdMapping.first(), eqDeleteRowIdMapping);
Member

I would probably try to extract the logic for choosing which row mapping to use out of "applyEqDelete" and do it here instead, so it's simpler to read.
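
To make the suggestion concrete, a rough sketch of the caller-side refactor (chooseBaseMapping, the reduced applyEqDelete signature, and the plain arrays below are illustrative assumptions, not the PR's final code): the choice between the position-delete mapping and the equality-delete mapping happens once at the call site, so the helper only receives the mapping it actually uses.

```java
// Illustrative only: plain arrays stand in for the PR's Pair/ColumnarBatch types.
class RowIdMappingSketch {
  // Decide once, at the call site, which mapping equality deletes should refine.
  static int[] chooseBaseMapping(int[] posDeleteRowIdMapping, int[] eqDeleteRowIdMapping) {
    return posDeleteRowIdMapping != null ? posDeleteRowIdMapping : eqDeleteRowIdMapping;
  }

  // The helper now takes exactly the mapping it will use, instead of picking one internally.
  static int applyEqDelete(boolean[] isEqDeleted, int[] rowIdMapping) {
    int liveRows = 0;
    for (int i = 0; i < rowIdMapping.length; i++) {
      int originalRowId = rowIdMapping[i];
      if (!isEqDeleted[originalRowId]) {
        rowIdMapping[liveRows] = originalRowId;  // compact surviving row ids to the front
        liveRows++;
      }
    }
    return liveRows;  // number of rows surviving both delete types
  }

  public static void main(String[] args) {
    boolean[] isEqDeleted = {false, true, false, false, true};
    int[] posMapping = {0, 2, 3, 4};    // row 1 already removed by a position delete
    int[] eqMapping = {0, 1, 2, 3, 4};  // identity mapping if no position deletes exist
    int[] base = chooseBaseMapping(posMapping, eqMapping);
    int numLive = applyEqDelete(isEqDeleted, base);  // base now starts with {0, 2, 3}
    System.out.println("live rows: " + numLive);     // prints 3
  }
}
```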

flyrain commented Nov 17, 2021

Should we have added tests here for mixed equality and positional deletes?

It is there. See the test testMixedPositionAndEqualityDeletes.

Thanks for the review, @RussellSpitzer. I'm still working on the benchmark and will post a new commit along with fixes for your comments.

flyrain commented Nov 19, 2021

Resolved the comments and added the benchmark. Here are the benchmark results for 10M rows with different percentages of equality-deleted rows: about 40% perf gain on average, and the fewer the equality-deleted rows, the larger the gain.

Benchmark                                                    (percentDeleteRow)  Mode  Cnt   Score   Error  Units
IcebergSourceParquetEqDeleteBenchmark.readIceberg                             0    ss    5   3.261 ± 0.210   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                      0.000001    ss    5   3.940 ± 0.253   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                          0.05    ss    5   4.888 ± 0.540   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                          0.25    ss    5   7.369 ± 3.521   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                           0.5    ss    5   9.433 ± 6.369   s/op
IcebergSourceParquetEqDeleteBenchmark.readIceberg                             1    ss    5  20.206 ± 4.298   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                   0    ss    5   1.721 ± 0.120   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized            0.000001    ss    5   2.172 ± 0.112   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                0.05    ss    5   3.305 ± 0.301   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                0.25    ss    5   5.201 ± 3.418   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                 0.5    ss    5   8.869 ± 6.236   s/op
IcebergSourceParquetEqDeleteBenchmark.readIcebergVectorized                   1    ss    5  15.589 ± 7.211   s/op

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;
import org.jetbrains.annotations.Nullable;
Member

This should probably be javax.annotation.Nullable?

}

/**
* Build a row id mapping inside a batch, which skips delete rows. For example, if the 1st and 3rd rows are deleted in
Member

"skips delete" -> "skips deleted"

Integer numRows = rowIdMapping.second();
arrowColumnVectors[i] = ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMap, numRows);
}
arrowColumnVectors[i] = batch.hasDeletes() ?
Member

I think I would pass the vectors directly into the BatchWithFilter so everything like this check can be done inside that class as well. Then this class just creates the BatchWithFilter with the proper vectors and calls .batch(), which would return the final ColumnarBatch from inside the class?

Contributor Author

I experimented with that approach. It turned out that almost nothing was left in the class ColumnarBatchReader, so I asked myself why I should have another class.

@RussellSpitzer (Member) Dec 9, 2021

From what I can tell, the main reason we want to divide this is that ColumnarBatchReader can't build this entire object in its constructor. So it's probably fine to keep this as an inner class of ColumnarBatchReader. I just want to keep as much state as possible final and set in the constructor. Maybe that's a bit weird, so if you have other feelings let me know, but basically I was hoping all of the more complicated state gets created inside an immutable class.

@RussellSpitzer (Member) Dec 9, 2021

I mean ... I think even the readers could be passed as an arg. Then the only things ColumnarBatchReader would be responsible for are getting the readers and the number of rows to read?

Contributor Author

Changed it to an inner class and moved the vector read logic into the inner class.
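
A stripped-down sketch of the shape that was settled on, under the assumption that an inner loader class owns the per-batch state (all class and method names below are stand-ins, not the merged code): the outer reader only decides how many rows to read, and the inner class, built once per batch, produces the columnar batch.

```java
class ColumnarBatchReaderSketch {
  // Stand-in for the real Arrow-backed column vectors.
  static class FakeVector {
    final int numRows;
    FakeVector(int numRows) { this.numRows = numRows; }
  }

  // Stand-in for Spark's ColumnarBatch: the vectors plus the surviving row count.
  static class FakeBatch {
    final FakeVector[] vectors;
    final int numLiveRows;
    FakeBatch(FakeVector[] vectors, int numLiveRows) {
      this.vectors = vectors;
      this.numLiveRows = numLiveRows;
    }
  }

  // Inner class: constructed with everything it needs, so its state stays final.
  private class ColumnBatchLoader {
    private final int numRowsToRead;

    ColumnBatchLoader(int numRowsToRead) {
      this.numRowsToRead = numRowsToRead;
    }

    FakeBatch loadDataToColumnBatch() {
      FakeVector[] vectors = readDataToColumnVectors();
      int numLiveRows = numRowsToRead;  // a real loader would apply delete mappings here
      return new FakeBatch(vectors, numLiveRows);
    }

    private FakeVector[] readDataToColumnVectors() {
      return new FakeVector[] {new FakeVector(numRowsToRead)};
    }
  }

  FakeBatch next(int numRowsToRead) {
    // The outer reader only knows how many rows to read; the loader does the rest.
    return new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
  }

  public static void main(String[] args) {
    FakeBatch batch = new ColumnarBatchReaderSketch().next(100);
    System.out.println("rows in batch: " + batch.numLiveRows);
  }
}
```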

loadDataToColumnBatch(numRowsToRead);
}

ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
Member

private right?


rowStartPosInBatch += numRowsToRead;
ColumnarBatch batch = new ColumnarBatch(arrowColumnVectors);
ColumnVector[] readDataToColumnVectors(int numRowsToRead) {
Member

This could be private too

@RussellSpitzer (Member) left a comment

I think this is good to go. There are some fields that could be marked as private, but it's an inner class so ... it probably doesn't matter. If you are good, I can merge.

flyrain commented Dec 14, 2021

I think this is good to go. There are some fields that could be marked as private, but it's an inner class so ... it probably doesn't matter. If you are good, I can merge.

Thanks @RussellSpitzer for the review. It is a private inner class. Any field or method of the inner class is accessible to its outer class and inaccessible to any other class, so the scope is the same with or without the "private" modifier. It should be fine not to mark them as private.
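
A minimal example of the Java visibility rule being described (Outer and Inner are made-up names for illustration): members of a private nested class, even private ones, are accessible from the enclosing class, while the nested class as a whole stays invisible to unrelated classes.

```java
class Outer {
  private static class Inner {
    private int count = 42;  // private, yet readable from Outer below
  }

  public static void main(String[] args) {
    Inner inner = new Inner();
    // Legal: an enclosing class may access private members of its nested classes.
    System.out.println(inner.count);
    // From any class other than Outer (and its nested classes), inner.count would not compile.
  }
}
```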

@RussellSpitzer merged commit 5d37fa3 into apache:master on Dec 14, 2021
@RussellSpitzer (Member)

Thanks @flyrain! You are really taking the vectorization support to a new level here.

flyrain commented Dec 14, 2021

Thanks a lot for the review, @RussellSpitzer!

hililiwei added commits to hililiwei/iceberg that referenced this pull request (Aug 9 to Aug 15, 2022)
