Spark: Support vectorized reads with equality deletes #3557
    return remainingRowsFilter.filter(records);
  }

  public Predicate<T> eqDeletedRows() {
I'm a little confused by this naming, since I would assume "Rows" means it returns rows, and this basically returns a filter.
Should we have added tests here for mixed equality and positional deletes?
   * Reuse the row Id mapping array to filter out equality deleted rows.
   */
  private void applyEqDelete(ColumnarBatch batch, int[] posDeleteRowIdMapping, int[] eqDeleteRowIdMapping) {
    int[] rowIdMapping = posDeleteRowIdMapping == null ? eqDeleteRowIdMapping : posDeleteRowIdMapping;
It seems like this will only respect one of the two mappings? Or am I wrong there? In general I would try to avoid functions which only use some arguments if possible.
Correct. I will move it outside of the method.
    }

    if (deletes != null && deletes.hasEqDeletes()) {
      applyEqDelete(batch, rowIdMapping == null ? null : rowIdMapping.first(), eqDeleteRowIdMapping);
I would probably try to extract out the logic for choosing which row mapping to use from "applyEqDelete" and do it here instead so it's simpler to read.
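A minimal sketch of what this suggestion could look like, not the code from this PR: the caller decides which mapping to reuse, and the equality-delete step receives exactly one mapping. The class name, `chooseMapping`, the `IntPredicate` parameter, and the returned live-row count are all hypothetical and chosen only for illustration.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: names and signatures are assumptions, not the PR's API.
final class EqDeleteMappingSketch {
  private EqDeleteMappingSketch() {}

  /** Caller-side choice: reuse the position-delete mapping if present, else start from identity. */
  static int[] chooseMapping(int[] posDeleteRowIdMapping, int numRows) {
    if (posDeleteRowIdMapping != null) {
      return posDeleteRowIdMapping;
    }
    int[] identity = new int[numRows];
    for (int i = 0; i < numRows; i++) {
      identity[i] = i;
    }
    return identity;
  }

  /**
   * Compacts the first numLiveRows entries of rowIdMapping in place, dropping
   * row ids that isEqDeleted marks as deleted. Returns the new live-row count.
   */
  static int applyEqDelete(int[] rowIdMapping, int numLiveRows, IntPredicate isEqDeleted) {
    int kept = 0;
    for (int i = 0; i < numLiveRows; i++) {
      int rowId = rowIdMapping[i];
      if (!isEqDeleted.test(rowId)) {
        rowIdMapping[kept++] = rowId;
      }
    }
    return kept;
  }
}
```

With this shape every argument of `applyEqDelete` is used, and the null check on the position-delete mapping lives in one place at the call site.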
It is there. Check test
Thanks for the review, @RussellSpitzer. Still working on the benchmark. Will post the new commit along with the fixes for your comments.
Resolved the comments and added the benchmark. Here are the benchmark results for 10M rows with different percentages of equality-deleted rows: about a 40% performance gain on average. The fewer equality-deleted rows, the larger the gain.
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;
import org.jetbrains.annotations.Nullable;
This should probably be javax.annotation.Nullable?
  }

  /**
   * Build a row id mapping inside a batch, which skips delete rows. For example, if the 1st and 3rd rows are deleted in
"skips delete" -> "skips deleted"
        Integer numRows = rowIdMapping.second();
        arrowColumnVectors[i] = ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMap, numRows);
      }
      arrowColumnVectors[i] = batch.hasDeletes() ?
I think I would pass the vectors directly into the BatchWithFilter so that everything like this check can be done inside that class as well. Then this class just creates the BatchWithFilter with the proper vectors and then calls .batch, which would return a final ColumnarBatch from inside the class?
I experimented with that approach. It turned out that almost nothing was left in the ColumnarBatchReader class, so I asked myself why I should have another class.
From what I can tell, the main reason we want to divide this is that ColumnarBatchReader can't build this entire object in its constructor, so it's probably fine to keep this as an inner class of ColumnarBatchReader. I just want to keep as much state as possible in final fields set in a constructor. Maybe that's a bit weird, so if you have other feelings, let me know. Basically, I was hoping all of the more complicated state gets created inside an immutable class.
I mean ... I think even the readers could be passed as an arg. Then the only things ColumnarBatchReader is responsible for are getting the readers and the number of rows to read?
Changed it to an inner class and moved vector read logic into the inner class.
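The shape discussed in this thread can be sketched roughly as follows. This illustrates the pattern only and is not the PR's implementation: `BatchReaderSketch`, `BatchLoader`, and the plain `int[][]` columns are hypothetical stand-ins for the reader, its inner class, and the column vectors.

```java
import java.util.Arrays;

// Hypothetical sketch of "outer reader + immutable inner loader"; all names are assumptions.
final class BatchReaderSketch {
  // Stand-in for the per-column readers that the outer class owns.
  private final int[][] columns;

  BatchReaderSketch(int[][] columns) {
    this.columns = columns;
  }

  /** The outer class only decides what to read; the inner class builds the batch. */
  int[][] read(int start, int numRowsToRead) {
    return new BatchLoader(start, numRowsToRead).batch();
  }

  /** Per-batch state is computed once in the constructor and never reassigned. */
  private final class BatchLoader {
    private final int[][] batch;

    BatchLoader(int start, int numRowsToRead) {
      int[][] vectors = new int[columns.length][];
      for (int i = 0; i < columns.length; i++) {
        vectors[i] = Arrays.copyOfRange(columns[i], start, start + numRowsToRead);
      }
      this.batch = vectors;
    }

    int[][] batch() {
      return batch;
    }
  }
}
```

Because the inner class is non-static, it can use the outer reader's state directly, while everything specific to one batch lives in final fields of a short-lived object.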
      loadDataToColumnBatch(numRowsToRead);
    }

    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
This should be private, right?

      rowStartPosInBatch += numRowsToRead;
      ColumnarBatch batch = new ColumnarBatch(arrowColumnVectors);
    ColumnVector[] readDataToColumnVectors(int numRowsToRead) {
This could be private too
RussellSpitzer left a comment
I think this is good to go. There are some fields that could be marked as private, but it's an inner class, so it probably doesn't matter. If you are good, I can merge.
Thanks @RussellSpitzer for the review. It is a private inner class. Any field or method of the inner class is accessible to its outer class and inaccessible to any other class, so the scope is the same with or without the "private" modifier. It should be OK not to mark them private.
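The access rule described here can be checked with a tiny standalone example. The class and member names are hypothetical. The point is only that an enclosing class can read members of its private nested class whether or not they carry the `private` modifier, while other classes cannot even name the nested type.

```java
// Hypothetical illustration of Java access rules for a private nested class.
final class OuterReaderSketch {
  private OuterReaderSketch() {}

  private static final class Loader {
    int unmarked = 1;        // no modifier
    private int marked = 2;  // explicitly private

    private int sum() {
      return unmarked + marked;
    }
  }

  /** The enclosing class can access all three members of Loader. */
  static int readBoth() {
    Loader loader = new Loader();
    return loader.unmarked + loader.marked + loader.sum();
  }
}
```

Here `OuterReaderSketch.readBoth()` compiles and returns 6, whereas a reference to `OuterReaderSketch.Loader` from any other top-level class is a compile error because the nested type itself is private.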
Thanks @flyrain! You are really taking the vectorization support to a new level here.
Thanks a lot for the review, @RussellSpitzer!
Why do we need this PR?
Vectorized reads with position deletes have been supported since #3287. This PR adds support for equality deletes.
Tests added
No new unit tests are added, but the following test cases cover the change.
testEqualityDeletes, testEqualityDateDeletes, testEqualityDeletesWithRequiredEqColumn, testMixedPositionAndEqualityDeletes, testEqualityDeletesSpanningMultipleDataFiles, testMultipleEqualityDeleteSchemas, testEqualityDeleteByNull, testEqualityDeleteWithFilter, testReadEqualityDeleteRows
Benchmark
Will add benchmark soon.
cc @rdblue @aokolnychyi, @RussellSpitzer, @karuppayya, @szehon-ho