-
Notifications
You must be signed in to change notification settings - Fork 3k
Parquet: Add vectorized position reader #1356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Add vectorized position reader #1356
Conversation
|
@rdblue , This is for vectorized parquet position reader and also include a follow up for skipping reading footer redundantly. Could you please take a look at your convenience? |
| public static VectorizedArrowReader positions() { | ||
| return PositionVectorReader.INSTANCE; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure if returning a singleton instance for PostitionVectorReader is safe, since it contains a member variable rowStart which seems to differ for every row group created. Can there be a possibility of multiple tasks running on the same executor JVM and wanting to refer to different PostionVectorReaders at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment.
The INSTANCE is defined to return a new PosistionVectorReader. It doesn't have a class scope field such as instance and the null checking logic, so it is not a singleton.
IIUC, spark will assign the number of spark.executor.cores of tasks per executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The INSTANCE field is defined is static, so I am guessing new PositionVectorReader() will only be called once when the class is being loaded by the JVM. This seems like mimicking a singleton to me, unless I am reading the things wrongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! I forgot the static keyword, it is an eager mode of the singleton. Just remove using singleton in 52f4468.
shardulm94
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, but otherwise LGTM!
| } | ||
| } | ||
|
|
||
| public static class PositionVectorHolder extends VectorHolder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like technically this class is redundant since the user can use VectorHolder directly, but is probably good for readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I added a private VectorHolder constructor for PositionVectionHolder which I don't want others to use it directly.
| Field arrowField = ArrowSchemaUtil.convert(MetadataColumns.ROW_POSITION); | ||
| FieldVector vec = arrowField.createVector(ArrowAllocation.rootAllocator()); | ||
| ((BigIntVector) vec).allocateNew(numValsToRead); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this follow an approach similar to VectorizedArrowReader and not create a new FieldVector and NullabilityHolder for every invocation?
iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Lines 123 to 125 in 52f4468
| } else { | |
| vec.setValueCount(0); | |
| nullabilityHolder.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to me, updated in 73de369
|
@rdblue , Could you please help to take a look on this? |
|
Thanks @chenjunjiedada, this is definitely on my list to review. I'll take a look as soon as I can. |
holdenk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still figuring my way around the code base, so my question might be silly and thanks for working on this :)
| ((BigIntVector) vec).allocateNew(numValsToRead); | ||
| for (int i = 0; i < numValsToRead; i += 1) { | ||
| vec.getDataBuffer().setLong(i * Long.BYTES, rowStart + i); | ||
| nulls = new NullabilityHolder(numValsToRead); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we setting this inside of the for loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @holdenk, This is a problem. Let me move this out of the loop.
949c250 to
a34a9df
Compare
|
@rdblue , Do we want to include this to 0.10.0? This includes an optimization that avoids reading extra footer in case of no position column. |
|
@shardulm94 , Does that ORC fix also valid in parquet side? |
Separate fixes should be added in different PRs. Thanks for fixing this, I'd like to get it in without blocking on vectorization.
I don't think so. We want to get 0.10.0 out and even if the vectorization changes made it in, we wouldn't be able to read delete files in the vectorized path. |
@chenjunjiedada Which fix you are referring to? |
|
@shardulm94 I meant #1706. |
|
@chenjunjiedada That bug was specific to ORC and does not exist in the Parquet reader. |
a34a9df to
ca79a67
Compare
|
Thanks for fixing this, @chenjunjiedada! And thanks to @shardulm94 and @holdenk for reviewing! |
This adds parquet position reader for the vectorized case.