@chenjunjiedada
Collaborator

@chenjunjiedada chenjunjiedada commented Aug 18, 2020

This adds a Parquet position reader for the vectorized read path.

@chenjunjiedada
Collaborator Author

@rdblue, this is the vectorized Parquet position reader, and it also includes a follow-up that skips redundant footer reads. Could you please take a look at your convenience?

Comment on lines 337 to 343
public static VectorizedArrowReader positions() {
  return PositionVectorReader.INSTANCE;
}
Contributor

@shardulm94 shardulm94 Sep 11, 2020

I am unsure whether returning a singleton instance of PositionVectorReader is safe, since it has a member variable rowStart that seems to differ for every row group. Could multiple tasks running in the same executor JVM need different PositionVectorReaders at the same time?

Collaborator Author

Thanks for your comment.

INSTANCE is defined to return a new PositionVectorReader. There is no class-scope field such as instance and no null-checking logic, so it is not a singleton.

IIUC, Spark assigns up to spark.executor.cores tasks to each executor at a time.

Contributor

@shardulm94 shardulm94 Sep 13, 2020

The INSTANCE field is declared static, so I am guessing new PositionVectorReader() is called only once, when the JVM initializes the class. This looks like a singleton to me, unless I am misreading things.

Collaborator Author

@chenjunjiedada chenjunjiedada Sep 13, 2020

You are right! I forgot about the static keyword; this is an eagerly initialized singleton. I removed the singleton in 52f4468.
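
To illustrate the point settled in this thread, here is a minimal sketch, not the actual Iceberg code. `PositionReaderSketch`, `shared()`, `fresh()`, `setRowGroupStart` and `positionOf` are hypothetical names; only the `static ... INSTANCE = new ...` pattern and the `rowStart` field come from the discussion. It shows that a static field initialized with `new` yields one shared object, so per-row-group state set by one caller is visible to every other caller.

```java
// Hypothetical stand-in for the reader under review; not the Iceberg class.
final class PositionReaderSketch {
  // Eagerly initialized singleton: the constructor runs once, during class
  // initialization, and every caller of shared() gets the same object.
  static final PositionReaderSketch INSTANCE = new PositionReaderSketch();

  // Per-row-group state, as described in the review comment.
  private long rowStart;

  static PositionReaderSketch shared() {
    return INSTANCE;
  }

  // Alternative: hand out a new reader on every call.
  static PositionReaderSketch fresh() {
    return new PositionReaderSketch();
  }

  void setRowGroupStart(long start) {
    this.rowStart = start;
  }

  long positionOf(int offsetInBatch) {
    return rowStart + offsetInBatch;
  }
}

final class SingletonSketchDemo {
  public static void main(String[] args) {
    PositionReaderSketch a = PositionReaderSketch.shared();
    PositionReaderSketch b = PositionReaderSketch.shared();
    a.setRowGroupStart(0L);
    b.setRowGroupStart(1000L); // overwrites the state that 'a' relies on
    System.out.println(a.positionOf(5)); // prints 1005, not 5

    PositionReaderSketch c = PositionReaderSketch.fresh();
    PositionReaderSketch d = PositionReaderSketch.fresh();
    c.setRowGroupStart(0L);
    d.setRowGroupStart(1000L);
    System.out.println(c.positionOf(5)); // prints 5
  }
}
```

Handing out a new instance per call, as `fresh()` does, keeps `rowStart` private to each reader. The sketch is single-threaded; with concurrent tasks the shared variant would additionally be a data race.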

Contributor

@shardulm94 shardulm94 left a comment

Minor comment, but otherwise LGTM!

}
}

public static class PositionVectorHolder extends VectorHolder {
Contributor

@shardulm94 shardulm94 Sep 15, 2020

It seems this class is technically redundant, since callers can use VectorHolder directly, but it is probably good for readability?

Collaborator Author

Yes, I added a private VectorHolder constructor for PositionVectorHolder that I don't want others to use directly.

Comment on lines 368 to 370
Field arrowField = ArrowSchemaUtil.convert(MetadataColumns.ROW_POSITION);
FieldVector vec = arrowField.createVector(ArrowAllocation.rootAllocator());
((BigIntVector) vec).allocateNew(numValsToRead);
Contributor

Can this follow an approach similar to VectorizedArrowReader and not create a new FieldVector and NullabilityHolder for every invocation?

} else {
vec.setValueCount(0);
nullabilityHolder.reset();

Collaborator Author

Makes sense to me, updated in 73de369.
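
The reuse pattern requested in this thread can be sketched roughly as follows. This is a simplified stand-in that uses a plain `long[]` and `boolean[]` in place of Arrow's `FieldVector` and Iceberg's `NullabilityHolder`; the class `ReusablePositionBuffer` and all of its members are hypothetical and are not taken from the PR.

```java
// Hypothetical sketch of "allocate once, reset on later calls".
final class ReusablePositionBuffer {
  private long[] positions;    // stand-in for the Arrow vector
  private boolean[] isNull;    // stand-in for the nullability holder
  private int valueCount;      // number of valid entries after the last read
  private int allocationCount; // only here to make the reuse observable

  long[] read(long rowStart, int numValsToRead) {
    if (positions == null || positions.length < numValsToRead) {
      // First call, or the batch grew: allocate the containers.
      positions = new long[numValsToRead];
      isNull = new boolean[numValsToRead];
      allocationCount += 1;
    } else {
      // Later calls: reset the existing containers instead of reallocating.
      valueCount = 0;
      java.util.Arrays.fill(isNull, false);
    }

    for (int i = 0; i < numValsToRead; i += 1) {
      positions[i] = rowStart + i;
    }
    valueCount = numValsToRead;
    return positions;
  }

  int valueCount() {
    return valueCount;
  }

  int allocationCount() {
    return allocationCount;
  }
}
```

Because the array is reused, it can be longer than the current batch, so callers of this sketch would read only the first `valueCount()` entries. A batch from an earlier call is also overwritten by the next one, which is the usual trade-off of reusing containers.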

@chenjunjiedada
Collaborator Author

@rdblue, could you please take a look at this?

@rdblue
Contributor

rdblue commented Sep 19, 2020

Thanks @chenjunjiedada, this is definitely on my list to review. I'll take a look as soon as I can.

Contributor

@holdenk holdenk left a comment

I'm still finding my way around the code base, so my question might be silly. Thanks for working on this :)

((BigIntVector) vec).allocateNew(numValsToRead);
for (int i = 0; i < numValsToRead; i += 1) {
vec.getDataBuffer().setLong(i * Long.BYTES, rowStart + i);
nulls = new NullabilityHolder(numValsToRead);
Contributor

Why are we setting this inside of the for loop?

Collaborator Author

Thanks @holdenk, this is a problem. Let me move it out of the loop.
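
A rough sketch of the hoisted version follows, under stated assumptions: it uses `java.nio.ByteBuffer` as a stand-in for the Arrow data buffer and a `boolean[]` as a stand-in for `NullabilityHolder`, and the names `PositionFillSketch`, `Batch` and `fill` are hypothetical. In the snippet under review, the holder was created once per iteration, so `numValsToRead` objects were allocated and only the last one was kept.

```java
// Hypothetical sketch; not the code that was merged.
final class PositionFillSketch {
  // Simple pair of the filled data buffer and its null flags.
  static final class Batch {
    final java.nio.ByteBuffer data;
    final boolean[] isNull;

    Batch(java.nio.ByteBuffer data, boolean[] isNull) {
      this.data = data;
      this.isNull = isNull;
    }
  }

  static Batch fill(long rowStart, int numValsToRead) {
    java.nio.ByteBuffer data = java.nio.ByteBuffer.allocate(numValsToRead * Long.BYTES);

    // Allocated once, before the loop. Positions are never null,
    // so every flag keeps its default value of false.
    boolean[] isNull = new boolean[numValsToRead];

    for (int i = 0; i < numValsToRead; i += 1) {
      // Absolute write at a byte offset, mirroring setLong(i * Long.BYTES, ...).
      data.putLong(i * Long.BYTES, rowStart + i);
    }
    return new Batch(data, isNull);
  }
}
```

For example, `PositionFillSketch.fill(100L, 4).data.getLong(3 * Long.BYTES)` evaluates to `103`.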

@chenjunjiedada
Collaborator Author

@rdblue, do we want to include this in 0.10.0? It includes an optimization that avoids reading an extra footer when there is no position column.

@chenjunjiedada
Collaborator Author

@shardulm94, is that ORC fix also valid on the Parquet side?

@rdblue
Contributor

rdblue commented Nov 3, 2020

> It includes an optimization that avoids reading an extra footer when there is no position column.

Separate fixes should be added in different PRs. Thanks for fixing this, I'd like to get it in without blocking on vectorization.

> Do we want to include this in 0.10.0?

I don't think so. We want to get 0.10.0 out and even if the vectorization changes made it in, we wouldn't be able to read delete files in the vectorized path.

@shardulm94
Contributor

> @shardulm94, is that ORC fix also valid on the Parquet side?

@chenjunjiedada Which fix are you referring to?

@chenjunjiedada
Collaborator Author

@shardulm94 I meant #1706.

@shardulm94
Contributor

@chenjunjiedada That bug was specific to ORC and does not exist in the Parquet reader.

@rdblue
Contributor

rdblue commented Dec 18, 2020

Thanks for fixing this, @chenjunjiedada! And thanks to @shardulm94 and @holdenk for reviewing!

@rdblue rdblue merged commit 6379050 into apache:master Dec 18, 2020
@chenjunjiedada chenjunjiedada deleted the add-position-for-parquet-vectorized-reader branch January 4, 2021 12:56
parthchandra pushed a commit to parthchandra/iceberg that referenced this pull request Oct 22, 2025