API, Core: Expose file and data sequence numbers through ContentFile #7555
Resolves #7449
Will take a look today. Thanks, @gaborkaszab!
}

/** Returns the data sequence number of the snapshot in which the file should be applied. */
default Long dataSequenceNumber() {
Based on the spec, shouldn't we call this sequenceNumber instead of dataSequenceNumber? Although I know dataSequenceNumber is probably clearer.
We actually deprecated and removed sequenceNumber in ManifestEntry because it was not clear what it represented. I'd be inclined to match what we currently have in ManifestEntry.
Yeah, I saw that ManifestEntry.sequenceNumber() was removed, hence I used the same naming in ContentFile to be consistent with the current state of ManifestEntry.
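For readers following the naming discussion, here is a minimal sketch of what exposing both sequence numbers through default interface methods could look like. `TrackedFile`, `SimpleFile`, and `LegacyFile` are hypothetical stand-ins and not the real Iceberg types, and returning `null` from the defaults is an assumption made for illustration rather than something this thread confirms.

```java
public class SequenceNumberSketch {

  /** Hypothetical stand-in for ContentFile; not the real Iceberg interface. */
  interface TrackedFile {
    String path();

    /** Data sequence number, or null when unknown (assumed default for this sketch). */
    default Long dataSequenceNumber() {
      return null;
    }

    /** File sequence number, or null when unknown (assumed default for this sketch). */
    default Long fileSequenceNumber() {
      return null;
    }
  }

  /** Implementation that carries both sequence numbers explicitly. */
  static final class SimpleFile implements TrackedFile {
    private final String path;
    private final Long dataSequenceNumber;
    private final Long fileSequenceNumber;

    SimpleFile(String path, Long dataSequenceNumber, Long fileSequenceNumber) {
      this.path = path;
      this.dataSequenceNumber = dataSequenceNumber;
      this.fileSequenceNumber = fileSequenceNumber;
    }

    @Override
    public String path() {
      return path;
    }

    @Override
    public Long dataSequenceNumber() {
      return dataSequenceNumber;
    }

    @Override
    public Long fileSequenceNumber() {
      return fileSequenceNumber;
    }
  }

  /** Older implementation that compiles unchanged because it inherits the defaults. */
  static final class LegacyFile implements TrackedFile {
    private final String path;

    LegacyFile(String path) {
      this.path = path;
    }

    @Override
    public String path() {
      return path;
    }
  }

  public static void main(String[] args) {
    TrackedFile legacy = new LegacyFile("legacy.parquet");
    TrackedFile current = new SimpleFile("current.parquet", 3L, 5L);
    System.out.println(legacy.path() + " data seq: " + legacy.dataSequenceNumber());
    System.out.println(current.path() + " data seq: " + current.dataSequenceNumber()
        + ", file seq: " + current.fileSequenceNumber());
  }
}
```

Using default methods keeps existing implementations source-compatible, which is one plausible reason for the `default Long dataSequenceNumber()` signature in the diff above.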
private byte[] keyMetadata = null;
private Integer sortOrderId;

private Long dataSequenceNumber = null;
Is there any particular reason to have these two as a separate block? Right now, BaseFile only has two blocks, to keep optional fields separate.
I hadn't noticed that pattern with the blocks, because 'avroSchema' and 'fromProjectionPos/partitionType' also form their own blocks. Regardless, I merged the new ones into the optional block.
Are these two indeed optional? I'd say they belong more to the block with partitionSpecId. What do you think?
Yeah, giving it a second thought, they indeed aren't optional. Added them to the same block as partitionSpecId.
I think this approach is correct; I had mostly minor comments. I also wonder whether we should prohibit passing new data files with a set explicit Thoughts, @jackye1995 @RussellSpitzer @rdblue?
V1Assert.assertEquals(
    "Data sequence number should default to 0", 0, entry.dataSequenceNumber().longValue());
V1Assert.assertEquals(
    "Data sequence number should default to 0",
Do we need an assert on the file sequence number too?
I added a check for the file sequence number.
However, I'm wondering whether the checks for dataSequenceNumber are correct. If I'm not mistaken, after a rewrite the dataSequenceNumber on the ManifestEntry can differ from the one on the Snapshot. So these checks (even the existing ones) are correct only if there were no rewrites on the table. Is my understanding correct here?
@gaborkaszab, you are correct that rewrites may preserve the initial data sequence number. This method would not work for such rewrites. I believe we have other tests that pass an iterator of sequence numbers that should be used in those cases.
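The point about rewrites can be illustrated with a small toy model. This is only a sketch: `Entry`, `append`, `rewrite`, and `matchesSnapshot` are hypothetical names, not Iceberg APIs, and the model assumes that a rewritten file keeps the data sequence number of the file it replaces while its file sequence number is that of the rewriting snapshot.

```java
public class RewriteSequenceSketch {

  /** Hypothetical manifest entry carrying both sequence numbers. */
  static final class Entry {
    final String path;
    final long dataSequenceNumber;
    final long fileSequenceNumber;

    Entry(String path, long dataSequenceNumber, long fileSequenceNumber) {
      this.path = path;
      this.dataSequenceNumber = dataSequenceNumber;
      this.fileSequenceNumber = fileSequenceNumber;
    }
  }

  /** A plain append: both numbers equal the sequence number of the committing snapshot. */
  static Entry append(String path, long snapshotSequenceNumber) {
    return new Entry(path, snapshotSequenceNumber, snapshotSequenceNumber);
  }

  /** A rewrite (assumed behaviour): keep the original data sequence number. */
  static Entry rewrite(Entry original, String newPath, long snapshotSequenceNumber) {
    return new Entry(newPath, original.dataSequenceNumber, snapshotSequenceNumber);
  }

  /** The kind of check discussed above: does the entry match the snapshot's sequence number? */
  static boolean matchesSnapshot(Entry entry, long snapshotSequenceNumber) {
    return entry.dataSequenceNumber == snapshotSequenceNumber;
  }

  public static void main(String[] args) {
    Entry appended = append("data-1.parquet", 1L);
    Entry rewritten = rewrite(appended, "data-1-compacted.parquet", 4L);

    // Holds for the append, fails for the rewrite.
    System.out.println("append matches snapshot 1: " + matchesSnapshot(appended, 1L));
    System.out.println("rewrite matches snapshot 4: " + matchesSnapshot(rewritten, 4L));
  }
}
```

Under these assumptions, an assertion that compares an entry's data sequence number with the snapshot's sequence number holds only for tables without rewrites, which matches the concern raised in the question.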
rdblue left a comment
Looks like a reasonable approach to this.
cbce8b2 to c6a8c0a
Let me take another look in a bit.
aokolnychyi left a comment
The implementation looks good to me. I had one question about where to put the new variables in BaseFile. I believe @amogh-jahagirdar also had some nits. Otherwise, this should be good to go.
Co-authored-by: chenjunjiedada <jimmyjchen@tencent.com>
c6a8c0a to e2f1e6e
amogh-jahagirdar left a comment
Looks good to me, thanks @gaborkaszab!
Merging since we have a few approvals and no blocking comments. Thanks @gaborkaszab for the PR, and @rdblue @aokolnychyi @stevenzwu @jackye1995 for the reviews.
Thanks for the reviews, @amogh-jahagirdar @aokolnychyi @rdblue @stevenzwu @jackye1995!