Skip to content

Conversation

@gaborkaszab
Copy link
Collaborator

No description provided.

@gaborkaszab
Copy link
Collaborator Author

Resolves #7449

@aokolnychyi
Copy link
Contributor

Will take a look today. Thanks, @gaborkaszab!

}

/** Returns the data sequence number of the snapshot in which the file should be applied. */
default Long dataSequenceNumber() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the spec, we should call this sequenceNumber instead of dataSequenceNumber? Although I know dataSequenceNumber is probably more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually deprecated and removed sequenceNumber in ManifestEntry cause it was not clear what it represents. I'd be inclined to match what we currently have in ManifestEntry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw that ManifestEntry.sequenceNumber() was removed, hence I used the same naming in ContentFile to be consistent with the current state of ManifestEntry.

@jackye1995 jackye1995 added this to the Iceberg 1.3.0 milestone May 8, 2023
private byte[] keyMetadata = null;
private Integer sortOrderId;

private Long dataSequenceNumber = null;
Copy link
Contributor

@aokolnychyi aokolnychyi May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any particular reason to have these two as a separate block? Right now, BaseFile only have two blocks to keep optional fields separate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realise that pattern with the blocks because 'avroSchema' and 'fromProjectionPos/partitionType' also make their own block. Regardless, I merged the new ones with the optional block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these two indeed optional? I'd say they belong more to the block with partitionSpecId. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, giving it a second thought, they indeed aren't optional. Added them to the same block as partitionSpecId.

@aokolnychyi
Copy link
Contributor

aokolnychyi commented May 9, 2023

I think this approach is correct, I had mostly minor comments.

I also wonder whether we should prohibit passing new data files with a set explicit fileSequenceNumber, just like we don't allow new manifests with a set snapshot ID. The file sequence number must be assigned at commit.

Thoughts, @jackye1995 @RussellSpitzer @rdblue?

V1Assert.assertEquals(
"Data sequence number should default to 0", 0, entry.dataSequenceNumber().longValue());
V1Assert.assertEquals(
"Data sequence number should default to 0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need assert on file sequence number too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check for file sequence number.

However, I'm wondering if the checks for dataSequenceNumber are correct. If I'm not mistaken if there was a rewrite then the dataSequenceNumber on the ManifestEntry can differ from the one on the Snapshot. So these checks (even the existing ones) are correct only if there were no rewrites on the table. Is my understanding correct here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaborkaszab, you are correct that rewrites may preserve the initial data sequence number. This method would not work for such rewrites. I believe we have other tests that pass an iterator of sequence numbers that should be used in those cases.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a reasonable approach to this.

@gaborkaszab gaborkaszab self-assigned this May 11, 2023
@aokolnychyi
Copy link
Contributor

Let me take another look in a bit.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me. I had one question about the place where to put new variables in BaseFile. I believe @amogh-jahagirdar also had some nits. Otherwise, should be good to go.

Co-authored-by: chenjunjiedada <jimmyjchen@tencent.com>
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me thanks @gaborkaszab !

@amogh-jahagirdar
Copy link
Contributor

Merging since we have a few approvals, and no blocking comments. thanks @gaborkaszab for the PR, and @rdblue @aokolnychyi @stevenzwu @jackye1995 for reviews.

@amogh-jahagirdar amogh-jahagirdar merged commit 9907a97 into apache:master May 16, 2023
@gaborkaszab
Copy link
Collaborator Author

Thanks for the reviews, @amogh-jahagirdar @aokolnychyi @rdblue @stevenzwu @jackye1995 !

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants