API, Core: Expose file and data sequence numbers through ContentFile #7555

gaborkaszab · 2023-05-08T12:16:30Z

No description provided.

gaborkaszab · 2023-05-08T12:51:34Z

Resolves #7449

aokolnychyi · 2023-05-08T15:50:31Z

Will take a look today. Thanks, @gaborkaszab!

jackye1995 · 2023-05-08T18:10:45Z

api/src/main/java/org/apache/iceberg/ContentFile.java

  }

+  /** Returns the data sequence number of the snapshot in which the file should be applied. */
+  default Long dataSequenceNumber() {


Based on the spec, should we call this sequenceNumber instead of dataSequenceNumber? Although I know dataSequenceNumber is probably clearer.

We actually deprecated and removed sequenceNumber in ManifestEntry because it was not clear what it represented. I'd be inclined to match what we currently have in ManifestEntry.

Yeah, I saw that ManifestEntry.sequenceNumber() was removed, hence I used the same naming in ContentFile to be consistent with the current state of ManifestEntry.
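
For readers following the naming discussion, here is a minimal sketch of how two accessors can be added to an interface as default methods so that existing implementations keep compiling. `SketchContentFile`, `LegacyFile`, and `TrackedFile` are hypothetical names, not Iceberg classes, and returning `null` from the defaults is an assumption: the diff excerpt above does not show the method body.

```java
// Hypothetical stand-in for ContentFile; only the two accessors are modeled.
interface SketchContentFile {
  /** Data sequence number, or null if the implementation does not track it (assumed default). */
  default Long dataSequenceNumber() {
    return null;
  }

  /** File sequence number, or null if the implementation does not track it (assumed default). */
  default Long fileSequenceNumber() {
    return null;
  }
}

// An older implementation that compiles unchanged because of the defaults.
final class LegacyFile implements SketchContentFile {}

// An implementation that carries both values.
final class TrackedFile implements SketchContentFile {
  private final Long dataSequenceNumber;
  private final Long fileSequenceNumber;

  TrackedFile(Long dataSequenceNumber, Long fileSequenceNumber) {
    this.dataSequenceNumber = dataSequenceNumber;
    this.fileSequenceNumber = fileSequenceNumber;
  }

  @Override
  public Long dataSequenceNumber() {
    return dataSequenceNumber;
  }

  @Override
  public Long fileSequenceNumber() {
    return fileSequenceNumber;
  }
}
```

Keeping the two names distinct in the interface mirrors the point made above: a bare `sequenceNumber` would not say which of the two values it returns.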

api/src/main/java/org/apache/iceberg/ContentFile.java

aokolnychyi · 2023-05-09T00:44:09Z

core/src/main/java/org/apache/iceberg/BaseFile.java

  private byte[] keyMetadata = null;
  private Integer sortOrderId;

+  private Long dataSequenceNumber = null;


Is there any particular reason to have these two as a separate block? Right now, BaseFile has only two blocks, to keep optional fields separate.

I didn't realise that pattern with the blocks because 'avroSchema' and 'fromProjectionPos/partitionType' also make their own block. Regardless, I merged the new ones with the optional block.

Are these two indeed optional? I'd say they belong more to the block with partitionSpecId. What do you think?

Yeah, giving it a second thought, they indeed aren't optional. Added them to the same block as partitionSpecId.

core/src/main/java/org/apache/iceberg/BaseFile.java

api/src/main/java/org/apache/iceberg/ContentFile.java

core/src/test/java/org/apache/iceberg/TableTestBase.java

aokolnychyi · 2023-05-09T02:15:54Z

I think this approach is correct; I had mostly minor comments.

I also wonder whether we should prohibit passing new data files with an explicitly set fileSequenceNumber, just like we don't allow new manifests with a set snapshot ID. The file sequence number must be assigned at commit.

Thoughts, @jackye1995 @RussellSpitzer @rdblue?
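
A rough sketch of the kind of guard being proposed here, for illustration only. `NewFileValidation` and `checkUnassigned` are hypothetical names, and nothing in this thread confirms that such a check was added to Iceberg; the sketch only shows the idea of rejecting a new file that already carries a file sequence number.

```java
// Hypothetical helper, not part of Iceberg.
final class NewFileValidation {
  private NewFileValidation() {}

  /**
   * Rejects a to-be-committed file that already carries a file sequence number,
   * on the assumption that the value must be assigned at commit time.
   */
  static void checkUnassigned(String path, Long fileSequenceNumber) {
    if (fileSequenceNumber != null) {
      throw new IllegalArgumentException(
          "Cannot add file with an explicit file sequence number "
              + fileSequenceNumber
              + ": "
              + path);
    }
  }
}
```

Under this sketch, a file whose sequence number is still `null` passes, and any file with a preset value fails fast before the commit is attempted.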

core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java

stevenzwu · 2023-05-09T04:46:00Z

core/src/test/java/org/apache/iceberg/TableTestBase.java

        V1Assert.assertEquals(
            "Data sequence number should default to 0", 0, entry.dataSequenceNumber().longValue());
+        V1Assert.assertEquals(
+            "Data sequence number should default to 0",


Do we need to assert on the file sequence number too?

I added a check for file sequence number.

However, I'm wondering whether the checks for dataSequenceNumber are correct. If I'm not mistaken, after a rewrite the dataSequenceNumber on the ManifestEntry can differ from the sequence number on the Snapshot. So these checks (even the existing ones) are correct only if there were no rewrites on the table. Is my understanding correct here?

@gaborkaszab, you are correct that rewrites may preserve the initial data sequence number. This method would not work for such rewrites. I believe we have other tests that pass an iterator of sequence numbers that should be used in those cases.
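
To make the distinction concrete, here is a small sketch of why an assertion that compares an entry's data sequence number with the snapshot's sequence number only holds when no rewrite has preserved the original value. `SketchEntry` and its methods are hypothetical and do not model the real ManifestEntry.

```java
// Hypothetical model of the two sequence numbers carried by a manifest entry.
final class SketchEntry {
  final long dataSequenceNumber;
  final long fileSequenceNumber;

  private SketchEntry(long dataSequenceNumber, long fileSequenceNumber) {
    this.dataSequenceNumber = dataSequenceNumber;
    this.fileSequenceNumber = fileSequenceNumber;
  }

  /** A file added by a regular commit gets the commit's sequence number for both values. */
  static SketchEntry added(long commitSequenceNumber) {
    return new SketchEntry(commitSequenceNumber, commitSequenceNumber);
  }

  /** A rewrite that preserves the data sequence number only advances the file sequence number. */
  SketchEntry rewrittenPreservingDataSeq(long commitSequenceNumber) {
    return new SketchEntry(dataSequenceNumber, commitSequenceNumber);
  }

  /** The check discussed above: does the data sequence number equal the snapshot's? */
  boolean dataSeqMatchesSnapshot(long snapshotSequenceNumber) {
    return dataSequenceNumber == snapshotSequenceNumber;
  }
}
```

In this model, an entry added at sequence number 1 and rewritten at sequence number 4 still reports a data sequence number of 1, so a comparison against the rewriting snapshot's sequence number would fail while a comparison of the file sequence number would pass.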

rdblue

Looks like a reasonable approach to this.

core/src/test/java/org/apache/iceberg/TestSnapshot.java

core/src/test/java/org/apache/iceberg/TableTestBase.java

core/src/main/java/org/apache/iceberg/BaseFile.java

aokolnychyi · 2023-05-16T00:17:40Z

Let me take another look in a bit.

aokolnychyi

The implementation looks good to me. I had one question about where to put the new variables in BaseFile. I believe @amogh-jahagirdar also had some nits. Otherwise, it should be good to go.

Co-authored-by: chenjunjiedada <jimmyjchen@tencent.com>

amogh-jahagirdar

Looks good to me, thanks @gaborkaszab!

amogh-jahagirdar · 2023-05-16T16:02:32Z

Merging since we have a few approvals and no blocking comments. Thanks @gaborkaszab for the PR, and @rdblue @aokolnychyi @stevenzwu @jackye1995 for the reviews.

gaborkaszab · 2023-05-16T16:28:33Z

Thanks for the reviews, @amogh-jahagirdar @aokolnychyi @rdblue @stevenzwu @jackye1995!

github-actions bot added API core labels May 8, 2023

gaborkaszab requested a review from aokolnychyi May 8, 2023 12:23

jackye1995 reviewed May 8, 2023

jackye1995 added this to the Iceberg 1.3.0 milestone May 8, 2023