Contributor

@pvary pvary commented May 19, 2023

Flink users often rely on the IcebergSource reading Iceberg snapshots in the same order in which they were committed. This assumption can break on job start or job restart, when multiple snapshots are read in a single planning cycle.

To ensure that the splits are ordered as expected, the PR adds FileSequenceNumberBasedSplitAssignerFactory, which orders the splits by ContentFile.fileSequenceNumber().
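
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code added by the PR: `SequenceOrderedAssigner`, `Split`, `onDiscoveredSplits`, and `getNext` are hypothetical names, and the sequence number is assumed to increase with commit order. Pending splits are kept in a priority queue keyed on the file sequence number, so splits from an older snapshot are handed out first even when one planning cycle returns several snapshots at once.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: Split and SequenceOrderedAssigner are hypothetical
// stand-ins, not the Iceberg classes added by this PR.
final class SequenceOrderedAssigner {

  /** Minimal split model: a file path plus a sequence number assumed to increase with commit order. */
  static final class Split {
    final String path;
    final long fileSequenceNumber;

    Split(String path, long fileSequenceNumber) {
      this.path = path;
      this.fileSequenceNumber = fileSequenceNumber;
    }
  }

  // Pending splits, smallest file sequence number first.
  // PriorityQueue gives no ordering guarantee among equal sequence numbers.
  private final PriorityQueue<Split> pending =
      new PriorityQueue<>(Comparator.comparingLong((Split s) -> s.fileSequenceNumber));

  /** Called when a planning cycle discovers splits, possibly from several snapshots at once. */
  synchronized void onDiscoveredSplits(List<Split> splits) {
    pending.addAll(splits);
  }

  /** Returns the pending split with the lowest sequence number, or null if none is pending. */
  synchronized Split getNext() {
    return pending.poll();
  }

  public static void main(String[] args) {
    SequenceOrderedAssigner assigner = new SequenceOrderedAssigner();
    // One planning cycle returns splits from three snapshots, out of commit order.
    assigner.onDiscoveredSplits(Arrays.asList(
        new Split("snapshot-3.parquet", 3L),
        new Split("snapshot-1.parquet", 1L),
        new Split("snapshot-2.parquet", 2L)));

    Split next;
    while ((next = assigner.getNext()) != null) {
      System.out.println(next.fileSequenceNumber + " " + next.path);
    }
  }
}
```

A heap only orders by the key it is given, so splits that share a sequence number come out in an unspecified relative order in this sketch; a secondary comparator would be needed if that matters.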

@github-actions github-actions bot added the flink label May 19, 2023
@pvary pvary requested a review from stevenzwu May 19, 2023 21:08

/** Simple assigner only tracks unassigned splits */
@Override
public synchronized Collection<IcebergSourceSplitState> state() {
Contributor

If we go with the new approach to watermark alignment, IcebergSourceSplitState is probably not needed. This can be Collection<IcebergSourceSplit> pendingSplits(). It can be a separate PR, so as not to distract from the purpose of this PR.

Contributor Author

Wouldn't we still need the state to store the current read positions at savepoints/checkpoints?

Contributor

The IcebergSourceSplitState class just wraps IcebergSourceSplit with a status flag (such as UNASSIGNED or ASSIGNED). The flag was introduced for the watermark alignment assigner, which needs to track the ASSIGNED (and not yet completed) splits in addition to the pending UNASSIGNED splits. With the new approach of leveraging Flink for watermark tracking/aggregation, we only need to track pending UNASSIGNED splits.
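
As a rough illustration of the point above (hypothetical types throughout, not the actual Iceberg classes): once an assigner tracks only pending splits, a wrapper that pairs each split with a status flag always reports UNASSIGNED, so returning the bare splits conveys the same information. `PendingOnlyAssigner`, `Split`, `SplitState`, and `Status` below are names made up for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: all types here are hypothetical stand-ins,
// not the Iceberg classes discussed in this thread.
final class PendingOnlyAssigner {

  static final class Split {
    final String id;

    Split(String id) {
      this.id = id;
    }
  }

  enum Status { UNASSIGNED, ASSIGNED }

  /** Wrapper shape described above: a split plus a status flag. */
  static final class SplitState {
    final Split split;
    final Status status;

    SplitState(Split split, Status status) {
      this.split = split;
      this.status = status;
    }
  }

  // Only pending (unassigned) splits are tracked; a split is forgotten once handed out.
  private final Deque<Split> pending = new ArrayDeque<>();

  synchronized void add(Split split) {
    pending.add(split);
  }

  synchronized Split getNext() {
    return pending.poll();
  }

  /** Wrapper-based state: the flag is always UNASSIGNED here, so it carries no information. */
  synchronized Collection<SplitState> state() {
    List<SplitState> result = new ArrayList<>();
    for (Split split : pending) {
      result.add(new SplitState(split, Status.UNASSIGNED));
    }
    return result;
  }

  /** Equivalent state without the wrapper. */
  synchronized Collection<Split> pendingSplits() {
    return new ArrayList<>(pending);
  }

  public static void main(String[] args) {
    PendingOnlyAssigner assigner = new PendingOnlyAssigner();
    assigner.add(new Split("split-1"));
    assigner.add(new Split("split-2"));
    assigner.getNext(); // split-1 is handed out and no longer tracked

    for (SplitState state : assigner.state()) {
      System.out.println(state.split.id + " " + state.status);
    }
    System.out.println(assigner.pendingSplits().size() + " pending");
  }
}
```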

@stevenzwu stevenzwu merged commit b897cd1 into apache:master Jun 16, 2023
@stevenzwu
Contributor

Thanks @pvary for the contribution.

@pvary pvary deleted the seq_order branch June 23, 2023 05:40
@pvary
Contributor Author

pvary commented Jun 23, 2023

Thanks @stevenzwu for the review!
Hopefully I will have some time to backport this to the other Flink versions soon.

pvary pushed a commit to pvary/iceberg that referenced this pull request Jun 23, 2023
pvary added a commit that referenced this pull request Jun 26, 2023
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024