Flink: Add possibility of ordering the splits based on the file sequence number #7661
.../v1.17/flink/src/main/java/org/apache/iceberg/flink/source/assigner/SortedSplitAssigner.java
/** Simple assigner only tracks unassigned splits */
@Override
public synchronized Collection<IcebergSourceSplitState> state() {
If we go with the new approach to watermark alignment, IcebergSourceSplitState is probably no longer needed, and this could become Collection<IcebergSourceSplit> pendingSplits(). That can be a separate PR so it doesn't distract from the purpose of this one.
Wouldn't we still need the state to store the current read positions at savepoints/checkpoints?
The IcebergSourceSplitState class just wraps IcebergSourceSplit with a status flag (such as UNASSIGNED or ASSIGNED). The flag was introduced for the watermark alignment assigner, which needs to track the ASSIGNED (and not yet completed) splits in addition to the pending UNASSIGNED splits. With the new approach of leveraging Flink for watermark tracking/aggregation, we only need to track pending UNASSIGNED splits.
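To make the difference concrete, here is a minimal sketch of the two shapes being discussed. It does not use the Iceberg classes: `SketchStatus`, `SketchSplitState`, and `PendingOnlyAssigner` are hypothetical stand-ins, and splits are reduced to plain string ids. It only illustrates that when every tracked split is UNASSIGNED, the status wrapper carries no extra information and a plain collection of pending splits would be enough.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;

/** Hypothetical status flag, mirroring the UNASSIGNED/ASSIGNED idea from the review. */
enum SketchStatus { UNASSIGNED, ASSIGNED }

/** Hypothetical wrapper: a split (here just an id) plus its status. */
final class SketchSplitState {
  final String splitId;
  final SketchStatus status;

  SketchSplitState(String splitId, SketchStatus status) {
    this.splitId = splitId;
    this.status = status;
  }
}

/** Hypothetical assigner that only tracks pending (unassigned) splits. */
final class PendingOnlyAssigner {
  private final Queue<String> pending = new ArrayDeque<>();

  synchronized void onDiscoveredSplits(Collection<String> splitIds) {
    pending.addAll(splitIds);
  }

  /** Returns the next pending split id, or null if none is left. */
  synchronized String getNext() {
    return pending.poll();
  }

  /** Wrapper-based view: every entry is UNASSIGNED, so the flag adds nothing. */
  synchronized Collection<SketchSplitState> state() {
    List<SketchSplitState> result = new ArrayList<>();
    for (String id : pending) {
      result.add(new SketchSplitState(id, SketchStatus.UNASSIGNED));
    }
    return result;
  }

  /** Simplified view suggested in the review: just the pending splits. */
  synchronized Collection<String> pendingSplits() {
    return new ArrayList<>(pending);
  }
}
```

Both `state()` and `pendingSplits()` describe the same set of splits here; the second simply drops the constant flag.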
Thanks @pvary for the contribution.
Thanks @stevenzwu for the review!
Flink users often depend on the IcebergSource reading Iceberg snapshots in the same order as they were committed. This assumption can break on job start or job restart, when multiple snapshots are read in one planning cycle.
To ensure that the splits are ordered as expected, the PR adds FileSequenceNumberBasedSplitAssignerFactory, where the splits are ordered based on ContentFile.fileSequenceNumber().
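The ordering idea can be sketched without any Iceberg or Flink dependency. The example below is an illustration only, not the PR's implementation: `SketchSplit` and `OrderedPendingSplits` are hypothetical names, each split is assumed to carry a single file sequence number, and breaking ties by split id is an assumption made here to keep the order deterministic.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a source split; not the Iceberg class. */
final class SketchSplit {
  final String id;
  final long fileSequenceNumber;

  SketchSplit(String id, long fileSequenceNumber) {
    this.id = id;
    this.fileSequenceNumber = fileSequenceNumber;
  }
}

/** Hands out pending splits with the lowest file sequence number first. */
final class OrderedPendingSplits {
  // Assumption: ties on sequence number are broken by split id.
  private static final Comparator<SketchSplit> BY_SEQUENCE_NUMBER =
      Comparator.comparingLong((SketchSplit s) -> s.fileSequenceNumber)
          .thenComparing(s -> s.id);

  private final PriorityQueue<SketchSplit> pending =
      new PriorityQueue<>(BY_SEQUENCE_NUMBER);

  synchronized void add(Collection<SketchSplit> splits) {
    pending.addAll(splits);
  }

  /** Returns the pending split with the smallest sequence number, or null. */
  synchronized SketchSplit next() {
    return pending.poll();
  }
}

class SplitOrderingSketch {
  public static void main(String[] args) {
    OrderedPendingSplits queue = new OrderedPendingSplits();
    // Splits from several snapshots arrive in one planning cycle, out of order.
    queue.add(Arrays.asList(
        new SketchSplit("c", 3L), new SketchSplit("a", 1L), new SketchSplit("b", 2L)));

    SketchSplit split;
    while ((split = queue.next()) != null) {
      System.out.println(split.id + " (sequence number " + split.fileSequenceNumber + ")");
    }
  }
}
```

A priority queue keeps the ordering intact even when later planning cycles add more splits before earlier ones have been handed out, which a one-time sort of the first batch would not.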