Flink: Add retry limit for IcebergSource continuous split planning errors #7571
Conversation
stevenzwu left a comment:
this PR is very close now. left a few more nit comments.
LOG.error("Failed to discover new splits", error);
consecutiveFailures++;
if (scanContext.maxAllowedPlanningFailures() < 0
    || consecutiveFailures < scanContext.maxAllowedPlanningFailures()) {
nit: should be <=. if maxAllowedPlanningFailures is 3, we should allow 3 consecutive failures.
Changed this and the corresponding tests
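For readers following the thread, here is a self-contained sketch of the `<=` semantics agreed on above. The `PlanningFailureLimiter` class and its method names are hypothetical, used only to illustrate the counting logic; they are not the code that was merged into the enumerator.

```java
// Illustrative only: with maxAllowedPlanningFailures = 3, the first three consecutive
// failures are tolerated and the fourth one fails the job; a negative value means retry forever.
public class PlanningFailureLimiter {
  private final int maxAllowedPlanningFailures;
  private int consecutiveFailures = 0;

  public PlanningFailureLimiter(int maxAllowedPlanningFailures) {
    this.maxAllowedPlanningFailures = maxAllowedPlanningFailures;
  }

  /** Records a planning failure and throws once the configured limit is exceeded. */
  public void onPlanningFailure(Throwable error) {
    consecutiveFailures++;
    if (maxAllowedPlanningFailures < 0 || consecutiveFailures <= maxAllowedPlanningFailures) {
      // Tolerated: the enumerator would log the error and retry on the next discovery cycle.
      return;
    }

    throw new RuntimeException(
        "Failed to discover new splits " + consecutiveFailures + " consecutive times", error);
  }

  /** A successful planning cycle resets the counter. */
  public void onPlanningSuccess() {
    consecutiveFailures = 0;
  }
}
```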
enumeratorContext.triggerAllActions();
Assert.assertEquals(0, enumerator.snapshotState(2).pendingSplits().size());

// Trigger the planning again to recover from the failure, and we get the expected splits
nit: avoid we in the comment. maybe like Second scan planning should succeed and discover the expected splits?
Done. Found another "we" and fixed it as well.
// Can not discover the new split with planning failures
for (int i = 0; i < expectedFailures; ++i) {
  enumeratorContext.triggerAllActions();
  pendingSplits = enumerator.snapshotState(2).pendingSplits();
nit: change the checkpoint id to i.
Checked the checkpoint ids and changed them - not really relevant for the tests, but still a valid point
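As a minimal sketch of what that suggestion amounts to (not the merged test code; the assertion mirrors the one shown earlier in the thread):

```java
// Sketch only: same loop as in the snippet above, but the checkpoint id passed to
// snapshotState() follows the loop counter instead of a constant.
for (int i = 0; i < expectedFailures; ++i) {
  enumeratorContext.triggerAllActions();
  pendingSplits = enumerator.snapshotState(i).pendingSplits();
  Assert.assertEquals(0, pendingSplits.size());
}
```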
}

@Test
public void testOriginalRetry() throws Exception {
nit: this method name is not informative. what does original refer to? also how is this test different with testTransientPlanningFailure?
This is testing that we are ignoring failures (-1)
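For illustration, a hypothetical test of that "-1 means ignore failures" behaviour, written against the illustrative `PlanningFailureLimiter` from the sketch earlier in this thread rather than the real enumerator:

```java
import org.junit.Test;

public class TestUnlimitedPlanningRetries {
  @Test
  public void testFailuresIgnoredWhenLimitIsNegative() {
    // A negative limit disables the failure cap entirely.
    PlanningFailureLimiter limiter = new PlanningFailureLimiter(-1);
    for (int i = 0; i < 100; ++i) {
      limiter.onPlanningFailure(new RuntimeException("transient planning error"));
    }
    // Reaching this point without an exception is the assertion: -1 never fails the job.
  }
}
```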
thanks @pvary for the contribution and @hililiwei for the review
…res in IcebergSource before failing the job (apache#7571) to 1.16 and 1.15
Thanks @hililiwei and @stevenzwu for the thorough review and the merge!
We found that when the current implementation of the IcebergSource hits a downstream error, it silently retries the planning again and again for as long as the error persists. The only visible effect on the job is that no new records are emitted - no other alarms are raised.
This is similar to how the Flink FileSource works, but it is confusing for users.
Other sources implement error handling differently; Kafka, for example, fails immediately on an error. That is also not desirable for our user base, because we expect more resiliency from our jobs.
Based on our discussion with @zhen-wu2 and @gyula-fora, I have created this PR, which adds the possibility to retry the failed planning and to configure the number of retries in a few different ways:
- `IcebergSource.Builder.planRetryNum` - if the source is created from Java code
- `connector.iceberg.plan-retry-num`
- `plan-retry-num`

The default value is 3, which means that if the 4th planning attempt fails, then the Flink job fails.
If the original behaviour is needed, then the value `-1` should be used.
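A rough sketch of the Java-code path, using the builder method name from this description (the option may have been renamed during review, and `tableLoader` is assumed to be an already configured `TableLoader`):

```java
// Sketch only: builder-based configuration of the planning retry limit.
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .streaming(true)
        .planRetryNum(3) // fail the job once a fourth consecutive planning attempt fails
        .build();
// The same limit is exposed as the read option "plan-retry-num"
// (or "connector.iceberg.plan-retry-num" through the connector options), per the list above.
```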