
Conversation

@pvary
Contributor

@pvary pvary commented May 9, 2023

We found that if the current implementation of the IcebergSource faces an error during planning, it silently retries the planning again and again for as long as the error persists. The only effect on the job is that no new records are emitted, but no other alarms are raised.

This is similar to how the Flink FileSource works, but it is confusing for users.
Other sources implement error handling differently; the Kafka source, for example, fails immediately on an error. That is not desirable for our user base either, because we expect more resiliency from our jobs.

Based on our discussion with @zhen-wu2 and @gyula-fora, I have created this PR, which adds the possibility to retry failed planning and to configure the number of retries in a few different ways:

  • IcebergSource.Builder.planRetryNum - if the source is created from Java code
  • The FlinkConfiguration key connector.iceberg.plan-retry-num
  • The read option plan-retry-num

The default value is 3, which means that if the 4th consecutive planning attempt fails, the Flink job fails.
If the original behaviour (retrying indefinitely) is needed, the value -1 should be used.
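
For illustration, a minimal sketch of wiring this up from Java code, assuming the names given in this description (planRetryNum and plan-retry-num); the merged code may use slightly different names (the reviewed snippet below reads maxAllowedPlanningFailures), and the table path is a placeholder:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;

// Placeholder table location - adjust to the actual warehouse path
TableLoader tableLoader =
    TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/table");

// Fail the job after 3 consecutive planning failures (the default)
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .streaming(true)
        .planRetryNum(3) // builder method name as given in this description
        .build();

// Alternatively, the same limit can be passed through the Flink configuration key
Configuration flinkConf = new Configuration();
flinkConf.setInteger("connector.iceberg.plan-retry-num", 3);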

@pvary pvary force-pushed the retry-upstream branch 3 times, most recently from 914a2b5 to 4ac1f6d on May 11, 2023 09:08
@pvary pvary force-pushed the retry-upstream branch from 4ac1f6d to 53a5f43 on May 11, 2023 09:13
@pvary pvary force-pushed the retry-upstream branch from fe84200 to f85b1b6 on May 12, 2023 11:13
Contributor

@stevenzwu stevenzwu left a comment

this PR is very close now. left a few more nit comments.

LOG.error("Failed to discover new splits", error);
consecutiveFailures++;
if (scanContext.maxAllowedPlanningFailures() < 0
|| consecutiveFailures < scanContext.maxAllowedPlanningFailures()) {
Contributor

nit: should be <=. if maxAllowedPlanningFailures is 3, we should allow 3 consecutive failures.

Contributor Author

Changed this and the corresponding tests
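
For illustration, a hedged sketch of the adjusted check after switching to <=; the else branch and the exception message are assumptions rather than the exact merged code:

LOG.error("Failed to discover new splits", error);
consecutiveFailures++;
if (scanContext.maxAllowedPlanningFailures() < 0
    || consecutiveFailures <= scanContext.maxAllowedPlanningFailures()) {
  // A negative limit means retry forever; otherwise up to the configured number
  // of consecutive failures is tolerated and the next planning attempt retries.
} else {
  // Limit exceeded (e.g. the 4th consecutive failure with the default of 3):
  // fail the Flink job instead of silently retrying.
  throw new RuntimeException("Too many consecutive scan planning failures", error);
}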

enumeratorContext.triggerAllActions();
Assert.assertEquals(0, enumerator.snapshotState(2).pendingSplits().size());

// Trigger the planning again to recover from the failure, and we get the expected splits
Contributor

nit: avoid we in the comment. maybe like Second scan planning should succeed and discover the expected splits?

Contributor Author

Done.
Found another "we" and fixed it as well.

// Can not discover the new split with planning failures
for (int i = 0; i < expectedFailures; ++i) {
enumeratorContext.triggerAllActions();
pendingSplits = enumerator.snapshotState(2).pendingSplits();
Contributor

nit: change the checkpoint id to i.

Contributor Author

Checked the checkpoint ids and changed them - not really relevant for the tests, but still a valid point
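
For reference, a sketch of the loop with the loop index used as the checkpoint id, as suggested; the assertion line is illustrative rather than the exact test code:

// No new splits should be discovered while planning keeps failing
for (int i = 0; i < expectedFailures; ++i) {
  enumeratorContext.triggerAllActions();
  pendingSplits = enumerator.snapshotState(i).pendingSplits();
  Assert.assertEquals(0, pendingSplits.size()); // illustrative check
}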

}

@Test
public void testOriginalRetry() throws Exception {
Contributor

nit: this method name is not informative. what does original refer to? also how is this test different from testTransientPlanningFailure?

Contributor Author

This is testing that failures are ignored when the value is -1 (the original behaviour)

@stevenzwu stevenzwu merged commit 7cbde14 into apache:master May 17, 2023
@stevenzwu
Contributor

thanks @pvary for the contribution and @hililiwei for the review

@pvary
Contributor Author

pvary commented May 17, 2023

Thanks @hililiwei and @stevenzwu for the thorough review and the merge!

@pvary pvary deleted the retry-upstream branch May 17, 2023 08:28
pvary added a commit that referenced this pull request May 18, 2023
…res in IcebergSource before failing the job (#7571) to 1.16 and 1.15 (#7629)