
Conversation

@pvary
Contributor

@pvary pvary commented May 9, 2023

We found that if the current implementation of the IcebergSource faces an error during planning, it silently retries the planning again and again for as long as the error persists. The only effect on the job is that no new records are emitted, but no other alarms are raised.

This is similar to how the Flink FileSource works, but it is confusing for users.
Other sources implement error handling differently; the Kafka source, for example, fails immediately on an error. That is not desirable for our user base either, because we expect more resiliency from our jobs.

Based on our discussion with @zhen-wu2 and @gyula-fora, I have created this PR, which adds the possibility to retry failed planning and to configure the number of retries in a few different ways:

  • IcebergSource.Builder.planRetryNum - if the source is created from Java code
  • The FlinkConfiguration key connector.iceberg.plan-retry-num
  • The read option plan-retry-num

The default value is 3, which means that if the 4th consecutive planning attempt fails, the Flink job fails.
If the original behaviour (retrying indefinitely) is needed, the value -1 should be used.
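
For illustration, a minimal sketch of wiring this up from Java code, assuming the names given in this description (planRetryNum and plan-retry-num); the merged code may use slightly different names (the reviewed snippet below reads maxAllowedPlanningFailures), and the table path is a placeholder:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;

// Placeholder table location - adjust to the actual warehouse path
TableLoader tableLoader =
    TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/table");

// Fail the job after 3 consecutive planning failures (the default)
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .streaming(true)
        .planRetryNum(3) // builder method name as given in this description
        .build();

// Alternatively, the same limit can be passed through the Flink configuration key
Configuration flinkConf = new Configuration();
flinkConf.setInteger("connector.iceberg.plan-retry-num", 3);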

@pvary pvary force-pushed the retry-upstream branch 3 times, most recently from 914a2b5 to 4ac1f6d on May 11, 2023 09:08
@pvary pvary force-pushed the retry-upstream branch from 4ac1f6d to 53a5f43 on May 11, 2023 09:13
@pvary pvary force-pushed the retry-upstream branch from fe84200 to f85b1b6 on May 12, 2023 11:13
Contributor

@stevenzwu stevenzwu left a comment

this PR is very close now. left a few more nit comments.

LOG.error("Failed to discover new splits", error);
consecutiveFailures++;
if (scanContext.maxAllowedPlanningFailures() < 0
|| consecutiveFailures < scanContext.maxAllowedPlanningFailures()) {
Contributor

nit: should be <=. if maxAllowedPlanningFailures is 3, we should allow 3 consecutive failures.

Contributor Author

Changed this and the corresponding tests
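
For illustration, a hedged sketch of the adjusted check after switching to <=; the else branch and the exception message are assumptions rather than the exact merged code:

LOG.error("Failed to discover new splits", error);
consecutiveFailures++;
if (scanContext.maxAllowedPlanningFailures() < 0
    || consecutiveFailures <= scanContext.maxAllowedPlanningFailures()) {
  // A negative limit means retry forever; otherwise up to the configured number
  // of consecutive failures is tolerated and the next planning attempt retries.
} else {
  // Limit exceeded (e.g. the 4th consecutive failure with the default of 3):
  // fail the Flink job instead of silently retrying.
  throw new RuntimeException("Too many consecutive scan planning failures", error);
}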

enumeratorContext.triggerAllActions();
Assert.assertEquals(0, enumerator.snapshotState(2).pendingSplits().size());

// Trigger the planning again to recover from the failure, and we get the expected splits
Contributor

nit: avoid we in the comment. maybe like Second scan planning should succeed and discover the expected splits?

Contributor Author

Done.
Found another "we" and fixed it as well.

// Can not discover the new split with planning failures
for (int i = 0; i < expectedFailures; ++i) {
enumeratorContext.triggerAllActions();
pendingSplits = enumerator.snapshotState(2).pendingSplits();
Contributor

nit: change the checkpoint id to i.

Contributor Author

Checked the checkpoint ids and changed them - not really relevant for the tests, but still a valid point
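
For reference, a sketch of the loop with the loop index used as the checkpoint id, as suggested; the assertion line is illustrative rather than the exact test code:

// No new splits should be discovered while planning keeps failing
for (int i = 0; i < expectedFailures; ++i) {
  enumeratorContext.triggerAllActions();
  pendingSplits = enumerator.snapshotState(i).pendingSplits();
  Assert.assertEquals(0, pendingSplits.size()); // illustrative check
}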

}

@Test
public void testOriginalRetry() throws Exception {
Contributor

nit: this method name is not informative. what does original refer to? also how is this test different from testTransientPlanningFailure?

Contributor Author

This is testing that failures are ignored when the value is -1 (the original behaviour)

@stevenzwu stevenzwu merged commit 7cbde14 into apache:master May 17, 2023
@stevenzwu
Contributor

thanks @pvary for the contribution and @hililiwei for the review

@pvary
Contributor Author

pvary commented May 17, 2023

Thanks @hililiwei and @stevenzwu for the thorough review and the merge!

@pvary pvary deleted the retry-upstream branch May 17, 2023 08:28
pvary added a commit that referenced this pull request May 18, 2023
…res in IcebergSource before failing the job (#7571) to 1.16 and 1.15 (#7629)