@singhpk234
Contributor

@singhpk234 singhpk234 commented Jul 21, 2025

About the change

Mark 503 as non-retryable for Update Table, as pointed out in the discussion about some very common services like Envoy.
Details: https://lists.apache.org/thread/oqonscy1b4qlmovnjtbcohz38kgprgmq

There seems to be general alignment on treating 503 as commit state unknown, since the outcomes can be severe, potentially leading to table corruption.

@singhpk234 singhpk234 changed the title SPEC: mark 503 as non retryable error code SPEC: mark 503 as non retryable error code for Update Table Jul 21, 2025
@singhpk234 singhpk234 force-pushed the feature/503-handling branch 2 times, most recently from dd94970 to deba2c6 Compare July 21, 2025 17:59
@singhpk234
Contributor Author

@singhpk234 singhpk234 force-pushed the feature/503-handling branch from deba2c6 to 5a9a8d4 Compare July 28, 2025 14:55
Contributor

@dennishuo dennishuo left a comment

Thanks for following up on this! LGTM

I guess one thing we might consider if we're worried about the behavior change is to extract a client-configurable setting to list the error codes to consider as "UnknownState" codes on commit operations instead of only the hard-coded switch statement. But that's a tradeoff of how much complexity to expose to the caller.

Either way, I think including 503 as the default is definitely the right thing to do here, especially since #13449 means pure reads won't suffer any reduction in availability during temporary failures.
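
A minimal sketch of what such a client-configurable setting could look like. The class name `UnknownStateCodes` and the property name `commit.unknown-state-codes` are hypothetical and are not part of Iceberg; the default set assumes the 500, 502, 503, 504 codes discussed in this PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Iceberg: turns a comma-separated client
// property into the set of HTTP status codes that a commit error handler
// would treat as "commit state unknown".
final class UnknownStateCodes {
  // Hypothetical property name, used for illustration only.
  static final String PROPERTY = "commit.unknown-state-codes";

  // Assumed default, based on the codes discussed in this PR.
  static final Set<Integer> DEFAULT =
      Collections.unmodifiableSet(new TreeSet<>(Arrays.asList(500, 502, 503, 504)));

  private UnknownStateCodes() {}

  // Returns the default set when the property is missing or blank.
  static Set<Integer> parse(String value) {
    if (value == null || value.trim().isEmpty()) {
      return DEFAULT;
    }
    Set<Integer> codes = new TreeSet<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        codes.add(Integer.parseInt(trimmed));
      }
    }
    return Collections.unmodifiableSet(codes);
  }

  // True when the given status should be surfaced as commit state unknown.
  static boolean isUnknownState(Set<Integer> codes, int status) {
    return codes.contains(status);
  }
}
```

The tradeoff mentioned above still applies: a setting like this lets a caller drop 503 from the set, which reintroduces the risk the default is meant to avoid.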

Contributor

@huaxingao huaxingao left a comment

LGTM

}
503:
$ref: '#/components/responses/ServiceUnavailableResponse'
description:
Contributor

I'm not sure I agree with this change being localized here. Shouldn't we update the #/components/responses/ServiceUnavailableResponse definition for all usages of 503? I don't see why it would only apply here.

Contributor Author

I agree. My understanding was that we only have alignment on 503 in the context of update table, since it can lead to corruption with some fairly common tools. If we are fine interpreting 503 in general as a status code where some partial processing may have been done (which doesn't matter for idempotent requests), I'm happy to update it centrally.

Contributor

I guess this makes sense as most of the other endpoints don't have a side-effect if retried on 503, so it shouldn't be a problem to assume that they can retry.

throw new CommitFailedException("Commit failed: %s", error.message());
case 500:
case 502:
case 503:
Contributor

Should we just change this so that any 5XX code throws CommitStateUnknownException?

Contributor Author

We can't do it for all 5xx: 501 means Not Implemented, and IMHO we can't say the commit state is unknown in that case. Hence 500, 502, 503, and 504 are what we have. Am I missing something?

Contributor

I think the concern is that we're being conservative on what we consider erroneous and very aggressive about cleaning up in these cases. We've been through this same issue multiple times now around 5XX codes and if we consider the commit state unknown for all 5XX codes, the only downside is that we leave more files around. The side-effect of being too aggressive on cleanup is that we break a table, which is the worst option.
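
The two positions in this thread can be compared side by side in a rough sketch. `CommitStatusPolicy` and its `Outcome` enum are hypothetical names and this is not Iceberg's actual error handler; treating every non-listed code as a definite failure is a simplification made only for the comparison.

```java
// Hypothetical sketch, not Iceberg's error handler: contrasts an explicit list
// of "commit state unknown" codes with treating every 5xx code that way.
final class CommitStatusPolicy {
  enum Outcome {
    DEFINITE_FAILURE, // commit is known not to have applied; cleanup is safe
    STATE_UNKNOWN     // commit may have applied; written files must be kept
  }

  private CommitStatusPolicy() {}

  // Explicit list named in the thread: 500, 502, 503, 504.
  // Simplification: every other code is treated as a definite failure here.
  static Outcome explicitList(int status) {
    switch (status) {
      case 500:
      case 502:
      case 503:
      case 504:
        return Outcome.STATE_UNKNOWN;
      default:
        return Outcome.DEFINITE_FAILURE;
    }
  }

  // Alternative raised in review: any 5xx code means the state is unknown.
  static Outcome any5xx(int status) {
    return status >= 500 && status <= 599
        ? Outcome.STATE_UNKNOWN
        : Outcome.DEFINITE_FAILURE;
  }

  // Cleanup of newly written files is only safe after a definite failure.
  static boolean safeToCleanUp(Outcome outcome) {
    return outcome == Outcome.DEFINITE_FAILURE;
  }
}
```

The policies differ only for 5xx codes outside the explicit list, such as 501: the broader rule leaves more files behind, while the explicit list allows cleanup for that code.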

@danielcweeks
Contributor

@singhpk234 we probably should host a quick vote on this since we are changing the spec.

@singhpk234
Contributor Author

singhpk234 commented Aug 14, 2025

Thank you for the feedback, @danielcweeks!

Just to double confirm: we want to vote on 503 being marked as non-retryable when the Retry-After header is not sent.

Update:
503 is marked as non-retryable when the Retry-After header is not sent, for non-idempotent requests.
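
One reading of the rule above, as a small sketch. `Retry503` is a hypothetical name, not an Iceberg class. Two points are assumptions of this sketch and are not stated in the thread: idempotent requests stay retryable on 503, and a non-idempotent request becomes retryable when the server sends Retry-After.

```java
import java.util.Optional;

// Hypothetical sketch of the proposed 503 rule; not taken from Iceberg.
final class Retry503 {
  private Retry503() {}

  // idempotent: whether repeating the request has no additional side effects.
  // retryAfter: value of the Retry-After response header, if the server sent one.
  static boolean isRetryable(boolean idempotent, Optional<String> retryAfter) {
    if (idempotent) {
      // Assumption: repeating an idempotent request is harmless.
      return true;
    }
    // Non-idempotent (for example a table commit): retry only when the server
    // explicitly signals it via Retry-After; otherwise the 503 is non-retryable.
    return retryAfter.isPresent();
  }
}
```

Under this reading, a commit that receives a bare 503 is not retried and would instead surface as commit state unknown.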

@singhpk234 singhpk234 force-pushed the feature/503-handling branch from afaaaef to 57454c0 Compare August 15, 2025 17:52
@stevenzwu stevenzwu changed the title SPEC: mark 503 as non retryable error code for Update Table Spec, Core: mark 503 as non retryable error code for Update Table Aug 15, 2025
@singhpk234 singhpk234 force-pushed the feature/503-handling branch from 57454c0 to 4d891a0 Compare August 15, 2025 18:45
Contributor

@stevenzwu stevenzwu left a comment

LGTM

@stevenzwu stevenzwu added this to the Iceberg 1.10.0 milestone Aug 15, 2025
@mrcnc
Contributor

mrcnc commented Aug 21, 2025

Can this be added into the 1.10.0 release?

@singhpk234 singhpk234 merged commit c205503 into apache:main Aug 21, 2025
43 checks passed
@singhpk234 singhpk234 deleted the feature/503-handling branch August 21, 2025 22:59
@singhpk234
Contributor Author

Hey @mrcnc, this change will be available in 1.10.

10 participants