Description
Apache Iceberg version
1.10.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
Currently, HTTP 503 responses are not retried, yet they are still classified as cleanable failures for CreateTable transactions (stage create + updateTable request). This can lead to table corruption in scenarios where the commit is successfully persisted by the catalog, but an intermediate component returns a 503 to the client.
errorHandler = ErrorHandlers.tableErrorHandler(); // throws NoSuchTableException
In our setup, Spark communicates with the catalog through Envoy (acting as a reverse proxy). When Envoy returns a 503 due to a transient downstream issue, the client assumes the commit failed and proceeds with cleanup. However, the catalog may have already committed the transaction successfully. As a result, valid manifest files can be incorrectly cleaned up, leaving the table in a corrupted state.
This behavior makes 503 responses unsafe to treat as cleanable failures, especially in deployments with proxies between the client and the catalog.
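To make the distinction concrete, here is a minimal, self-contained sketch of the classification I have in mind. It is not Iceberg code: `CommitOutcomeSketch`, `Outcome`, `classify`, and `safeToDeleteNewFiles` are hypothetical names, and the set of status codes treated as "unknown state" is my assumption about what a commit-aware handler should do, not a description of what `tableCommitErrorHandler` currently maps.

```java
// Hypothetical sketch, not Iceberg's implementation.
final class CommitOutcomeSketch {

    // Illustrative outcome categories; these are not Iceberg types.
    enum Outcome { SUCCESS, CLEANABLE_FAILURE, UNKNOWN_STATE }

    // Classify an HTTP status returned for a commit request.
    // Assumption: 5xx codes that a proxy can emit after the catalog has
    // already persisted the commit are treated as "unknown", not "failed".
    static Outcome classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.SUCCESS;
        }
        switch (httpStatus) {
            case 500:
            case 502:
            case 503:
            case 504:
                return Outcome.UNKNOWN_STATE;
            default:
                // e.g. 404 or 409: the catalog explicitly rejected the commit
                return Outcome.CLEANABLE_FAILURE;
        }
    }

    // New metadata and manifest files may only be deleted when the
    // commit is known to have failed.
    static boolean safeToDeleteNewFiles(Outcome outcome) {
        return outcome == Outcome.CLEANABLE_FAILURE;
    }

    public static void main(String[] args) {
        for (int status : new int[] {200, 409, 503}) {
            Outcome outcome = classify(status);
            System.out.println(status + " -> " + outcome
                + ", delete new files: " + safeToDeleteNewFiles(outcome));
        }
    }
}
```

With this kind of classification, a 503 from Envoy would skip cleanup and surface an "unknown commit state" error to the caller instead of deleting files that the catalog may already reference.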
Should we use tableCommitErrorHandler instead of tableErrorHandler for the CREATE updateType as well, and not only for REPLACE and SIMPLE?
Previous related work:
#13619 and thread
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time