-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Add REST API for committing changes against multiple tables #7569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ee59679 to
143bf62
Compare
core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java
Outdated
Show resolved
Hide resolved
fa5dcc7 to
353277d
Compare
open-api/rest-catalog-open-api.yaml
Outdated
| items: | ||
| $ref: '#/components/schemas/TableUpdate' | ||
|
|
||
| CommitTransactionTableRequest: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CommitTableRequest does already exist in this file unfortunately, hence naming it CommitTransactionTableRequest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could refactor to work around this. Could we just use a list of CommitTableRequest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the existing CommitTableRequest doesn't have an identifier field unfortunately
353277d to
8c17db8
Compare
core/src/main/java/org/apache/iceberg/catalog/BaseSessionCatalog.java
Outdated
Show resolved
Hide resolved
| $ref: '#/components/responses/UnauthorizedResponse' | ||
| 403: | ||
| $ref: '#/components/responses/ForbiddenResponse' | ||
| 404: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this make sense with multiple tables? The route exists. Maybe this is a bad request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only case where I thought it might make sense to throw a 404 is when the server gets changes for a table that doesn't exist (anymore). Alternatively we could handle this case via a general CommitFailedException, but CommitFailedException is retryable and I don't think we'd want to retry in this particular case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm undecided. On one hand, that would already work and we wouldn't need to find a response code for "concurrent delete" that avoids the retry. On the other, the route doesn't represent that table.
I think I'm inclined to go with what you have here right now. It makes sense and we can always deprecate its use later.
b3fcfd3 to
6a7f07a
Compare
| public interface TableCommit { | ||
| TableIdentifier identifier(); | ||
|
|
||
| TableMetadata base(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I noted above, I think it would be better to use requirements and updates instead of mixing the updates with a TableMetadata.
I think there is a drawback to that approach, though. Any catalog that commits using metadata location rather than requirements and updates would be harder to update for multi-table transactions because those require the base metadata and the new metadata.
Overall, I think I would still prefer the requirements and updates. There are a couple of reasons:
- Implementations of
TableOperationsalmost always refresh immediately before attempting a commit anyway. Loading table metadata should not be a performance problem. - Some operations don't refresh table metadata before attempting to commit because they do not retry. For those cases, we have to address that they are not normally retried and use generic retry logic. That's what requirements/updates already do.
For a concrete example of point 2 above, consider UpdateSchema that does not retry. If the table changes concurrently, then the schema update fails. That's because the operation doesn't validate that it can still apply the schema changes. However, the REST protocol does provide a way to validate a schema update can still be applied, by sending an assert-last-assigned-field-id requirement.
If we were to try to perform a schema update using a multi-table commit method, that method doesn't know what is update is happening and will always retry. To make that retry safe, we should use the requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree with you on just using requirements and updates in TableCommit due to the points you mentioned. The only difficulty is that requirements are currently at the rest-layer and we'd have to refactor them out to make them generally available/usable. I didn't want to overcomplicate this PR with such a refactoring and decided to keep the base metadata in the class for now, which we use to build the requirements from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is the primary blocker, so I recommend we get started on the refactor and get some of the other pieces of this PR (like REST updates and request objects) in.
core/src/main/java/org/apache/iceberg/catalog/BaseSessionCatalog.java
Outdated
Show resolved
Hide resolved
| * not do any other conflict detection. Therefore, it does not guarantee true transactional | ||
| * atomicity, which is left to the implementation details of a REST server. | ||
| */ | ||
| public static void commitTransaction(Catalog catalog, CommitTransactionRequest request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this method should check whether the underlying catalog implements commitTransaction. If it does, then it should delegate to the catalog. If not, it should throw an exception.
The CatalogHandlers act as a reference implementation for the REST protocol, so I think it would be problematic if we included a method that took an atomic multi-table commit and implemented it as a sequence of individual commits.
| } | ||
|
|
||
| @Value.Immutable | ||
| interface CommitTableRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems confusing to have a CommitTableRequest and an UpdateTableRequest that are basically the same thing but with or without the identifier. Then there is also the naming problem in the spec. Those issues are a bit of a red flag that we need to address duplication.
I think there are two options. First, we could reuse the UpdateTableRequest directly and pull the identifier to a higher level, like this:
CommitTransactionRequest:
type: array
items:
$ref: '#/components/schemas/TransactionCommit'
TransactionCommit:
type: object
required:
- identifier
- commit
properties:
identifier:
$ref: '#/components/schemas/TableIdentifier'
request:
$ref: '#/components/schemas/CommitTableRequest'
CommitTableRequest:
type: object
required:
- requirements
- updates
properties:
requirements:
type: array
items:
$ref: '#/components/schemas/TableRequirement'
updates:
type: array
items:
$ref: '#/components/schemas/TableUpdate'That's okay, but still fairly awkward. The next option is to reuse the existing schema directly and just add an optional identifier field:
CommitTransactionRequest:
type: array
items:
description: Each table commit request must provide an `identifier`
$ref: '#/components/schemas/CommitTableRequest'
CommitTableRequest:
type: object
required:
- requirements
- updates
properties:
identifier:
description: Table identifier to update; must be present for CommitTransactionRequest
$ref: '#/components/schemas/TableIdentifier'
requirements:
type: array
items:
$ref: '#/components/schemas/TableRequirement'
updates:
type: array
items:
$ref: '#/components/schemas/TableUpdate'I prefer the second option. It's not a problem to add an optional field and ensure it is set when the object is used in a transaction. We'd also want to check that the identifier is either not set or is set to the same table in a normal table update.
| for (TableCommit commit : commits) { | ||
| UpdateTableRequest.Builder updateTableBuilder = UpdateTableRequest.builderFor(commit.base()); | ||
| commit.changes().forEach(updateTableBuilder::update); | ||
| UpdateTableRequest updateTableRequest = updateTableBuilder.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we may need to refactor how we produce requirements. This would not work for non-REST catalogs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I agree, we'd have to pull them out of the REST layer and make them work for non-REST catalogs
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java
Outdated
Show resolved
Hide resolved
| (BaseTransaction.TransactionTable) transaction.table(); | ||
|
|
||
| // this performs validations and makes temporary commits that are in-memory | ||
| commit(txTable.operations(), updateTableRequest); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why you'd want to use a transaction to get the bulk of the work done before committing, but I don't think this actually works because you're committing the transactions sequentially below.
I think to test we need to pass the commitTransaction call through and implement it either for an InMemoryCatalog or JDBC.
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great progress, @nastra! I think we can get this in fairly quickly. Thanks for getting this ready!
ae56b38 to
794744c
Compare
core/src/main/java/org/apache/iceberg/rest/requests/UpdateTableRequest.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/requests/UpdateTableRequest.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/requests/UpdateTableRequest.java
Show resolved
Hide resolved
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra, can you move the REST spec updates and request objects (with tests) to a different PR? I think those are about ready to go and this is currently blocked by producing requirements and refactoring that code. That way we can get some of this in rather than waiting.
794744c to
d7abbd5
Compare
d7abbd5 to
90e88db
Compare
90e88db to
f3c9e0c
Compare
| } | ||
|
|
||
| UpdateTableRequest( | ||
| public UpdateTableRequest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is required to be public so that it can be used in RESTSessionCatalog.
It might be also worth adding a Builder to UpdateTableRequest so that we don't have to make it public here.
In this case we have two options:
- write a manual builder
- generate a builder from the constructor definition (I've opened Core: Generate Builder for UpdateTableRequest #7838 to indicate how this would look like)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have a create method like TableCommit? Or one that accepts a TableCommit and returns the correct request? That would be simpler than a builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure that would also work. I'll make the respective changes in the next PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in #7867
f3c9e0c to
22ec8a8
Compare
|
|
||
| List<MetadataUpdate> updates(); | ||
|
|
||
| static TableCommit create(TableIdentifier identifier, TableMetadata base, TableMetadata updated) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in #7867
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing major that would block this, but I made a couple of comments. Thanks, @nastra! Good to have transactions moving forward.
The main purpose of this PR is to extract the REST-specific changes out of #6948