API: Add snapshot reference create, replace, rename, and rename operations to ManageSnapshots API #4071
|
This seems much like the Background on I think these changes might fit well into Also, one of the concerns I have with the current PR is the set of operations that we're exposing. For example, doesn't make sense to me to just have So what I propose is to add these changes to
Then there are a couple of other methods that we might want to add:
The What do you think, @amogh-jahagirdar and @jackye1995? |
About cherry-picking, I have the following question and thoughts.
|
@rdblue I will need to think more on this as well. I think ultimately we can work backwards from two points: 1.) cherryPickAll(String sourceBranch, String targetBranch) is a useful operation for Iceberg users (which I think it is; I envision workflows that benefit from being able to cherry-pick snapshots from one lineage to another). 2.) Given 1, and the fact that cherryPickAll cherry-picks the individual snapshots atomically across all snapshots, I don't see an obvious way around using a transaction. After looking more at ManageSnapshots, I do agree that it could likely be combined with UpdateSnapshotRefs (with the APIs you mentioned), but I don't see a way around using the existing transaction abstraction for performing cherryPickAll. Perhaps there is a better model here though. @rdblue For my understanding, is there a reason we are trying to avoid a transaction? Is it mostly for simplicity in the API?
|
@amogh-jahagirdar, I agree that we will probably want to turn the API into a transaction. I think that's not a big deal. Mainly, I'm wondering if you agree that we should combine the ManageSnapshots tasks with the UpdateSnapshotRefs tasks. They seem to me like things you'd want to combine, but I want to make sure you agree. |
|
Sorry @rdblue was a bit late to reply, I agree with the proposal about: I was thinking to have all the fancy operations like rollback that mixes branch, tag and snapshot ID to a higher level API (and we can extend For the issue that the original Going back to the cherry-picking related operations, the fundamental difference of it from the other ones is that it produces a new snapshot with a new manifest list based on the cherry-picked snapshot information. What exactly is our semantics for cherry-picking? If I map it to Git, it is to reapply the same file changes to a different commit. So in Iceberg, suppose I have 2 snapshots produced by I know we currently have some basic cherry-picking for |
|
I'm pretty aligned with using ManageSnapshots for the create/replace/remove branch/tag and renameBranch operations. cherryPickAll is still not fully fleshed out. I am thinking we could have some notion of a TransactionContext that we embed in the API implementation, which would let us hide transaction details behind an API. I still have not thought through a good way to abstract the passing/setting of a transaction context, but basically I'm thinking of a context that can be used internally in API implementations. When the API is called, we implicitly add to a transaction created behind the scenes. Then, when we commit the SnapshotProducer operation, we commit the transaction maintained in the context. Still pretty vague, I know :) Before all that, for cherry-pick I think we should enumerate the cases where we say "This is a conflict we cannot handle". For simplicity, I think we handle the AppendFile/DeleteFile cases. It becomes trickier when, say, we update colA for record1 on the main branch, and then update colA to a different value for record1 on dev-branch. When we then cherry-pick dev-branch onto main, we should fail. This is more of a merge semantic, which we're trying to avoid, but since we're not directly following git semantics it would be good to explicitly outline our desired behavior. We may want to enumerate those cases in a separate issue so the community is aligned on the desired behavior for cherry-pick. @rdblue @jackye1995 let me know your thoughts.
This is a great question. Right now, cherry picking is not like git. Git cherry picking re-applies a diff, but makes no guarantee about the semantic changes. Iceberg cherry picking (so far) gives the same semantics. That's why we currently only support append and overwrites that are replace partitions. For those, we can check that the changes can still be safely applied by re-validating the commit checks. For append, there are no checks so it is safe. For replace partitions, we validate that no new files have been added to the replaced partitions. Iceberg doesn't currently support cherry-picking snapshots that require knowing more about the original operation. For example, DeleteFiles using a filter would need to store the filter and validate that no other data files were added that match the delete filter. Or maybe it would just run the delete filter again. My plan is to add these when people start asking for them. For now, I think cherrypickAll would just run |
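To make the two supported cases above concrete, here is a minimal sketch of the re-validation idea in plain Java. It is a toy model, not Iceberg code: `CherryPickCheckSketch`, the `Operation` enum, and the partition strings are all hypothetical names chosen for illustration. It only encodes what the comment states: appends need no checks, replace-partitions commits are safe when no new files landed in the replaced partitions, and anything else is treated as unsupported.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Toy model only: these types are hypothetical and are not Iceberg classes. */
final class CherryPickCheckSketch {

  /** Simplified operation kinds; the names are illustrative. */
  enum Operation { APPEND, REPLACE_PARTITIONS, DELETE_BY_FILTER }

  /**
   * Returns true when a snapshot of the given kind could be re-applied.
   * partitionsWithNewFiles stands for partitions that received new files
   * on the target lineage after the original commit was staged.
   */
  static boolean canCherryPick(
      Operation operation,
      Set<String> replacedPartitions,
      Set<String> partitionsWithNewFiles) {
    switch (operation) {
      case APPEND:
        // Appends have no commit checks, so they are always safe to re-apply.
        return true;
      case REPLACE_PARTITIONS:
        // Safe only if nothing new was written into the replaced partitions.
        return Collections.disjoint(replacedPartitions, partitionsWithNewFiles);
      default:
        // Would need more knowledge of the original operation (e.g. a delete filter).
        return false;
    }
  }

  public static void main(String[] args) {
    Set<String> replaced = new HashSet<>(Arrays.asList("day=2022-02-01"));
    Set<String> changed = new HashSet<>(Arrays.asList("day=2022-02-02"));
    System.out.println(canCherryPick(Operation.REPLACE_PARTITIONS, replaced, changed));
    System.out.println(canCherryPick(Operation.REPLACE_PARTITIONS, replaced, replaced));
  }
}
```

Under this model, a cherryPickAll over several snapshots would simply apply the same check to each snapshot in order and fail the whole operation on the first unsafe one, which is one reason a transaction looks hard to avoid.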
|
@amogh-jahagirdar, I think we're in agreement on how to handle create/remove/replace operations. Would you like to adapt this PR to add those to the |
|
@rdblue Sure, I have some local changes for these operations. I can update this PR for adding those to the ManageSnapshots API ! |
dd20823 to b4faa27
a14b983 to a5b1d38
|
@rdblue @jackye1995 Updated the API definitions, let me know what you think! |
a5b1d38 to 0418901
|
/**
 * Replace the current head snapshot of the given branch with the snapshot with the given id.
 * If the branch with the given name does not exist, a new branch will be created.
I think this should state the behavior of branch properties, like snapshots to keep.
Also, are you sure that this should create or update the branch? I think that makes sense for writing to a branch (since there is a job that has done work) but I'm not sure about doing it for metadata-only operations.
Yeah (so sorry for the late reply on this), I also missed defining APIs for updating reference retention properties. I definitely agree with creating/updating the branch after writing. If someone wants to reset an existing branch to a new snapshot, they would have to do a drop + create, which is slightly more cumbersome. Perhaps this is not really a meaningful metadata operation, and the explicit drop + create is also better for safety: a user who wants to do this knows explicitly that they are removing a reference and then creating a new one, which avoids unintentional clobbering.
So overall, it probably makes sense to not do it for metadata-only operations, but if we do it only for updating reference retention properties, then the replace name seems kind of odd to me compared to UpdateBranchRetention, UpdateTagRetention etc.
@jackye1995 @rdblue let me know what your thoughts are here.
bdc6d9d to f132776
|
Sorry for the late update on this! @rdblue @jackye1995 when you get a chance, let me know your thoughts. I still see some places where we want to do more validation, but I wanted to get your takes on the API and operation structure first. Once we have clarity there, I'll write unit tests. Some points:
1.) There is no replaceBranch/replaceTag in this API. From looking at previous threads and offline discussions, the overall desire for "replace" in metadata-only operations is to update retention properties. So here the replace* methods are renamed to set branch/tag retention respectively, since I think that name makes more sense, but let me know your thoughts.
2.) The package-private operation is a single operation (ManageSnapshotReferenceOperation) which encapsulates new additions to refs and ref removals. Originally I had multiple operations, but it seemed redundant/overkill to have different internal operations compared to having a single operation that just maintains state. Let me know your preference.
3.) Rename and updating retention each map to multiple MetadataUpdates. We don't have a single "RenameBranchUpdate"; a rename maps to RemoveSnapshotRefUpdate + CreateSnapshotRefUpdate (with the new name). In our operation implementation we ensure the ordering of these (as it seems that with the REST Catalog + this MetadataUpdate approach, the client will expect an ordered list of these updates). My only doubt here is that it seems inefficient to have multiple updates for a rename or a retention update.
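Point 3 can be illustrated with a small, self-contained sketch of how a rename might expand into an ordered pair of updates. This is a toy model under stated assumptions: `RenameRefSketch`, `RefUpdate`, and the string kinds "remove" and "set" are hypothetical stand-ins, not the actual MetadataUpdate classes in the PR, and refs are modeled as a plain name-to-snapshot-id map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: these types are hypothetical and are not Iceberg classes. */
final class RenameRefSketch {

  /** A single metadata change; kind is "remove" or "set". */
  static final class RefUpdate {
    final String kind;
    final String refName;
    final Long snapshotId; // null for removals

    RefUpdate(String kind, String refName, Long snapshotId) {
      this.kind = kind;
      this.refName = refName;
      this.snapshotId = snapshotId;
    }
  }

  /** Expands a rename into an ordered pair: remove the old name, then set the new one. */
  static List<RefUpdate> renameBranch(Map<String, Long> refs, String from, String to) {
    if (!refs.containsKey(from)) {
      throw new IllegalArgumentException("Branch does not exist: " + from);
    }
    if (refs.containsKey(to)) {
      throw new IllegalArgumentException("Ref already exists: " + to);
    }
    List<RefUpdate> updates = new ArrayList<>();
    updates.add(new RefUpdate("remove", from, null));
    updates.add(new RefUpdate("set", to, refs.get(from)));
    return updates;
  }

  /** Applies updates strictly in list order, as a receiver of the list would. */
  static Map<String, Long> apply(Map<String, Long> refs, List<RefUpdate> updates) {
    Map<String, Long> result = new LinkedHashMap<>(refs);
    for (RefUpdate update : updates) {
      if (update.kind.equals("remove")) {
        result.remove(update.refName);
      } else {
        result.put(update.refName, update.snapshotId);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> refs = new LinkedHashMap<>();
    refs.put("dev", 42L);
    Map<String, Long> renamed = apply(refs, renameBranch(refs, "dev", "release"));
    System.out.println(renamed); // the ref now lives under the new name only
  }
}
```

The cost raised in point 3 is visible here: one logical rename produces two updates, and the result depends on the receiver applying them in order.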
|
// Test chained creating and removal of branch and tag
table.manageSnapshots()
    .createBranch("branch2", 1)
I think this should use snapshotId rather than 1. Otherwise the test is brittle and could break if snapshot ID assignment changes.
Updated
11d47dc to 3699fda
d7a2154 to d61cf96
…d fastForward operations
d61cf96 to 3521de5
|
table.newAppend()
    .appendFile(FILE_B)
    .set("wap.id", "123456789")
This ID isn't really needed. stageOnly is sufficient. This is just a way to get Spark to call stageOnly.
Part of breaking up #3883.
This adds the UpdateSnapshotRefs API. Will add an implementation once Table/TableMetadata API implementations are complete, since an implementation would depend on those changes.
Co-authored-by: @hameizi 1249369293@qq.com
Co-authored-by: @jackye1995 yzhaoqin@amazon.com