@amogh-jahagirdar
Contributor

Part of breaking up #3883.

This adds the UpdateSnapshotRefs API. Will add an implementation once Table/TableMetadata API implementations are complete, since an implementation would depend on those changes.

Co-authored-by: @hameizi 1249369293@qq.com
Co-authored-by: @jackye1995 yzhaoqin@amazon.com

@github-actions github-actions bot added the API label Feb 9, 2022
@rdblue
Contributor

rdblue commented Feb 10, 2022

This seems much like the ManageSnapshots API to me. I'm wondering how we should proceed.

Background on ManageSnapshots: this is the API that we use to cherry-pick snapshots and do rollbacks. These high-level operations are really similar to what's happening here. A rollback is setting the snapshot for the main branch, a cherry-pick moves a snapshot between branches (if possible), and you can also set the current snapshot. One thing that I've wanted to do with this API for a while is to make it possible to perform multiple changes as a transaction. That is, pick more than one snapshot and then commit.

I think these changes might fit well into ManageSnapshots. This could enable doing things like createBranch and then cherry-pick into that branch. Plus, we'll need to update ManageSnapshots for branching and tagging anyway. Things like cherry-pick to a branch other than main and cherry-picking a snapshot by tag rather than by ID.

Also, one of the concerns I have with the current PR is the set of operations that we're exposing. For example, it doesn't make sense to me to just have branch without some associated verb, like createBranch, replaceBranch, or removeBranch.

So what I propose is to add these changes to ManageSnapshots and extend the API like this:

  • Add createBranch, replaceBranch, removeBranch, and renameBranch
  • Add createTag, replaceTag and removeTag (not renameTag because it doesn't make much sense, right?)
  • Add rollbackTo(String branchName, long snapshotId)
  • Add cherrypick(String branchName, long snapshotId)

Then there are a couple of other methods that we might want to add:

  • Add cherrypick(String branchName, String tagName)
  • Add cherrypickAll(String branchName, String branchName)

The cherrypickAll method would require us turning this into a transaction and not just a SnapshotUpdate, though. That could require us to do some of these things separately and maybe even deprecate ManageSnapshots and come up with a different name for the combined ManageSnapshots and UpdateSnapshotRefs interface. (Sorry this idea isn't fully baked yet!)

What do you think, @amogh-jahagirdar and @jackye1995?
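
As a rough illustration of the "stage several ref changes, then commit once" idea in the proposal above, here is a minimal in-memory sketch. It is not Iceberg code: the class PendingRefChanges, its fields, and the validation rules are assumptions made for the example, and only the verb-style method names come from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a chained ref-management API.
// Illustrative only; this is not the Iceberg ManageSnapshots implementation.
final class PendingRefChanges {
  private final Map<String, Long> committed; // branch name -> head snapshot ID
  private final Map<String, Long> pending;   // working copy, published on commit()

  PendingRefChanges(Map<String, Long> committed) {
    this.committed = committed;
    this.pending = new HashMap<>(committed);
  }

  PendingRefChanges createBranch(String name, long snapshotId) {
    requireAbsent(name);
    pending.put(name, snapshotId);
    return this;
  }

  PendingRefChanges replaceBranch(String name, long snapshotId) {
    requirePresent(name);
    pending.put(name, snapshotId);
    return this;
  }

  PendingRefChanges renameBranch(String name, String newName) {
    requirePresent(name);
    requireAbsent(newName);
    pending.put(newName, pending.remove(name));
    return this;
  }

  PendingRefChanges removeBranch(String name) {
    requirePresent(name);
    pending.remove(name);
    return this;
  }

  // Publish every staged change at once; nothing is visible before this call.
  void commit() {
    committed.clear();
    committed.putAll(pending);
  }

  private void requirePresent(String name) {
    if (!pending.containsKey(name)) {
      throw new IllegalArgumentException("Branch does not exist: " + name);
    }
  }

  private void requireAbsent(String name) {
    if (pending.containsKey(name)) {
      throw new IllegalArgumentException("Branch already exists: " + name);
    }
  }
}

final class PendingRefChangesDemo {
  public static void main(String[] args) {
    Map<String, Long> refs = new HashMap<>();
    refs.put("main", 1L);
    new PendingRefChanges(refs)
        .createBranch("dev", 1L)
        .replaceBranch("dev", 2L)
        .renameBranch("dev", "audit")
        .commit();
    System.out.println(refs.get("audit")); // prints 2
  }
}
```

Because each verb validates against the working copy and only commit() touches the shared map, a failed step in the chain leaves the committed refs unchanged.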

@hameizi
Contributor

hameizi commented Feb 11, 2022

> Then there are a couple of other methods that we might want to add:
>
>   • Add cherrypick(String branchName, String tagName)
>   • Add cherrypickAll(String branchName, String branchName)
>
> The cherrypickAll method would require us turning this into a transaction and not just a SnapshotUpdate, though. That could require us to do some of these things separately and maybe even deprecate ManageSnapshots and come up with a different name for the combined ManageSnapshots and UpdateSnapshotRefs interface. (Sorry this idea isn't fully baked yet!)

About cherry-pick, I have the following questions and thoughts.
1. First of all, I think it is clear that data conflicts are difficult to deal with, so we can only append the changed data between branches.
2. Based on 1, we should record the changed data between branches using something like a sequence number.
3. Based on 2, if we use a sequence number or anything else to record changes between branches, then we must record the parent branch to mark where the new branch was checked out from. We could allow checkout from any branch, but that would cause trouble for point 2, because different branches inherit different sequence numbers from different parent branches.

@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Feb 11, 2022

@rdblue I will also need to think more on this as well. I think ultimately if we work backwards from:

1.) cherryPickAll(String sourceBranch, String targetBranch) is a useful operation for Iceberg users (which I think it is; I envision workflows that can benefit from being able to cherry-pick snapshots from one lineage to another)

2.) Given 1, and the fact that cherryPickAll does a cherry pick of individual snapshots in an atomic manner across all snapshots, I don't see an obvious way around using a transaction.

After looking more at ManageSnapshots, I do agree that likely it could be combined with UpdateSnapshotRefs (with the APIs you mentioned), but don't see a way around using the existing transaction abstraction for performing cherryPickAll. Perhaps there is a better model here though.

@rdblue For my understanding, is there a reason we are trying to avoid a transaction? Is it mostly for simplicity in the API?

@rdblue
Contributor

rdblue commented Feb 11, 2022

@amogh-jahagirdar, I agree that we will probably want to turn the API into a transaction. I think that's not a big deal. Mainly, I'm wondering if you agree that we should combine the ManageSnapshots tasks with the UpdateSnapshotRefs tasks. They seem to me like things you'd want to combine, but I want to make sure you agree.

@jackye1995
Contributor

jackye1995 commented Feb 14, 2022

Sorry @rdblue was a bit late to reply, I agree with the proposal about:

  • Add createBranch, replaceBranch, removeBranch, and renameBranch
  • Add createTag, replaceTag and removeTag (not renameTag because it doesn't make much sense, right?)
  • Add rollbackTo(String branchName, long snapshotId)
  • Add cherrypick(String branchName, long snapshotId)
  • Add cherrypick(String branchName, String tagName)
  • Add cherrypickAll(String branchName, String branchName)

I was thinking of putting all the fancy operations like rollback, which mix branch, tag, and snapshot ID, in a higher-level API (and we can extend the ManageSnapshots API for anything that fits) and having one lower-level API (this one) that handles the basic ref-only operations, but having them in the same place also works. But I would say we should be careful about mixing snapshot updates and metadata updates. I elaborate on this below.

As for the original branch and tag creation APIs not having verbs, that's probably my bad English; I thought branch could be used as a verb. Having create, replace, remove for each ref type and rename for branch makes sense. What do you think about the API methods for setting tag and branch parameters? I think they are still needed.

Going back to the cherry-picking related operations, the fundamental difference of it from the other ones is that it produces a new snapshot with a new manifest list based on the cherry-picked snapshot information. What exactly is our semantics for cherry-picking? If I map it to Git, it is to reapply the same file changes to a different commit. So in Iceberg, suppose I have 2 snapshots produced by AppendFiles(f1, f2), DeleteFiles(f3) at branch dev, then I do cherryPickAll("main", "dev"), then I would do AppendFiles(f1, f2), DeleteFiles(f3) to main? Then would cherryPickAll be just a fancy wrapper of:

// Replay the dev-branch changes onto main inside a single transaction.
Transaction transaction = table.newTransaction();
transaction.newAppend().appendFile(f1).appendFile(f2).toBranch("main").commit();
transaction.newDelete().deleteFile(f3).toBranch("main").commit();
transaction.commitTransaction();

I know we currently have some basic cherry-picking for APPEND and OVERWRITE (replace partition) type operations, do we want to continue with that approach? Seems like we are trying to reverse-engineer what was done in a commit based on information in different places in the produced snapshot.
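
To make the "reapply the same file changes" reading concrete, here is a small self-contained sketch that replays a list of append/delete changes onto a target branch's file set. It is only a model: FileChange, CherryPickReplay, and the rule that a delete fails when the file is missing from the target are assumptions for illustration, not behavior confirmed anywhere in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: a snapshot is reduced to the file paths it added and deleted.
final class FileChange {
  final Set<String> added;
  final Set<String> deleted;

  FileChange(Set<String> added, Set<String> deleted) {
    this.added = added;
    this.deleted = deleted;
  }

  static FileChange append(String... files) {
    return new FileChange(new HashSet<>(Arrays.asList(files)), new HashSet<>());
  }

  static FileChange delete(String... files) {
    return new FileChange(new HashSet<>(), new HashSet<>(Arrays.asList(files)));
  }
}

final class CherryPickReplay {
  // Replays each change in order on a copy of the target's file set.
  // Any failure leaves the caller's set untouched (all-or-nothing).
  static Set<String> replayAll(Set<String> targetFiles, List<FileChange> changes) {
    Set<String> result = new HashSet<>(targetFiles);
    for (FileChange change : changes) {
      // Assumed conflict rule: a delete cannot be replayed if the file is absent.
      if (!result.containsAll(change.deleted)) {
        throw new IllegalStateException("Target is missing files to delete: " + change.deleted);
      }
      result.removeAll(change.deleted);
      result.addAll(change.added);
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> main = new HashSet<>(Arrays.asList("f0", "f3"));
    List<FileChange> dev = Arrays.asList(FileChange.append("f1", "f2"), FileChange.delete("f3"));
    Set<String> picked = replayAll(main, dev);
    System.out.println(picked.size()); // prints 3 (f0, f1, f2)
  }
}
```

The sketch deliberately knows nothing about why a file was deleted, which is the limitation raised above: replaying raw file changes is not the same as replaying the original operation.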

@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Feb 15, 2022

I'm pretty aligned with using ManageSnapshots for the create/replace/remove branch/tag, renameBranch.

public interface ManageSnapshots extends PendingUpdate<Snapshot> {
  ManageSnapshots setCurrentSnapshot(long snapshotId);
  ManageSnapshots rollbackToTime(long timestampMillis);
  ManageSnapshots rollbackTo(long snapshotId);
  ManageSnapshots cherrypick(long snapshotId);
  ManageSnapshots createBranch(String branch, long snapshotId);
  ManageSnapshots renameBranch(String branch, String newBranch);
  ManageSnapshots replaceBranch(String branch, long snapshotId);
  ManageSnapshots removeBranch(String branchName);
  ManageSnapshots createTag(String tag, long snapshotId);
  ManageSnapshots replaceTag(String tag, long snapshotId);
  ManageSnapshots removeTag(String tag);
  ManageSnapshots rollbackTo(String branchName, long snapshotId);
}

For cherryPickAll, this is still not fully fleshed out. I am thinking we could have some notion of a TransactionContext that we embed in the API implementation, which would allow us to hide transaction details behind the API. I still have not thought through a good way to abstract the passing/setting of a transaction context, but basically I'm thinking of a context that can be used internally in API implementations. When the API is called, we implicitly add to a transaction created behind the scenes. Then, when we commit the SnapshotProducer operation, we commit the transaction maintained in the context. Still pretty vague, I know :)

Before all that, for cherry-pick I think we should enumerate the cases where we say "This is a conflict we cannot handle". I think for simplicity we handle the AppendFile/DeleteFile cases. It becomes trickier when, say, we update colA for record1 on the main branch, and then update colA to a different value for record1 on dev-branch. Then when we cherry-pick dev-branch onto main, we should fail. This is more of a merge semantic, which we're trying to avoid, but since we're not directly following git semantics it would be good to explicitly outline our desired behavior.

We may want to enumerate those cases in a separate issue so the community is aligned on the desired behavior for cherry-pick.

@rdblue @jackye1995 let me know your thoughts

@rdblue
Contributor

rdblue commented Feb 20, 2022

> Going back to the cherry-picking related operations, the fundamental difference of it from the other ones is that it produces a new snapshot with a new manifest list based on the cherry-picked snapshot information. What exactly is our semantics for cherry-picking?

This is a great question. Right now, cherry picking is not like git. Git cherry picking re-applies a diff, but makes no guarantee about the semantic changes. Iceberg cherry picking (so far) preserves the semantics of the original operation. That's why we currently only support appends and overwrites that are replace partitions. For those, we can check that the changes can still be safely applied by re-validating the commit checks. For append, there are no checks, so it is safe. For replace partitions, we validate that no new files have been added to the replaced partitions.

Iceberg doesn't currently support cherry-picking snapshots that require knowing more about the original operation. For example, DeleteFiles using a filter would need to store the filter and validate that no other data files were added that match the delete filter. Or maybe it would just run the delete filter again. My plan is to add these when people start asking for them.

For now, I think cherrypickAll would just run cherrypick in a loop. The main benefit would be keeping track of where the branch diverged. But this is probably more advanced than we need for now. In the short term, I'd focus on getting the branching and tagging parts in. Cherry picking a branch is something we can add later.
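
The validation rules described in this comment can be sketched as a small decision function. This is an assumption-laden model rather than Iceberg's implementation: the enum StagedOperation, the class CherryPickValidation, and the parameter partitionsAddedSince (partitions that received new files on the target after the staged snapshot was created) are all hypothetical names.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical operation kinds; OTHER covers anything that would need
// more knowledge of the original operation (for example, a filter delete).
enum StagedOperation { APPEND, REPLACE_PARTITIONS, OTHER }

final class CherryPickValidation {
  // Returns true when the staged change can be re-applied with the same meaning.
  static boolean canCherryPick(
      StagedOperation op,
      Set<String> replacedPartitions,
      Set<String> partitionsAddedSince) {
    switch (op) {
      case APPEND:
        // Appends have no commit checks, so they are always safe to re-apply.
        return true;
      case REPLACE_PARTITIONS:
        // Safe only if no new files landed in any partition being replaced.
        return Collections.disjoint(replacedPartitions, partitionsAddedSince);
      default:
        // Not supported until the original operation's intent is recorded.
        return false;
    }
  }

  public static void main(String[] args) {
    Set<String> replaced = Collections.singleton("day=2022-02-19");
    Set<String> added = Collections.singleton("day=2022-02-20");
    System.out.println(canCherryPick(StagedOperation.REPLACE_PARTITIONS, replaced, added)); // prints true
  }
}
```

Under this model, a cherrypickAll that runs cherrypick in a loop would simply stop at the first snapshot for which the check returns false.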

@rdblue
Contributor

rdblue commented Feb 20, 2022

@amogh-jahagirdar, I think we're in agreement on how to handle create/remove/replace operations. Would you like to adapt this PR to add those to the ManageSnapshots API?

@amogh-jahagirdar
Contributor Author

@rdblue Sure, I have some local changes for these operations. I can update this PR to add those to the ManageSnapshots API!

@github-actions github-actions bot added the core label Feb 22, 2022
@amogh-jahagirdar amogh-jahagirdar force-pushed the snapshot-ref-table-api branch 2 times, most recently from a14b983 to a5b1d38 Compare February 22, 2022 17:31
@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Feb 22, 2022

@rdblue @jackye1995 Updated the API definitions, let me know what you think!

@amogh-jahagirdar amogh-jahagirdar changed the title from "API: Add UpdateSnapshotRefs to API" to "API: Add snapshot reference create, replace, rename, and rename operations to ManageSnapshots API" Feb 22, 2022

/**
* Replace the current head snapshot of the given branch with the snapshot with the given id.
* If the branch with the given name does not exist, a new branch will be created.
Contributor

I think this should state the behavior of branch properties, like snapshots to keep.

Also, are you sure that this should create or update the branch? I think that makes sense for writing to a branch (since there is a job that has done work) but I'm not sure about doing it for metadata-only operations.

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Feb 28, 2022

Yeah (so sorry for the late reply on this), I also missed out on defining APIs for updating reference retention properties. I definitely agree with creating/updating the branch after writing. If someone wants to reset an existing branch to a new snapshot, they would have to do a drop + create, which is slightly more cumbersome. Perhaps this is not really a meaningful metadata operation, and the explicit drop + create is also better for safety (a user who wants to do this knows explicitly that they are removing a reference and then creating a new one), since it avoids unintentional clobbering.

So overall, it probably makes sense to not do it for metadata-only operations, but if we do it only for updating reference retention properties, then the replace name seems kind of odd to me compared to UpdateBranchRetention, UpdateTagRetention etc.

@jackye1995 @rdblue let me know what your thoughts are here.

@amogh-jahagirdar amogh-jahagirdar force-pushed the snapshot-ref-table-api branch 4 times, most recently from bdc6d9d to f132776 Compare March 1, 2022 01:24
@amogh-jahagirdar
Contributor Author

Sorry for the late update on this! @rdblue @jackye1995 when you get a chance let me know your thoughts. I still see some points where we want to do more validation but wanted to get your takes on the API, and operation structure. After we get clarity there, then I'll write unit tests. Some points:

1.) There is no replaceBranch/replaceTag in this API. From looking at previous threads and offline discussions, the overall desire for "replace" for just metadata operations is to update retention properties. So in this API, the replace* methods are renamed to setBranch/setTag retention respectively, since I think that name makes more sense, but let me know your thoughts.
If we want replace to mean "take an existing branch and update it to a new snapshot, or create a new branch if one does not exist", a user could do drop/create, but it's clunkier that way. As @rdblue mentioned, it seems like the main use case is to allow someone to write data to a branch, and if the branch doesn't exist we create it, because performing the write is expensive. Unless I'm mistaken, this would be more related to SnapshotProducer changes than to metadata changes?
Let me know your thoughts.

2.) The package-private operation is just a single operation (ManageSnapshotReferenceOperation) which encapsulates new additions to refs and ref removals. Originally, I had multiple operations, but it seemed redundant/overkill to have different internal operations compared to having a single operation maintain state. Let me know your preference.

3.) For rename, updating retention, these map to multiple MetadataUpdate. We don't have a single "RenameBranchUpdate", this will map to RemoveSnapshotRefUpdate + CreateSnapshotRefUpdate (with the new name). In our operation implementation we ensure the ordering of these (as it seems like with REST Catalog + this Metadata Update approach, the client will expect an ordered list of these updates). My only doubt here is it seems inefficient to have multiple updates for a rename/update retention.
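
Point 3 can be pictured with a minimal sketch in which a rename is expanded into an ordered pair of updates. The types here (RefUpdate, RemoveRef, CreateRef, RenamePlan) are simplified, hypothetical stand-ins for the RemoveSnapshotRefUpdate and CreateSnapshotRefUpdate mentioned above, and refs are reduced to a name-to-snapshot-ID map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified metadata update applied to a ref map.
interface RefUpdate {
  void applyTo(Map<String, Long> refs);
}

final class RemoveRef implements RefUpdate {
  private final String name;

  RemoveRef(String name) {
    this.name = name;
  }

  @Override
  public void applyTo(Map<String, Long> refs) {
    if (refs.remove(name) == null) {
      throw new IllegalStateException("Ref does not exist: " + name);
    }
  }
}

final class CreateRef implements RefUpdate {
  private final String name;
  private final long snapshotId;

  CreateRef(String name, long snapshotId) {
    this.name = name;
    this.snapshotId = snapshotId;
  }

  @Override
  public void applyTo(Map<String, Long> refs) {
    if (refs.putIfAbsent(name, snapshotId) != null) {
      throw new IllegalStateException("Ref already exists: " + name);
    }
  }
}

final class RenamePlan {
  // Expands a rename into remove(old) followed by create(new); order matters.
  static List<RefUpdate> renameBranch(Map<String, Long> refs, String from, String to) {
    Long snapshotId = refs.get(from);
    if (snapshotId == null) {
      throw new IllegalArgumentException("Branch does not exist: " + from);
    }
    List<RefUpdate> updates = new ArrayList<>();
    updates.add(new RemoveRef(from));
    updates.add(new CreateRef(to, snapshotId));
    return updates;
  }

  public static void main(String[] args) {
    Map<String, Long> refs = new HashMap<>();
    refs.put("dev", 7L);
    for (RefUpdate update : renameBranch(refs, "dev", "audit")) {
      update.applyTo(refs); // apply in list order
    }
    System.out.println(refs.get("audit")); // prints 7
  }
}
```

The two-update form costs one extra entry per rename, which is the inefficiency raised in point 3; the trade-off is that no dedicated rename update type is needed.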


// Test chained creating and removal of branch and tag
table.manageSnapshots()
.createBranch("branch2", 1)
Contributor

I think this should use snapshotId rather than 1. Otherwise the test is brittle and could break if snapshot ID assignment changes.

Contributor Author

Updated

@amogh-jahagirdar amogh-jahagirdar force-pushed the snapshot-ref-table-api branch 9 times, most recently from 11d47dc to 3699fda Compare April 13, 2022 01:23
@amogh-jahagirdar amogh-jahagirdar force-pushed the snapshot-ref-table-api branch 2 times, most recently from d7a2154 to d61cf96 Compare April 13, 2022 01:39

table.newAppend()
.appendFile(FILE_B)
.set("wap.id", "123456789")
Contributor

This ID isn't really needed. stageOnly is sufficient. This is just a way to get Spark to call stageOnly.


4 participants