@amogh-jahagirdar
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Aug 29, 2022

Currently, snapshot expiration cannot clean up files when a table has multiple branches or tags. Reliable file cleanup in the presence of multiple references requires a reachability analysis to determine which files can safely be removed, so the existing incremental file cleanup cannot be used in this case.

This PR introduces FileCleanupStrategy, a base strategy that implementing classes override with their own file cleanup logic. There are two strategy implementations:

1.) IncrementalFileCleanup, which is used in the RemoveSnapshots procedure when there is only one reference. This is simply the existing file cleanup logic.
2.) A new ReachableFileCleanup, which performs a reachability analysis of manifest lists, manifests, and data files given the previous and current table states.
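
As a rough illustration of how a caller might pick between the two strategies, here is a minimal sketch. Every type name in it (CleanupStrategy, IncrementalCleanup, ReachableCleanup, CleanupStrategySelector) is a hypothetical stand-in rather than a class from this PR, and the selection rule (more than one reference means reachability analysis) is only inferred from the description above.

```java
// Hypothetical, simplified stand-ins; these are NOT Iceberg's actual classes.
interface CleanupStrategy {
  // Short label so a caller (or a test) can see which strategy was chosen.
  String name();
}

// Stand-in for the existing incremental logic (single reference only).
final class IncrementalCleanup implements CleanupStrategy {
  @Override
  public String name() {
    return "incremental";
  }
}

// Stand-in for the new reachability-based logic (multiple branches or tags).
final class ReachableCleanup implements CleanupStrategy {
  @Override
  public String name() {
    return "reachable";
  }
}

final class CleanupStrategySelector {
  private CleanupStrategySelector() {}

  // Assumption: more than one ref (branch or tag) requires reachability analysis;
  // otherwise the cheaper incremental cleanup is enough.
  static CleanupStrategy select(int refCount) {
    return refCount > 1 ? new ReachableCleanup() : new IncrementalCleanup();
  }
}
```

With this sketch, `CleanupStrategySelector.select(1)` yields the incremental stand-in and any larger reference count yields the reachability stand-in.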

Closes #5666.

@github-actions github-actions bot added the core label Aug 29, 2022
@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch from a84f92d to 7e23bd3 Compare August 29, 2022 20:51
@amogh-jahagirdar
Contributor Author

Still very incomplete, just starting a draft.

@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch 5 times, most recently from 8fcb2bc to a6807aa Compare August 30, 2022 00:11
@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch 2 times, most recently from b9525c7 to be9232c Compare August 30, 2022 03:27
@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch 6 times, most recently from c9e1ab8 to 302b9eb Compare September 4, 2022 05:51
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review September 4, 2022 05:51
@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch 3 times, most recently from 324fe11 to 49021ca Compare September 4, 2022 06:04
@amogh-jahagirdar
Contributor Author

cc: @rdblue @namrathamyske @singhpk234 @jackye1995 I moved it out of draft state, would be great to get your thoughts on this. Thanks!

@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch 2 times, most recently from 95cd087 to 8698582 Compare October 1, 2022 21:32
Comment on lines 91 to 87
// Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
// as much of the delete work as possible and avoid orphaned data or manifest files.
SnapshotRef branchToCleanup = Iterables.getFirst(base.refs().values(), null);
if (branchToCleanup == null) {
  return;
}
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 1, 2022

Porting the fix from https://github.com/apache/iceberg/pull/5666/files. This is still needed if we want to support incremental file cleanups in the case that expiration is called on a table with only non-main commits.


protected void deleteMetadataFiles(
    Set<String> manifestsToDelete, Set<String> manifestListsToDelete) {
  log.warn("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 1, 2022

https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L538

This is still warn, as it is currently, but I doubt it really needs to be warn level, or whether we need this logging at all. Warn level implies that something is wrong, though not fatally, which is not the case here. We are intentionally deleting these files after they've been correctly computed. It may mislead users. @rdblue @singhpk234 @jackye1995

private Set<ManifestFile> readManifests(Set<Snapshot> snapshots) {
  Set<ManifestFile> manifestFiles = Sets.newHashSet();
  for (Snapshot snapshot : snapshots) {
    try (CloseableIterable<ManifestFile> manifestFilesForSnapshot = readManifestFiles(snapshot)) {
Contributor

@rdblue rdblue Oct 9, 2022

We may want to do this in parallel using Tasks. The results are copied anyway so there isn't much of a cost to it.
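
To make the suggestion concrete, here is a minimal sketch of reading per-snapshot manifests in parallel and copying the results into one shared set. It deliberately uses plain java.util.concurrent rather than Iceberg's Tasks utility, and the class name ParallelManifestReadSketch, the `manifestsBySnapshot` map, and the string manifest paths are all hypothetical stand-ins for the real snapshot and manifest types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelManifestReadSketch {
  private ParallelManifestReadSketch() {}

  // manifestsBySnapshot is a hypothetical stand-in for reading each snapshot's manifest list.
  static Set<String> readManifests(Map<Long, List<String>> manifestsBySnapshot, int threads) {
    // Thread-safe set, because several tasks add to it concurrently.
    Set<String> manifests = ConcurrentHashMap.newKeySet();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (List<String> snapshotManifests : manifestsBySnapshot.values()) {
        // Each task copies one snapshot's manifests into the shared set.
        pending.add(pool.submit(() -> {
          manifests.addAll(snapshotManifests);
        }));
      }
      for (Future<?> task : pending) {
        task.get(); // wait for completion and surface any failure
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while reading manifests", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Failed to read manifests", e.getCause());
    } finally {
      pool.shutdown();
    }
    return manifests;
  }
}
```

Since the results are copied into a set anyway, duplicates across snapshots collapse naturally and the order in which tasks finish does not matter.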

Contributor Author

Oh, I missed this when addressing the comments. Is this blocking, or could I address it in a follow-on PR along with https://github.com/apache/iceberg/pull/5669/files#r990877250 ?

Contributor

We can do it in a follow up.

Contributor

In fact, we may want to do something like what you do in findFilesToDelete, where rather than reading a set of current manifest files, you just remove manifests from the candidate set. You could also use the same logic to short circuit if there are no more candidate manifests.
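
The idea of pruning a candidate set instead of materializing every current manifest could look roughly like the sketch below. CandidatePruningSketch, its method name, and the use of plain strings for manifest paths are hypothetical; this is not the code in findFilesToDelete, only an illustration of the remove-and-short-circuit pattern described above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CandidatePruningSketch {
  private CandidatePruningSketch() {}

  // candidates: manifests referenced before expiration (hypothetical string paths).
  // currentSnapshotManifests: manifests of each snapshot that is still reachable.
  static Set<String> manifestsToDelete(
      Set<String> candidates, List<Set<String>> currentSnapshotManifests) {
    // Copy so the caller's set is not modified.
    Set<String> remaining = new HashSet<>(candidates);
    for (Set<String> liveManifests : currentSnapshotManifests) {
      if (remaining.isEmpty()) {
        // Short circuit: nothing left to delete, so skip reading further snapshots.
        break;
      }
      // Anything still referenced by a current snapshot must be kept.
      remaining.removeAll(liveManifests);
    }
    return remaining;
  }
}
```

Whatever is left in the returned set is referenced by no current snapshot. In a real implementation the early exit would matter most when each element of the list is expensive to read.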

Contributor

@rdblue rdblue left a comment

@amogh-jahagirdar, this looks very good. There's just one blocker here: #5669 (comment). There are some minor things as well you may want to clean up but once the blocker is in we can commit this. Thank you!

@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch from 1f73232 to df49bda Compare October 10, 2022 02:15
@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch 2 times, most recently from 5ba53a2 to cddc93b Compare October 11, 2022 21:33
@github-actions github-actions bot added the docs label Oct 11, 2022
@amogh-jahagirdar amogh-jahagirdar force-pushed the expire-snapshots-reachability branch from cddc93b to b3e3a47 Compare October 11, 2022 21:34
@Override
@SuppressWarnings({"checkstyle:CyclomaticComplexity", "MethodLength"})
public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpiration) {
  if (afterExpiration.refs().size() > 1) {
Contributor

This method seems way too long. Wouldn't it be better to refactor it into several private methods whose names explain their intent?

Contributor

@zinking, that's a great idea, but not something we should do in this PR. This code was moved from RemoveSnapshots so we want to keep it as close to the original as possible. Would you like to follow up with a PR to simplify it and get rid of the checkstyle warning?

Contributor Author

Yeah, agreed @zinking. As @rdblue mentioned, in this PR we opted to keep the same incremental cleanup logic that existed in the previous RemoveSnapshots implementation. It would be great to refactor this and remove the checkstyle warning.

.onFailure(
    (item, exc) ->
        LOG.warn(
            "Failed to determine live files in manifest {}: this may cause orphaned data files",
Contributor

This isn't true anymore. It should state that there will be a retry.

});

} catch (Throwable e) {
  LOG.warn("Failed to determine the data files to be removed", e);
Contributor

As a follow up, I think this message could be a little better. This makes sense if you know what the code is currently doing, but if you don't then I think it will be hard to understand that the error was in reading current manifests, rather than dropped manifests. I'd use something like "Failed to list all reachable files".

@rdblue rdblue merged commit cd68f9c into apache:master Oct 12, 2022
@rdblue
Contributor

rdblue commented Oct 12, 2022

Thanks, @amogh-jahagirdar! This looks good.

@rdblue
Contributor

rdblue commented Oct 12, 2022

FYI @puneetzaroo and @szehon-ho, this affects incremental table cleanup.
