@nastra
Contributor

@nastra nastra commented Jun 5, 2025

To conform to the v3 spec, we're propagating orphaned DVs down to the actual APIs that rewrite/overwrite data files. That way we can ensure that no orphaned DVs are left in delete manifests.
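
For readers skimming the thread, here is a minimal sketch of what "orphaned" means: a DV becomes orphaned once the data file it references is rewritten or removed. The class, record, and method below are hypothetical stand-ins for illustration only; they are not Iceberg's `DataFile`/`DeleteFile` types and not the code in this PR.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in types for illustration; not Iceberg's actual classes.
final class OrphanedDvSketch {
  // A DV (deletion vector) applies to exactly one data file, identified here by its path.
  record DeletionVector(String path, String referencedDataFile) {}

  // Returns the DVs whose referenced data file is among the data files being removed.
  static Set<DeletionVector> orphanedDVs(
      Set<String> removedDataFiles, List<DeletionVector> liveDVs) {
    return liveDVs.stream()
        .filter(dv -> removedDataFiles.contains(dv.referencedDataFile()))
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    List<DeletionVector> liveDVs =
        List.of(
            new DeletionVector("dv-1.puffin", "data-1.parquet"),
            new DeletionVector("dv-2.puffin", "data-2.parquet"));

    // data-1.parquet is rewritten, so dv-1.puffin no longer applies to any live data file.
    Set<DeletionVector> orphaned = orphanedDVs(Set.of("data-1.parquet"), liveDVs);
    System.out.println(orphaned);
  }
}
```

In this sketch the caller would drop the returned DVs from the delete manifests in the same commit that removes the data files, which is the outcome the description above is aiming for.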

@nastra nastra marked this pull request as draft June 5, 2025 10:15
@nastra nastra force-pushed the remove-orphaned-dvs branch 5 times, most recently from 7654f0e to 2d1441a Compare June 6, 2025 09:53
@nastra nastra requested a review from aokolnychyi June 6, 2025 09:53
@amogh-jahagirdar amogh-jahagirdar added this to the Iceberg 1.10.0 milestone Jun 23, 2025
@nastra nastra requested a review from amogh-jahagirdar June 23, 2025 16:41
@nastra nastra force-pushed the remove-orphaned-dvs branch from 2d1441a to 176e327 Compare June 25, 2025 07:32
@nastra nastra marked this pull request as ready for review June 25, 2025 07:32
@nastra nastra force-pushed the remove-orphaned-dvs branch 3 times, most recently from bf88ed3 to 6f270de Compare June 25, 2025 14:59
@nastra nastra force-pushed the remove-orphaned-dvs branch from 6f270de to de16550 Compare July 2, 2025 06:40
@nastra nastra requested a review from stevenzwu July 2, 2025 06:44
.collect(Collectors.toCollection(DataFileSet::create));
}

public Set<DeleteFile> orphanedDVs() {
Member

I think we called this dangling deletes previously.

Contributor Author

I previously called this rewritableDeletes, but based on feedback I've renamed it to orphanedDVs().

Contributor

@stevenzwu stevenzwu Jul 2, 2025

danglingDVs is a good idea for consistency with the previous naming convention of dangling. I wasn't aware of it before.

Contributor

I'm OK either way with a slight preference for orphaned; it is inconsistent with the previous naming convention of dangling but I do think it's a much clearer name indicating the state of the DV.

Member

Can these DVs be physically removed like orphaned files now? I think they are just removed from the current snapshot.

Contributor

@stevenzwu stevenzwu left a comment

LGTM.

I like Manu's naming suggestion of dangling for consistency with existing code.

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Thanks @nastra. On a second pass, one thing I realized is that we should probably double-check the metadata-only delete path for Spark (where the predicates of the delete operation can be applied entirely to metadata) to make sure we're cleaning up orphan DVs in that case too. That path uses the DeleteFiles API, which hasn't been updated to remove orphan DVs, so unless I'm missing something we would not be cleaning up the orphaned delete files there. The current changes look good in general; I just think there may be an additional case we need to address to make sure spec-compliant metadata is produced.

cc @aokolnychyi @stevenzwu

.execute();

assertThat(result.rewrittenDataFilesCount()).isEqualTo(numDataFiles);
assertThat(result.removedDeleteFilesCount()).isEqualTo(numDataFiles);
Contributor

Nit: I know numDataFiles == numDeleteFiles in this test case, but it may be clearer to extract a separate variable for numDeleteFiles, or just add an inline comment that we expect 1 delete file per data file in this test case.
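
A minimal sketch of the first option, assuming one DV per data file as in this test case. The `Result` record and `check` helper are hypothetical stand-ins so the snippet runs on its own; the real test uses the action's result object and AssertJ's `assertThat`.

```java
// Hypothetical stand-ins for illustration; not the actual test or Iceberg's result type.
final class RewriteCountsSketch {
  record Result(int rewrittenDataFilesCount, int removedDeleteFilesCount) {}

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  static void verify(Result result, int numDataFiles) {
    // This test case writes exactly one DV per data file, so the counts match.
    int numDeleteFiles = numDataFiles;

    check(result.rewrittenDataFilesCount() == numDataFiles, "rewritten data files count");
    check(result.removedDeleteFilesCount() == numDeleteFiles, "removed delete files count");
  }

  public static void main(String[] args) {
    verify(new Result(5, 5), 5);
    System.out.println("ok");
  }
}
```

Naming the second count separately makes the assertion read correctly even if a later change to the test stops producing one delete file per data file.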

@stevenzwu
Contributor

In that case too (where the predicates of the delete operation can completely be applied to metadata); in that case we're using the DeleteFiles API which hasn't been updated to remove the orphan DVs.

OK. That should be a separate PR from this one.

@amogh-jahagirdar
Contributor

@stevenzwu Yeah I think I'm good with it being in a separate PR since it's a different path, but I do consider it a requirement for valid v3 metadata so it's a release blocker imo

@nastra nastra force-pushed the remove-orphaned-dvs branch from de16550 to 7296798 Compare July 3, 2025 05:54
@nastra
Contributor Author

nastra commented Jul 3, 2025

Thanks @nastra. On a second pass, one thing I realized is that we should probably double-check the metadata-only delete path for Spark (where the predicates of the delete operation can be applied entirely to metadata) to make sure we're cleaning up orphan DVs in that case too. That path uses the DeleteFiles API, which hasn't been updated to remove orphan DVs, so unless I'm missing something we would not be cleaning up the orphaned delete files there. The current changes look good in general; I just think there may be an additional case we need to address to make sure spec-compliant metadata is produced.

cc @aokolnychyi @stevenzwu

Yes, I'm aware of that, and we won't be able to solve it in this PR. It will be addressed in #13222, which modifies the MergingSnapshotProducer to pass the files to be deleted to the delete manifest filter manager, which in turn will detect the orphaned DVs.

@nastra
Contributor Author

nastra commented Jul 3, 2025

Thanks everyone for the reviews. I'll go ahead and merge this, and I'll also rebase #13222 (which will address the issue with metadata-only deletes).

@nastra nastra merged commit 8033b32 into apache:main Jul 3, 2025
43 checks passed
@nastra nastra deleted the remove-orphaned-dvs branch July 3, 2025 07:05