Core, Spark: Propagate orphaned delete files when rewriting data files #13245
7654f0e to 2d1441a
2d1441a to 176e327
bf88ed3 to 6f270de
6f270de to de16550
      .collect(Collectors.toCollection(DataFileSet::create));
  }

  public Set<DeleteFile> orphanedDVs() {
I think we called these dangling deletes previously.
I previously called this rewritableDeletes, but based on feedback I've renamed it to orphanedDVs().
danglingDVs is a good idea for consistency with the previous naming convention of dangling. I wasn't aware of it before.
I'm OK either way with a slight preference for orphaned; it is inconsistent with the previous naming convention of dangling but I do think it's a much clearer name indicating the state of the DV.
Can these DVs be physically removed like orphaned files now? I think they are just removed from the current snapshot.
stevenzwu left a comment
LGTM.
I like Manu's naming suggestion of dangling for consistency with the existing code.
Thanks @nastra. On a second pass, one thing I realized is that we should probably double-check the metadata-only delete path for Spark (where the predicates of the delete operation can be applied entirely to metadata) to make sure we're cleaning up orphaned DVs in that case too. That path uses the DeleteFiles API, which hasn't been updated to remove orphaned DVs, so unless I'm missing something we would not be cleaning up the orphaned delete files there. The current changes look good in general; I just think there may be an additional case we need to address to make sure spec-compliant metadata is produced.
        .execute();

    assertThat(result.rewrittenDataFilesCount()).isEqualTo(numDataFiles);
    assertThat(result.removedDeleteFilesCount()).isEqualTo(numDataFiles);
Nit: I know numDataFiles == numDeleteFiles in this test case, but it may be clearer to extract a separate variable for numDeleteFiles, or to add an inline comment that we expect one delete file per data file in this test case.
OK. That should be a separate PR from this one.
@stevenzwu Yeah, I think I'm good with it being in a separate PR since it's a different path, but I do consider it a requirement for valid v3 metadata, so it's a release blocker imo.
de16550 to 7296798
Yes, I'm aware of that, and we won't be able to solve this with this PR. This is going to be addressed in #13222, which modifies the
Thanks everyone for the reviews. I'll go ahead and merge this, and I'll also rebase #13222 (which will address the issue when doing metadata-only deletes).
To conform to the v3 spec, we're propagating orphaned DVs down to the actual APIs that rewrite/overwrite data files. That way we can ensure that there are no orphaned DVs left in delete manifests.
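As a rough illustration of the idea described above, here is a minimal, self-contained sketch. It does not use Iceberg's classes or APIs: `OrphanedDvSketch`, `orphanedDvs`, and `commitRewrite` are hypothetical names, and the table state is modeled as a plain map from a DV path to the path of the one data file it references (assuming each DV applies to a single data file). The point is only that a rewrite which removes a data file should also remove the DV that pointed at it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model, not Iceberg code: paths are plain strings.
public class OrphanedDvSketch {

  // A DV becomes orphaned when the data file it references is being rewritten.
  public static Set<String> orphanedDvs(
      Set<String> rewrittenDataFiles, Map<String, String> dvToDataFile) {
    Set<String> orphaned = new TreeSet<>();
    for (Map.Entry<String, String> entry : dvToDataFile.entrySet()) {
      if (rewrittenDataFiles.contains(entry.getValue())) {
        orphaned.add(entry.getKey());
      }
    }
    return orphaned;
  }

  // Models the commit: returns the DVs that stay live after the rewritten
  // data files and their orphaned DVs are removed together.
  public static Map<String, String> commitRewrite(
      Set<String> rewrittenDataFiles, Map<String, String> dvToDataFile) {
    Set<String> orphaned = orphanedDvs(rewrittenDataFiles, dvToDataFile);
    Map<String, String> remaining = new LinkedHashMap<>(dvToDataFile);
    remaining.keySet().removeAll(orphaned);
    return remaining;
  }

  public static void main(String[] args) {
    Map<String, String> dvs = new LinkedHashMap<>();
    dvs.put("dv-1.puffin", "data-1.parquet");
    dvs.put("dv-2.puffin", "data-2.parquet");

    // Only data-1.parquet is compacted in this example.
    Set<String> rewritten = new HashSet<>(Arrays.asList("data-1.parquet"));

    System.out.println("orphaned DVs: " + orphanedDvs(rewritten, dvs));
    System.out.println("remaining DVs: " + commitRewrite(rewritten, dvs).keySet());
  }
}
```

In this model, the DV for the untouched data file stays live, while the DV for the rewritten file is dropped in the same commit instead of lingering in a delete manifest.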