Spark: Detect dangling DVs properly #12270
Conversation
Force-pushed from 0f05ccc to 1c036fe
// dvs pointing to non-existing data files
.or(
    col("data_file.file_format")
        .equalTo(FileFormat.PUFFIN.name())
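For context, the condition above singles out Puffin-format delete files, which is how DVs are stored. Conceptually, a dangling DV is one whose referenced data file no longer appears among the table's live data files. A minimal sketch of that check in plain Java, outside Spark (the `DeleteVector` record and `findDangling` helper are hypothetical names, not Iceberg API):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DanglingDvFinder {
  // A DV (Puffin delete file) records the path of the data file it applies to.
  record DeleteVector(String path, String referencedDataFile) {}

  // A DV is dangling when no live data file matches its referenced path.
  // The real action expresses this as a join over metadata tables in Spark.
  static List<DeleteVector> findDangling(Set<String> liveDataFiles, List<DeleteVector> dvs) {
    return dvs.stream()
        .filter(dv -> !liveDataFiles.contains(dv.referencedDataFile()))
        .collect(Collectors.toList());
  }
}
```

Note that this sketch compares paths by exact string equality, which is precisely what the scheme-mismatch discussion below is about.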
File paths with s3://bucket/.parquet and s3a://bucket/.parquet would still be the same file; it would be nice if we could handle this scenario as well, like we do in orphan file cleanup:
#4346 (comment)
Sorry, I'm not fully following your comment. Can you please elaborate?
We mainly care about DVs here (which are Puffin files) that point to non-existing data files.
Apologies for the confusion, this comment was meant to be on the line below, essentially where we match the data file path against the file path the Puffin file points to.
Can an exact equality check lead to a miss? For example, if the table contains file_path 's3://<tbl_location>/filea.parquet' but the Puffin file points to 's3a://<tbl_location>/filea.parquet', this case would be missed: the only difference is s3 vs. s3a, yet the file is actually there.
Hence I was recommending the above.
I think @singhpk234 is pointing out the potential for a scheme mismatch. The handling done for orphan files is here.
We might be able to normalize the schemes as part of the projection prior to the join in order to avoid these issues.
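The normalization idea can be sketched in plain Java using `java.net.URI`; the `EQUIVALENT_SCHEMES` map and `normalize` helper here are hypothetical, loosely modeled on the scheme-equivalence handling that orphan-file cleanup offers, not actual Iceberg API:

```java
import java.net.URI;
import java.util.Map;

public class SchemeNormalizer {
  // Hypothetical equivalence map: treat s3a/s3n as aliases for s3,
  // similar in spirit to the equal-schemes option in orphan file removal.
  private static final Map<String, String> EQUIVALENT_SCHEMES =
      Map.of("s3a", "s3", "s3n", "s3");

  // Rewrites equivalent schemes so that paths referring to the same object
  // compare equal in a string-based join. Paths without a scheme pass through.
  static String normalize(String location) {
    URI uri = URI.create(location);
    String scheme = uri.getScheme();
    if (scheme == null) {
      return location;
    }
    String canonical = EQUIVALENT_SCHEMES.getOrDefault(scheme, scheme);
    return canonical + location.substring(scheme.length());
  }
}
```

Applying such a function to both sides of the join before comparing would make 's3://...' and 's3a://...' versions of the same path match.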
I'm not sure this is actually an issue. Throughout the codebase, DeleteFile#referencedDataFile() always refers to a DataFile's location, such as here:
String path = dv.referencedDataFile();
Also, when the DeleteFile is created, it always references the location of the DataFile:
.withReferencedDataFile(referencedDataFile)
I agree, and this is not a blocker; I just brought it up to see whether we can handle it. While this may not be an issue for the Java implementation of Iceberg DVs, could it happen, or be missed, in other language implementations? To put it differently: does Iceberg consider 's3://<tbl_location>/data-filea.parquet' and 's3a://<tbl_location>/data-filea.parquet' the same file, and if so, should they be handled the same way throughout?
...5/spark/src/main/java/org/apache/iceberg/spark/actions/RemoveDanglingDeletesSparkAction.java (outdated; resolved)
Force-pushed from 1c036fe to 6faf168, then to 9cba8a1, then to c4cc499
danielcweeks left a comment
+1 @nastra, my only concern is to double check that we don't physically delete the file after removing it. That would be safe for positional delete files but not DVs. I couldn't find a place where that was happening, but we should make sure there's nothing in snapshot expiration or other paths that physically removes the files after we mark the entries deleted.
Thanks for the review @danielcweeks. I'll double check this and will follow up on it in a separate PR.
this backports apache#12270 to Spark 3.4