Spark: Detect dangling DVs properly #12270
Conversation
Force-pushed from 0f05ccc to 1c036fe
// dvs pointing to non-existing data files
.or(
    col("data_file.file_format")
        .equalTo(FileFormat.PUFFIN.name())
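For context, the condition above singles out Puffin-format delete files, which is how DVs are stored. Conceptually, a dangling DV is one whose referenced data file no longer appears among the table's live data files. A minimal sketch of that check in plain Java, outside Spark (the `DeleteVector` record and `findDangling` helper are hypothetical names, not Iceberg API):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DanglingDvFinder {
  // A DV (Puffin delete file) records the path of the data file it applies to.
  record DeleteVector(String path, String referencedDataFile) {}

  // A DV is dangling when no live data file matches its referenced path.
  // The real action expresses this as a join over metadata tables in Spark.
  static List<DeleteVector> findDangling(Set<String> liveDataFiles, List<DeleteVector> dvs) {
    return dvs.stream()
        .filter(dv -> !liveDataFiles.contains(dv.referencedDataFile()))
        .collect(Collectors.toList());
  }
}
```

Note that this sketch compares paths by exact string equality, which is precisely what the scheme-mismatch discussion below is about.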
File paths with s3://bucket/.parquet and s3a://bucket/.parquet would still be the same file; it would be nice if we could handle this scenario as well, like we do in orphan file cleanup:
#4346 (comment)
Sorry, I'm not fully following your comment. Can you please elaborate?
We mainly care about DVs here (which are Puffin files) that point to non-existing data files.
Apologies for the confusion, this comment was meant to be on the line below, essentially where we match the data file path against the file path the Puffin file points to.
Can an exact equality check lead to a miss? For example, if the table contains file_path 's3://<tbl_location>/filea.parquet' but the Puffin file points to 's3a://<tbl_location>/filea.parquet', this case would be missed: the only difference is s3 vs. s3a, yet the file is actually there.
Hence I was recommending the above.
I think @singhpk234 is pointing out the potential for a scheme mismatch. The handling done for orphan files is here.
We might be able to normalize the schemes as part of the projection prior to the join in order to avoid these issues.
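The normalization idea can be sketched in plain Java using `java.net.URI`; the `EQUIVALENT_SCHEMES` map and `normalize` helper here are hypothetical, loosely modeled on the scheme-equivalence handling that orphan-file cleanup offers, not actual Iceberg API:

```java
import java.net.URI;
import java.util.Map;

public class SchemeNormalizer {
  // Hypothetical equivalence map: treat s3a/s3n as aliases for s3,
  // similar in spirit to the equal-schemes option in orphan file removal.
  private static final Map<String, String> EQUIVALENT_SCHEMES =
      Map.of("s3a", "s3", "s3n", "s3");

  // Rewrites equivalent schemes so that paths referring to the same object
  // compare equal in a string-based join. Paths without a scheme pass through.
  static String normalize(String location) {
    URI uri = URI.create(location);
    String scheme = uri.getScheme();
    if (scheme == null) {
      return location;
    }
    String canonical = EQUIVALENT_SCHEMES.getOrDefault(scheme, scheme);
    return canonical + location.substring(scheme.length());
  }
}
```

Applying such a function to both sides of the join before comparing would make 's3://...' and 's3a://...' versions of the same path match.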
I'm not sure this is actually an issue. Throughout the codebase, DeleteFile#referencedDataFile() always refers to a DataFile's location, such as here:
String path = dv.referencedDataFile();
Also, when the DeleteFile is created, it always references the location of the DataFile:
.withReferencedDataFile(referencedDataFile)
I agree, and this is not a blocker; I just brought it up to see whether we can handle it. While this may not be an issue for the Java implementation of Iceberg DVs, could it happen, or be missed, in other language implementations? To put it differently: does Iceberg consider 's3://<tbl_location>/data-filea.parquet' and 's3a://<tbl_location>/data-filea.parquet' the same file, and if so, should they be handled the same way throughout?
...5/spark/src/main/java/org/apache/iceberg/spark/actions/RemoveDanglingDeletesSparkAction.java (outdated; resolved)
Force-pushed from 1c036fe to 6faf168, then to 9cba8a1, then to c4cc499
danielcweeks left a comment
+1 @nastra, my only concern is to double check that we don't physically delete the file after removing it. That would be safe for positional delete files but not DVs. I couldn't find a place where that was happening, but we should make sure there's nothing in snapshot expiration or other paths that physically removes the files after we mark the entries deleted.
Thanks for the review @danielcweeks. I'll double check this and will follow up on it in a separate PR.
this backports apache#12270 to Spark 3.4