@nastra
Contributor

@nastra nastra commented Feb 14, 2025

No description provided.

@github-actions github-actions bot added the spark label Feb 14, 2025
@nastra nastra marked this pull request as draft February 14, 2025 14:51
@nastra nastra force-pushed the dangling-dvs-after-compaction branch 2 times, most recently from 0f05ccc to 1c036fe Compare February 14, 2025 15:31
@nastra nastra requested a review from danielcweeks February 14, 2025 16:22
// dvs pointing to non-existing data files
.or(
col("data_file.file_format")
.equalTo(FileFormat.PUFFIN.name())
Contributor

File paths such as s3://bucket/.parquet and s3a://bucket/.parquet would still refer to the same file. It would be nice if we could handle this scenario as well, like we do in orphan file cleanup:
#4346 (comment)

Contributor Author

Sorry, I'm not fully following your comment. Could you please elaborate?
We mainly care about DVs here (which are Puffin files) that point to non-existing data files.

Contributor

Apologies for the confusion. This comment was meant for the line below, where we match the data file path against the file path the Puffin file points to.

Can an exact equality check lead to a miss? For example, suppose the table contains a data file with file_path 's3://<tbl_location>/filea.parquet', but the Puffin file points to 's3a://<tbl_location>/filea.parquet'. Because we do an exact not-equal check, the two paths would be treated as different files even though the only difference is s3 vs. s3a and the file is still there.

Hence I was recommending the above.

Contributor

@danielcweeks danielcweeks Feb 19, 2025

I think @singhpk234 is pointing out the potential for scheme mismatch. The handling done in orphan files is here

We might be able to normalize the schemes as part of the projection prior to the join in order to avoid these issues.
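
To illustrate the idea of normalizing schemes before matching, here is a minimal plain-Java sketch rather than the Spark projection used in this PR. `PathNormalizer`, `EQUAL_SCHEMES`, and `danglingReferences` are hypothetical names introduced only for this example, and the s3a/s3n to s3 mapping is an assumption about which schemes a deployment wants to treat as equal.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Iceberg API.
final class PathNormalizer {
  // Assumed scheme equivalences; which schemes count as equal is a deployment choice.
  static final Map<String, String> EQUAL_SCHEMES = Map.of("s3a", "s3", "s3n", "s3");

  private PathNormalizer() {}

  // Rewrites the scheme of a location to its canonical form and leaves the rest untouched.
  static String normalize(String location) {
    int idx = location.indexOf("://");
    if (idx <= 0) {
      return location; // no scheme present, nothing to normalize
    }
    String scheme = location.substring(0, idx).toLowerCase(Locale.ROOT);
    String canonical = EQUAL_SCHEMES.getOrDefault(scheme, scheme);
    return canonical + location.substring(idx);
  }

  // Returns the DV references that match no live data file once both sides are normalized.
  static List<String> danglingReferences(
      List<String> dataFilePaths, List<String> dvReferencedPaths) {
    Set<String> live = new HashSet<>();
    for (String path : dataFilePaths) {
      live.add(normalize(path));
    }
    List<String> dangling = new ArrayList<>();
    for (String ref : dvReferencedPaths) {
      if (!live.contains(normalize(ref))) {
        dangling.add(ref);
      }
    }
    return dangling;
  }
}
```

With this sketch, a DV that references 's3a://bucket/data/filea.parquet' is not reported as dangling when the table lists 's3://bucket/data/filea.parquet', whereas an exact string comparison would report it.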

Contributor Author

I'm not sure this is actually an issue. Throughout the codebase, DeleteFile#referencedDataFile() always refers to a DataFile's location, such as here:

String path = dv.referencedDataFile();

Also when the DeleteFile is created, it always references the location of the DataFile:

.withReferencedDataFile(referencedDataFile)

Contributor

@singhpk234 singhpk234 Feb 26, 2025

I agree, and this is not a blocker; I just brought it up to see if we can handle it. While this may not be an issue for the Java implementation of Iceberg DVs, could it happen, or be missed, in other language implementations? To put it differently: does Iceberg consider files at 's3://<tbl_location>/data-filea.parquet' and 's3a://<tbl_location>/data-filea.parquet' to be the same, and should we make sure they are handled the same way throughout?

@nastra nastra force-pushed the dangling-dvs-after-compaction branch from 1c036fe to 6faf168 Compare February 26, 2025 09:19
@nastra nastra marked this pull request as ready for review February 26, 2025 09:20
@nastra nastra force-pushed the dangling-dvs-after-compaction branch from 6faf168 to 9cba8a1 Compare March 7, 2025 14:57
@nastra nastra marked this pull request as draft March 10, 2025 10:05
@nastra nastra force-pushed the dangling-dvs-after-compaction branch from 9cba8a1 to c4cc499 Compare March 10, 2025 15:55
@RussellSpitzer RussellSpitzer self-assigned this Mar 11, 2025
Contributor

@danielcweeks danielcweeks left a comment

+1 @nastra my only concern is to double check that we don't physically delete the file after removing it. That would be safe for PosDel files but not DVs. I couldn't find a place where that was happening, but we should make sure that there's nothing in the snapshot expiration or other paths that physically remove the files after we mark the entries deleted.

@nastra nastra marked this pull request as ready for review March 20, 2025 06:38
@nastra
Contributor Author

nastra commented Mar 20, 2025

> +1 @nastra my only concern is to double check that we don't physically delete the file after removing it. That would be safe for PosDel files but not DVs. I couldn't find a place where that was happening, but we should make sure that there's nothing in the snapshot expiration or other paths that physically remove the files after we mark the entries deleted.

Thanks for the review, @danielcweeks. I'll double-check this and follow up in a separate PR.

@nastra nastra merged commit 017559e into apache:main Mar 20, 2025
27 checks passed
@nastra nastra deleted the dangling-dvs-after-compaction branch March 20, 2025 06:40
nastra added a commit to nastra/iceberg that referenced this pull request Mar 21, 2025
nastra added a commit to nastra/iceberg that referenced this pull request Mar 21, 2025
nastra added a commit to nastra/iceberg that referenced this pull request Mar 21, 2025
amogh-jahagirdar pushed a commit that referenced this pull request Mar 22, 2025
lliangyu-lin pushed a commit to lliangyu-lin/iceberg that referenced this pull request Mar 23, 2025