Conversation

@Guosmilesmile
Contributor

Flink may soon support DeleteOrphans (#13302), and some of that code can be reused from Spark.
The main goal of this PR is to refactor the commonly used code out of the Spark module and into core, so that Flink can reuse it.

@Guosmilesmile
Contributor Author

I have already created separate PRs for Spark and Core. I hope you can find some time to review the corresponding parts of the code. @pvary @RussellSpitzer @szehon-ho @liziyan-lzy

Thank you very much!
GuoYu

@pvary
Contributor

pvary commented Jul 1, 2025

Hi @Guosmilesmile,
This highlights what I missed during the review of the main PR: we need unit tests for the new API. Could you please create them?
Thanks, Peter

@Guosmilesmile
Contributor Author

@pvary I will submit the corresponding unit tests as soon as possible.

@Guosmilesmile Guosmilesmile force-pushed the delete_orphans_spark branch from a42064b to ea05aba on July 1, 2025 08:51

List<String> foundFiles = Lists.newArrayList();
Predicate<FileInfo> fileFilter = fileInfo -> fileInfo.location().endsWith(".txt");
FileSystemWalker.listDirRecursivelyWithFileIO(
Contributor

Could we make this test parametrized and run this with listDirRecursivelyWithHadoop too?
Like:

  @ParameterizedTest
  @ValueSource(booleans = {true, false})
  public void testPartitionAwareHiddenPathFilter(boolean hadoop) throws IOException {

Contributor Author

Ok, I have changed it.

Comment on lines +84 to +90
public boolean schemeMatch(FileURI another) {
return uriComponentMatch(scheme, another.getScheme());
}

public boolean authorityMatch(FileURI another) {
return uriComponentMatch(authority, another.getAuthority());
}
Contributor

Could we create a few small unit tests for this as well?
With or without schemes, matching and not matching, etc.

Contributor Author

Ok, I have done it.
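
For reference, a minimal sketch of what such tests could look like. The FileURI(URI) constructor used here is an assumption (see the constructor discussion further down this thread), and the expected behavior for URIs without a scheme depends on how uriComponentMatch treats missing components:

import static org.assertj.core.api.Assertions.assertThat;

import java.net.URI;
import org.junit.jupiter.api.Test;

public class TestFileURIMatching {

  @Test
  public void testSchemeMatch() {
    // Assumes FileURI can be constructed from a java.net.URI.
    FileURI s3File = new FileURI(URI.create("s3://bucket/data/file.txt"));
    FileURI s3Other = new FileURI(URI.create("s3://other-bucket/file.txt"));
    FileURI hdfsFile = new FileURI(URI.create("hdfs://namenode/data/file.txt"));

    assertThat(s3File.schemeMatch(s3Other)).isTrue();
    assertThat(s3File.schemeMatch(hdfsFile)).isFalse();
  }

  @Test
  public void testAuthorityMatch() {
    FileURI bucketA = new FileURI(URI.create("s3://bucket-a/file.txt"));
    FileURI bucketAOther = new FileURI(URI.create("s3://bucket-a/other.txt"));
    FileURI bucketB = new FileURI(URI.create("s3://bucket-b/file.txt"));

    assertThat(bucketA.authorityMatch(bucketAOther)).isTrue();
    assertThat(bucketA.authorityMatch(bucketB)).isFalse();
  }

  // Per the review comment, URIs with no scheme (e.g. "/data/file.txt") are
  // also worth a case; the expected result there is defined by uriComponentMatch.
}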

}

@Test
public void testListDirRecursivelyWithFileIO() {
Contributor

Could we add a test to verify the behavior of the FileIO prefix listing for the following scenario:
path_a: /base_path/normal_dir
path_b: /base_path/normal_dir_1
The test should ensure that when listing files using the FileIO prefix listing for path_a, files located under path_b are not included in the results.

Contributor Author

@Guosmilesmile Guosmilesmile Jul 7, 2025

Okay, I have added the corresponding unit test.
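
For context, the pitfall the requested test guards against: a plain string prefix match on /base_path/normal_dir also matches everything under the sibling /base_path/normal_dir_1. A self-contained sketch of the string logic involved (illustrative only, not the Iceberg implementation):

public class PrefixListingCheck {
  // Naive check: "/base_path/normal_dir_1/data.txt" starts with
  // "/base_path/normal_dir", so the sibling's file is wrongly included.
  static boolean naivelyUnderDir(String location, String dir) {
    return location.startsWith(dir);
  }

  // Stricter check: the directory prefix must be followed by a separator,
  // which excludes files under the sibling directory.
  static boolean strictlyUnderDir(String location, String dir) {
    String prefix = dir.endsWith("/") ? dir : dir + "/";
    return location.startsWith(prefix);
  }

  public static void main(String[] args) {
    String siblingFile = "/base_path/normal_dir_1/data.txt";
    String dir = "/base_path/normal_dir";
    System.out.println(naivelyUnderDir(siblingFile, dir));  // true (wrong)
    System.out.println(strictlyUnderDir(siblingFile, dir)); // false (right)
  }
}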

@Guosmilesmile
Contributor Author

Hey @RussellSpitzer @szehon-ho, if you have time, please take a moment to help review this.

Thank you very much!
GuoYu

@Guosmilesmile
Contributor Author

Gentle ping to reviewers @RussellSpitzer @szehon-ho


import java.net.URI;
import java.util.Map;
import org.apache.hadoop.fs.Path;
Member

Any chance we can leave Hadoop dependencies out of this class, since it's essentially just acting as a container class? For example, we could provide public FileURI(URI uri); then a caller can use new org.apache.hadoop.fs.Path(uriString) on its own side.

Just don't want to spread Hadoop classes around the project if possible.

Contributor Author

I modified the constructor to remove the Hadoop dependency from this class.
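
A rough sketch of the shape of that change (field names are illustrative; the real FileURI keeps whatever parsing and normalization it already had):

import java.net.URI;

public class FileURI {
  private final String scheme;
  private final String authority;
  private final String path;

  // Built from java.net.URI, so this container class needs no Hadoop imports.
  public FileURI(URI uri) {
    this.scheme = uri.getScheme();
    this.authority = uri.getAuthority();
    this.path = uri.getPath();
  }

  public String getScheme() {
    return scheme;
  }

  public String getAuthority() {
    return authority;
  }
}

A caller that starts from a location string can still convert on its own side, e.g. new FileURI(new org.apache.hadoop.fs.Path(uriString).toUri()), which keeps the Hadoop dependency with the caller.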


Iterable<FileInfo> files = io.listPrefix(listPath);
for (FileInfo file : files) {
Path path = new Path(file.location());
Member

We can save removing the Path dependency here for later :)

Contributor Author

Ok :)

Contributor Author

@RussellSpitzer Sorry for the delayed question. In the current FileSystemWalker, is it okay to keep the Hadoop dependency for now, as there isn't a suitable way to remove it?

Member

Yep!

List<String> foundFiles = Lists.newArrayList();
List<String> remainingDirs = Lists.newArrayList();
String path = basePath + "/normal_dir";
if (useHadoop) {
Member

This may just be me personally, but I would prefer we extract this branch into a private function to avoid repeating the complexity in every test.

I generally want each test to have as little setup as possible enumerated in the test body. Here, the actual method invocation is, I think, not that important, yet it takes up most of the method body.

Contributor Author

Right. I extracted this part into a separate method.
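
The extraction might look roughly like the following sketch, with the FileSystemWalker calls left as comments because the real signatures carry more parameters (io, specs, conf, ...):

// Hypothetical helper: hides the Hadoop-vs-FileIO branch so each test body
// reduces to a single call. The calls below are elided, not real signatures.
private void listDir(
    boolean useHadoop,
    String path,
    List<String> foundFiles,
    List<String> remainingDirs) {
  if (useHadoop) {
    // FileSystemWalker.listDirRecursivelyWithHadoop(path, ..., foundFiles, remainingDirs);
  } else {
    // FileSystemWalker.listDirRecursivelyWithFileIO(path, ..., foundFiles, remainingDirs);
  }
}

Each test body then becomes a one-liner such as listDir(useHadoop, basePath + "/normal_dir", foundFiles, remainingDirs).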

Member

@RussellSpitzer RussellSpitzer left a comment

Looks good to me. I have a few nits about testing and about trying to reduce the number of new classes with Hadoop deps, but I think this is generally a pretty clean way of pulling out the DeleteOrphanFiles code for general use.

*
* @param io File system interface supporting prefix operations
* @param dir Base directory to start recursive listing
* @param specs Map of {@link PartitionSpec partition specs} for this table.
Member

We should add a little more explanation than this. The specs object is used to build a filter that decides which directories to investigate, and we should probably note that here.

Contributor Author

Fixed.
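
One way the expanded description could read (suggested wording only, not the merged text):

 * @param specs Map of {@link PartitionSpec partition specs} for this table. Partition field
 *     names from these specs are used to build the hidden-path filter, i.e. to decide which
 *     directories under the base directory look like partition directories and should be
 *     descended into rather than skipped as hidden paths.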

* Recursively lists files in the specified directory that satisfy the given conditions. Use
* {@link PartitionAwareHiddenPathFilter} to filter out hidden paths.
*
* @param io File system interface supporting prefix operations
Member

Not a File System, a FileIO implementation.

Contributor Author

Ok

* @param io File system interface supporting prefix operations
* @param dir Base directory to start recursive listing
* @param specs Map of {@link PartitionSpec partition specs} for this table.
* @param filter Additional filter condition for files
Member

More explanation here: are these files that we ignore, or consume? In addition to what?

Contributor Author

Ok

* </ul>
*
* @param dir The starting directory path to traverse
* @param specs Map of {@link PartitionSpec partition specs} for this table.
Member

Same note as above about specs needing more detail.

Contributor Author

Ok

}

/**
* Recursively traverses the specified directory using Hadoop API to collect file paths that meet
Member

Suggested change
* Recursively traverses the specified directory using Hadoop API to collect file paths that meet
* Recursively traverses the specified directory using Hadoop FileSystem API to collect file paths that meet

* @param dir The starting directory path to traverse
* @param specs Map of {@link PartitionSpec partition specs} for this table.
* @param filter File filter condition, only files satisfying this condition will be collected
* @param conf Hadoop conf
Member

Hadoop configuration used to load the FileSystem

@pvary pvary merged commit 3a4215d into apache:main Aug 11, 2025
42 checks passed
@pvary
Contributor

pvary commented Aug 11, 2025

Merged to main.
Thanks for the PR @Guosmilesmile and @RussellSpitzer for the review!

@pvary
Contributor

pvary commented Aug 11, 2025

@Guosmilesmile: Please port the changes to the other supported Spark versions.
Thanks, Peter

@JeonDaehong
Contributor

Sounds great! I’ve been working on orphan file removal recently, so I’ll take this into account in my development. 😄

@Guosmilesmile Guosmilesmile deleted the delete_orphans_spark branch August 12, 2025 00:46
Guosmilesmile added a commit to Guosmilesmile/iceberg that referenced this pull request Aug 12, 2025
Guosmilesmile added a commit to Guosmilesmile/iceberg that referenced this pull request Aug 12, 2025
Guosmilesmile added a commit to Guosmilesmile/iceberg that referenced this pull request Aug 12, 2025
huaxingao pushed a commit that referenced this pull request Aug 12, 2025
