Conversation

@Guosmilesmile
Contributor

Flink may soon support DeleteOrphans (#13302), and some of that code can be reused from Spark.
The main goal of this PR is to refactor the commonly used code out of the Spark module and into core, so that Flink can reuse it.

@Guosmilesmile
Contributor Author

I have already created separate PRs for Spark and Core. I hope you can find some time to review the corresponding parts of the code. @pvary @RussellSpitzer @szehon-ho @liziyan-lzy

Thank you very much!
GuoYu

@pvary
Contributor

pvary commented Jul 1, 2025

Hi @Guosmilesmile,
This highlights what I missed during the review of the main PR: we need unit tests for the new API. Could you please create them?
Thanks, Peter

@Guosmilesmile
Contributor Author

@pvary I will submit the corresponding unit tests as soon as possible.

@Guosmilesmile Guosmilesmile force-pushed the delete_orphans_spark branch from a42064b to ea05aba on July 1, 2025 08:51

List<String> foundFiles = Lists.newArrayList();
Predicate<FileInfo> fileFilter = fileInfo -> fileInfo.location().endsWith(".txt");
FileSystemWalker.listDirRecursivelyWithFileIO(
Contributor

Could we make this test parametrized and run this with listDirRecursivelyWithHadoop too?
Like:

  @ParameterizedTest
  @ValueSource(booleans = {true, false})
  public void testPartitionAwareHiddenPathFilter(boolean hadoop) throws IOException {

Contributor Author

Ok, I have changed it.

Comment on lines +84 to +90
public boolean schemeMatch(FileURI another) {
return uriComponentMatch(scheme, another.getScheme());
}

public boolean authorityMatch(FileURI another) {
return uriComponentMatch(authority, another.getAuthority());
}
Contributor

Could we create a few small unit tests for this as well?
With or without schemes, matching and not matching, etc.

Contributor Author

Ok, I have done it.
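
For reference, a minimal sketch of what such tests could look like. The FileURI(URI) constructor used here is an assumption (see the constructor discussion further down this thread), and the expected behavior for URIs without a scheme depends on how uriComponentMatch treats missing components:

import static org.assertj.core.api.Assertions.assertThat;

import java.net.URI;
import org.junit.jupiter.api.Test;

public class TestFileURIMatching {

  @Test
  public void testSchemeMatch() {
    // Assumes FileURI can be constructed from a java.net.URI.
    FileURI s3File = new FileURI(URI.create("s3://bucket/data/file.txt"));
    FileURI s3Other = new FileURI(URI.create("s3://other-bucket/file.txt"));
    FileURI hdfsFile = new FileURI(URI.create("hdfs://namenode/data/file.txt"));

    assertThat(s3File.schemeMatch(s3Other)).isTrue();
    assertThat(s3File.schemeMatch(hdfsFile)).isFalse();
  }

  @Test
  public void testAuthorityMatch() {
    FileURI bucketA = new FileURI(URI.create("s3://bucket-a/file.txt"));
    FileURI bucketAOther = new FileURI(URI.create("s3://bucket-a/other.txt"));
    FileURI bucketB = new FileURI(URI.create("s3://bucket-b/file.txt"));

    assertThat(bucketA.authorityMatch(bucketAOther)).isTrue();
    assertThat(bucketA.authorityMatch(bucketB)).isFalse();
  }

  // Per the review comment, URIs with no scheme (e.g. "/data/file.txt") are
  // also worth a case; the expected result there is defined by uriComponentMatch.
}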

}

@Test
public void testListDirRecursivelyWithFileIO() {
Contributor

Could we add a test to verify the behavior of the FileIO prefix listing for the following scenario:
path_a: /base_path/normal_dir
path_b: /base_path/normal_dir_1
The test should ensure that when listing files using the FileIO prefix listing for path_a, files located under path_b are not included in the results.

Contributor Author

@Guosmilesmile Guosmilesmile Jul 7, 2025

Okay, I have added the corresponding unit test.
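
For context, the pitfall the requested test guards against: a plain string prefix match on /base_path/normal_dir also matches everything under the sibling /base_path/normal_dir_1. A self-contained sketch of the string logic involved (illustrative only, not the Iceberg implementation):

public class PrefixListingCheck {
  // Naive check: "/base_path/normal_dir_1/data.txt" starts with
  // "/base_path/normal_dir", so the sibling's file is wrongly included.
  static boolean naivelyUnderDir(String location, String dir) {
    return location.startsWith(dir);
  }

  // Stricter check: the directory prefix must be followed by a separator,
  // which excludes files under the sibling directory.
  static boolean strictlyUnderDir(String location, String dir) {
    String prefix = dir.endsWith("/") ? dir : dir + "/";
    return location.startsWith(prefix);
  }

  public static void main(String[] args) {
    String siblingFile = "/base_path/normal_dir_1/data.txt";
    String dir = "/base_path/normal_dir";
    System.out.println(naivelyUnderDir(siblingFile, dir));  // true (wrong)
    System.out.println(strictlyUnderDir(siblingFile, dir)); // false (right)
  }
}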

@Guosmilesmile
Contributor Author

Hey @RussellSpitzer @szehon-ho, if you have time, please take a moment to help review this.

Thank you very much!
GuoYu

@Guosmilesmile
Contributor Author

Gentle ping to reviewers @RussellSpitzer @szehon-ho


import java.net.URI;
import java.util.Map;
import org.apache.hadoop.fs.Path;
Member

Any chance we can leave Hadoop dependencies out of this class, since it's essentially just acting as a container class? For example, we could provide public FileURI(URI uri); then a caller can use new org.apache.hadoop.fs.Path(uriString) on its own side.

Just don't want to spread Hadoop classes around the project if possible.

Contributor Author

I modified the constructor to remove the Hadoop dependency from this class.
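
A rough sketch of the shape of that change (field names are illustrative; the real FileURI keeps whatever parsing and normalization it already had):

import java.net.URI;

public class FileURI {
  private final String scheme;
  private final String authority;
  private final String path;

  // Built from java.net.URI, so this container class needs no Hadoop imports.
  public FileURI(URI uri) {
    this.scheme = uri.getScheme();
    this.authority = uri.getAuthority();
    this.path = uri.getPath();
  }

  public String getScheme() {
    return scheme;
  }

  public String getAuthority() {
    return authority;
  }
}

A caller that starts from a location string can still convert on its own side, e.g. new FileURI(new org.apache.hadoop.fs.Path(uriString).toUri()), which keeps the Hadoop dependency with the caller.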


Iterable<FileInfo> files = io.listPrefix(listPath);
for (FileInfo file : files) {
Path path = new Path(file.location());
Member

We can save removing the Path dependency here for later :)

Contributor Author

Ok :)

Contributor Author

@RussellSpitzer Sorry for the delayed question. In the current FileSystemWalker, is it okay to keep the Hadoop dependency for now, as there isn't a suitable way to remove it?

Member

Yep!

List<String> foundFiles = Lists.newArrayList();
List<String> remainingDirs = Lists.newArrayList();
String path = basePath + "/normal_dir";
if (useHadoop) {
Member

This may just be me personally, but I would prefer we extract this branch into a private function to avoid repeating the complexity in every test.

I generally want each test to have as little setup as possible enumerated in the test body. Here, the actual method invocation is, I think, not that important, yet it takes up most of the method body.

Contributor Author

Right. I extracted this part into a separate method.
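
The extraction might look roughly like the following sketch, with the FileSystemWalker calls left as comments because the real signatures carry more parameters (io, specs, conf, ...):

// Hypothetical helper: hides the Hadoop-vs-FileIO branch so each test body
// reduces to a single call. The calls below are elided, not real signatures.
private void listDir(
    boolean useHadoop,
    String path,
    List<String> foundFiles,
    List<String> remainingDirs) {
  if (useHadoop) {
    // FileSystemWalker.listDirRecursivelyWithHadoop(path, ..., foundFiles, remainingDirs);
  } else {
    // FileSystemWalker.listDirRecursivelyWithFileIO(path, ..., foundFiles, remainingDirs);
  }
}

Each test body then becomes a one-liner such as listDir(useHadoop, basePath + "/normal_dir", foundFiles, remainingDirs).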

Member

@RussellSpitzer RussellSpitzer left a comment

Looks good to me. I have a few nits about testing and about trying to reduce the number of new classes with Hadoop deps, but I think this is generally a pretty clean way of pulling out the DeleteOrphanFiles code for general use.

*
* @param io File system interface supporting prefix operations
* @param dir Base directory to start recursive listing
* @param specs Map of {@link PartitionSpec partition specs} for this table.
Member

We should add a little more explanation than this. The specs object is used to build a filter that decides which directories to investigate, and we should probably note that here.

Contributor Author

Fixed.
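
One way the expanded description could read (suggested wording only, not the merged text):

 * @param specs Map of {@link PartitionSpec partition specs} for this table. Partition field
 *     names from these specs are used to build the hidden-path filter, i.e. to decide which
 *     directories under the base directory look like partition directories and should be
 *     descended into rather than skipped as hidden paths.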

* Recursively lists files in the specified directory that satisfy the given conditions. Use
* {@link PartitionAwareHiddenPathFilter} to filter out hidden paths.
*
* @param io File system interface supporting prefix operations
Member

Not a File System, a FileIO implementation.

Contributor Author

Ok

* @param io File system interface supporting prefix operations
* @param dir Base directory to start recursive listing
* @param specs Map of {@link PartitionSpec partition specs} for this table.
* @param filter Additional filter condition for files
Member

More explanation here: are these files that we ignore, or consume? In addition to what?

Contributor Author

Ok

* </ul>
*
* @param dir The starting directory path to traverse
* @param specs Map of {@link PartitionSpec partition specs} for this table.
Member

Same note as above about specs needing more detail.

Contributor Author

Ok

}

/**
* Recursively traverses the specified directory using Hadoop API to collect file paths that meet
Member

Suggested change
* Recursively traverses the specified directory using Hadoop API to collect file paths that meet
* Recursively traverses the specified directory using Hadoop FileSystem API to collect file paths that meet

* @param dir The starting directory path to traverse
* @param specs Map of {@link PartitionSpec partition specs} for this table.
* @param filter File filter condition, only files satisfying this condition will be collected
* @param conf Hadoop conf
Member

Hadoop configuration used to load the FileSystem

@pvary pvary merged commit 3a4215d into apache:main Aug 11, 2025
42 checks passed
@pvary
Contributor

pvary commented Aug 11, 2025

Merged to main.
Thanks for the PR @Guosmilesmile and @RussellSpitzer for the review!

@pvary
Contributor

pvary commented Aug 11, 2025

@Guosmilesmile: Please port the changes to the other supported Spark versions.
Thanks, Peter

@JeonDaehong
Contributor

Sounds great! I’ve been working on orphan file removal recently, so I’ll take this into account in my development. 😄

@Guosmilesmile Guosmilesmile deleted the delete_orphans_spark branch August 12, 2025 00:46
Guosmilesmile added a commit to Guosmilesmile/iceberg that referenced this pull request Aug 12, 2025
Guosmilesmile added a commit to Guosmilesmile/iceberg that referenced this pull request Aug 12, 2025
Guosmilesmile added a commit to Guosmilesmile/iceberg that referenced this pull request Aug 12, 2025
huaxingao pushed a commit that referenced this pull request Aug 12, 2025
