Spark,Core: Refactor Delete OrphanFiles by moving common code from Spark to core #13429
Conversation

I have already created separate PRs for Spark and Core. I hope you can find some time to review the corresponding parts of the code. @pvary @RussellSpitzer @szehon-ho @liziyan-lzy Thank you very much!

Hi @Guosmilesmile,

@pvary I will submit the corresponding UT as soon as possible. |

Force-pushed from a42064b to ea05aba

List<String> foundFiles = Lists.newArrayList();
Predicate<FileInfo> fileFilter = fileInfo -> fileInfo.location().endsWith(".txt");
FileSystemWalker.listDirRecursivelyWithFileIO(
Could we make this test parameterized and run it with listDirRecursivelyWithHadoop too?
Like:
@ParameterizedTest
@ValueSource(booleans = {true, false})
public void testPartitionAwareHiddenPathFilter(boolean hadoop) throws IOException {
Ok, I have changed it.
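
For reference, a minimal sketch of the parameterized shape, assuming JUnit 5 and AssertJ; listFiles(...) is a hypothetical helper that dispatches to listDirRecursivelyWithFileIO or listDirRecursivelyWithHadoop based on the flag (a sketch of such a helper appears later in this conversation), and tableLocation() is an assumed test fixture:

@ParameterizedTest
@ValueSource(booleans = {true, false})
public void testPartitionAwareHiddenPathFilter(boolean useHadoop) throws IOException {
  // listFiles is a hypothetical helper wrapping the FileIO/Hadoop listing call
  List<String> foundFiles = listFiles(useHadoop, tableLocation());
  // hidden directories (other than partition directories) should have been filtered out
  assertThat(foundFiles).noneMatch(file -> file.contains("/.hidden/"));
}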
public boolean schemeMatch(FileURI another) {
  return uriComponentMatch(scheme, another.getScheme());
}

public boolean authorityMatch(FileURI another) {
  return uriComponentMatch(authority, another.getAuthority());
}
Could we create a few small unit tests for this as well? With or without schemes, matching and not matching, etc.
Ok, I have done it.
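
For example, a small sketch of what such tests might cover, assuming AssertJ and a FileURI(URI) constructor like the one discussed later in this conversation (the real constructor may take additional arguments such as scheme/authority equivalence maps):

@Test
public void testSchemeAndAuthorityMatch() {
  // constructor shape is an assumption based on the review discussion in this PR
  FileURI s3One = new FileURI(URI.create("s3://bucket/data/file-a.parquet"));
  FileURI s3Two = new FileURI(URI.create("s3://bucket/data/file-b.parquet"));
  FileURI hdfs = new FileURI(URI.create("hdfs://namenode:8020/data/file-a.parquet"));

  assertThat(s3One.schemeMatch(s3Two)).isTrue();     // same scheme
  assertThat(s3One.schemeMatch(hdfs)).isFalse();     // s3 vs hdfs
  assertThat(s3One.authorityMatch(s3Two)).isTrue();  // same bucket authority
  assertThat(s3One.authorityMatch(hdfs)).isFalse();  // bucket vs namenode authority
}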
}

@Test
public void testListDirRecursivelyWithFileIO() {
Could we add a test to verify the behavior of the FileIO prefix listing for the following scenario:
path_a: /base_path/normal_dir
path_b: /base_path/normal_dir_1
The test should ensure that when listing files using the FileIO prefix listing for path_a, files located under path_b are not included in the results.
Okay, I have added the corresponding UT.
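
Roughly, that scenario could look like the sketch below; createTestFile and listWithFileIO are hypothetical helpers (the latter wraps FileSystemWalker.listDirRecursivelyWithFileIO, whose full argument list is elided here), and basePath is an assumed test fixture:

@Test
public void testPrefixListingDoesNotIncludeSiblingDirs() throws IOException {
  String pathA = basePath + "/normal_dir";
  String pathB = basePath + "/normal_dir_1";
  createTestFile(pathA + "/data-a.txt");  // hypothetical helper that writes a small file
  createTestFile(pathB + "/data-b.txt");

  // hypothetical wrapper around FileSystemWalker.listDirRecursivelyWithFileIO for pathA
  List<String> foundFiles = listWithFileIO(pathA);

  assertThat(foundFiles).anyMatch(file -> file.endsWith("/normal_dir/data-a.txt"));
  assertThat(foundFiles).noneMatch(file -> file.contains("/normal_dir_1/"));
}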

Hey @RussellSpitzer @szehon-ho, if you have time, please take a moment to help review this. Thank you very much!

Gentle ping to reviewers @RussellSpitzer @szehon-ho

import java.net.URI;
import java.util.Map;
import org.apache.hadoop.fs.Path;
Any chance we can leave Hadoop dependencies out of this class since it's essentially just acting as a container class? For example we could provide public FileUri(URI uri), then a caller can use new org.apache.hadoop.fs.Path(uriString).
Just don't want to spread Hadoop classes around the project if possible.
I modified the constructor to remove the Hadoop dependency from this class.
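
For illustration, the constructor could be reduced to something along these lines (field names follow the schemeMatch/authorityMatch snippets above; the path field and any extra arguments, such as scheme/authority equivalence maps, are assumptions):

public FileURI(URI uri) {
  this.scheme = uri.getScheme();
  this.authority = uri.getAuthority();
  this.path = uri.getPath();  // path field is assumed; only scheme/authority appear in the snippets above
}

A caller that still deals in Hadoop paths can then keep that dependency on its own side, e.g. new FileURI(new org.apache.hadoop.fs.Path(location).toUri()).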

Iterable<FileInfo> files = io.listPrefix(listPath);
for (FileInfo file : files) {
  Path path = new Path(file.location());
We can save removing the Path dependency here for later :)
Ok :)
@RussellSpitzer Sorry for the delayed inquiry. In the current FileSystemWalker, is it okay to keep the Hadoop dependency for now, as there isn’t a suitable way to remove it?
Yep!
List<String> foundFiles = Lists.newArrayList();
List<String> remainingDirs = Lists.newArrayList();
String path = basePath + "/normal_dir";
if (useHadoop) {
This may just be me personally but I would prefer we extract out this branch into a private function to avoid having the complexity repeated in the tests?
I generally want each test to have as little setup as possible enumerated in the test body. Here the actual method execution I think is not that important but it takes up most of the method body.
Right. I extracted this part into a separate method.
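
For example, a sketch of such a helper; listWithHadoop and listWithFileIO are hypothetical private wrappers around FileSystemWalker.listDirRecursivelyWithHadoop and listDirRecursivelyWithFileIO, whose full argument lists (specs, filters, configuration, output collections) are elided here:

private List<String> listFiles(boolean useHadoop, String dir) {
  List<String> foundFiles = Lists.newArrayList();
  if (useHadoop) {
    listWithHadoop(dir, foundFiles);  // wraps FileSystemWalker.listDirRecursivelyWithHadoop
  } else {
    listWithFileIO(dir, foundFiles);  // wraps FileSystemWalker.listDirRecursivelyWithFileIO
  }
  return foundFiles;
}

Each test body can then call listFiles(useHadoop, path) and keep only its own setup and assertions.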
RussellSpitzer left a comment
Looks good to me. I have a few nits about testing and about trying to reduce the number of new classes with Hadoop deps, but I think this is generally a pretty clean way of pulling out the DeleteOrphanFiles code for general use.
 *
 * @param io File system interface supporting prefix operations
 * @param dir Base directory to start recursive listing
 * @param specs Map of {@link PartitionSpec partition specs} for this table.
We should do a little more explanation than this. The specs object is used to create a filter to decide which directories to investigate and we should probably note that here.
Fixed.
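
For example, the description could be expanded roughly like this (wording is only a suggestion):

 * @param specs Map of {@link PartitionSpec partition specs} for this table; used to build the
 *     {@link PartitionAwareHiddenPathFilter} that decides which otherwise-hidden directories
 *     (for example partition directories whose names start with an underscore or a dot) are
 *     still traversed during the listing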
 * Recursively lists files in the specified directory that satisfy the given conditions. Use
 * {@link PartitionAwareHiddenPathFilter} to filter out hidden paths.
 *
 * @param io File system interface supporting prefix operations
Not a File System, a FileIO implementation.
Ok
 * @param io File system interface supporting prefix operations
 * @param dir Base directory to start recursive listing
 * @param specs Map of {@link PartitionSpec partition specs} for this table.
 * @param filter Additional filter condition for files
More explanation here: are these files that we ignore? Or consume? In addition to what?
ok
 * </ul>
 *
 * @param dir The starting directory path to traverse
 * @param specs Map of {@link PartitionSpec partition specs} for this table.
Same note as above for Spec needing more details
Ok
}

/**
 * Recursively traverses the specified directory using Hadoop API to collect file paths that meet
Suggested change:
- * Recursively traverses the specified directory using Hadoop API to collect file paths that meet
+ * Recursively traverses the specified directory using Hadoop FileSystem API to collect file paths that meet
 * @param dir The starting directory path to traverse
 * @param specs Map of {@link PartitionSpec partition specs} for this table.
 * @param filter File filter condition, only files satisfying this condition will be collected
 * @param conf Hadoop conf
Hadoop configuration used to load the FileSystem

Merged to main.

@Guosmilesmile: Please port the changes to the other supported Spark versions.

Sounds great! I’ve been working on orphan file removal recently, so I’ll take this into account in my development. 😄
Currently, Flink may support DeleteOrphans (#13302), and some of the code can be reused from Spark.
The main goal of this PR is to refactor the commonly used code out of the Spark module into core so that Flink can reuse it.