Conversation

@NikitaMatskevich (Contributor) commented Jul 3, 2025

RewriteTablePath uses Iceberg's native table FileIO for everything except writing the "file list" file. For that step it currently relies on a Spark writer, which in turn uses Hadoop. This can become a limitation in some cloud environments: for instance, Hadoop does not support modern Azure authentication strategies, and thus forces unwanted auth paths even where the cloud provider itself does not require them. It also requires the user to manage two configurations, one for table IO and one for Spark IO. We suggest using the native table IO instead.

See the corresponding issue:
#13458

@szehon-ho (Member) left a comment:

the idea sounds fine, had a question


private FileIO fetchResolvingFileIO() {
  FileIO tableIO = table.io();
  ResolvingFileIO resolvingIO = (ResolvingFileIO) CatalogUtil.loadFileIO(
@szehon-ho (Member):

Why do we need this? The rest of the code just gets the IO from the table, and we can configure all of that from the catalog/table?

@NikitaMatskevich (Contributor, Author):

This is for the (rare) edge case where the stagingDir location argument is specified and points to another filesystem. For example, the table might be getting migrated from Azure to AWS, and stagingDir might be s3://... I know this is not the most straightforward use case for this action, but the old Hadoop code allowed it, so using table.io() directly would have been a potential breaking change.
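For context, ResolvingFileIO picks a concrete FileIO implementation per path based on its scheme, which is what would let a staging dir live on a different filesystem than the table. Below is a simplified, self-contained sketch of that routing idea; the real class instantiates and caches delegate FileIO instances, and while the class names below mirror Iceberg's actual implementations, treat the exact mapping as an assumption:

```java
import java.net.URI;

public class SchemeRouting {
    // Map a location's scheme to a FileIO implementation class name, the way
    // ResolvingFileIO conceptually chooses a delegate. Simplified stand-in,
    // not Iceberg's actual implementation.
    static String ioImplFor(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null) {
            return "org.apache.iceberg.hadoop.HadoopFileIO"; // fallback
        }
        switch (scheme) {
            case "s3":
            case "s3a":
            case "s3n":
                return "org.apache.iceberg.aws.s3.S3FileIO";
            case "gs":
                return "org.apache.iceberg.gcp.gcs.GCSFileIO";
            case "abfs":
            case "abfss":
                return "org.apache.iceberg.azure.adlsv2.ADLSFileIO";
            default:
                return "org.apache.iceberg.hadoop.HadoopFileIO";
        }
    }

    public static void main(String[] args) {
        // A staging dir on S3 while the table lives on Azure would still resolve.
        System.out.println(ioImplFor("s3://bucket/staging"));
        System.out.println(ioImplFor("abfss://container@account/warehouse"));
    }
}
```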

@NikitaMatskevich (Contributor, Author) commented Jul 3, 2025:

After looking at the code again, I realized that stagingDir couldn't have been used with another filesystem. I will simplify this part as you suggested. Thank you for the review!

@NikitaMatskevich changed the title from "Spark: Use ResolvingFileIO to save file list in RewriteTablePath" to "Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath" on Jul 3, 2025
-      .mode(SaveMode.Overwrite)
-      .format("csv")
-      .save(fileListPath);
+    OutputFile fileList = table.io().newOutputFile(fileListPath);
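The replaced Spark call wrote the copy plan as a headerless CSV of (sourcePath, targetPath) rows, so the new OutputFile path has to materialize the same layout through the table's FileIO. A pure-stdlib sketch of that serialization (class and method names are illustrative, not the PR's actual code, and a Map stands in for the PR's Set of Pairs):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileListCsv {
    // Render (source, target) pairs as one comma-separated row per file,
    // matching the two-column CSV layout the Spark writer produced. In the
    // PR, these bytes would be streamed to table.io().newOutputFile(path).
    static String toCsv(Map<String, String> filesToMove) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> pair : filesToMove.entrySet()) {
            sb.append(pair.getKey()).append(',').append(pair.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> plan = new LinkedHashMap<>();
        plan.put("s3://old-bucket/data/f1.parquet", "s3://new-bucket/data/f1.parquet");
        System.out.print(toCsv(plan));
    }
}
```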
Contributor:

[doubt] The staging location defaults to the table metadata path, but can be set to anything?

If that's the case:

  1. What if the table's FileIO didn't have the credentials to write to the staging directory but Spark did? Would this cause failures?
  2. What if we were using the local disk to write this, but the table's FileIO was pointing to an object store? Would those workloads now fail?

@NikitaMatskevich (Contributor, Author):

Hi, thank you for the review!

  1. As I checked after the review, this case is impossible: the action would fail earlier, because it already uses the table IO to create a copy of the metadata, and those metadata file copies reside in the staging dir as well.
  2. Impossible for the same reason.

You can find examples in the method rewriteVersionFile or similar.

@singhpk234 (Contributor) commented Jul 3, 2025:

Thanks for checking. It seems we might then need to update the docs (https://iceberg.apache.org/docs/nightly/spark-procedures/#rewrite_table_path) on the restriction of the staging location, though that's not part of this effort!

@kevinjqliu (Contributor) left a comment:

Thanks for adding this! Generally LGTM.

I added a few comments about testing and about making the same changes to the other versions of Spark.

return saveFileList(copyPlan);
}

private String saveFileList(Set<Pair<String, String>> filesToMove) {
Contributor:

This exists in 3.4 and 4.0 as well; should we also make changes to those files?
https://github.com/search?q=repo%3Aapache%2Ficeberg%20private%20String%20saveFileList&type=code

@szehon-ho (Member) commented Jul 7, 2025:

Good catch. Yes, we typically do the latest version first (Spark 4.0); then it's optional whether we do all the Spark versions in one PR or in separate ones.

Member:

@NikitaMatskevich can you take a look if it's possible?

@NikitaMatskevich (Contributor, Author) commented Jul 8, 2025:

Could you share your thoughts on the nature of the test you would like to see for this change, @kevinjqliu @szehon-ho?

If you want me to reproduce the real-life issue we had, I would need to implement a setup in the iceberg-spark test module where the native Iceberg FileIO has sufficient permissions to the filesystem and Hadoop does not. It would have to be something very close to the Azure setup with managed identities that we had, or perhaps another complex edge case I am not aware of. Do you think such an effort would be justified for this tiny PR? It does not introduce any important changes, and using the native table IO in this context cannot really be bug-prone by itself, because the modified class already uses exactly the same FileIO instance to run the other steps of the action.

If you are concerned with the test coverage of this method: TestRewriteTablePathsAction.java contains many good tests, and every one of them passes through the step of saving a file list somewhere and then reading the files from that list to copy them afterwards. I can see that the file list is created correctly on my machine, and when I compare it to the Spark-generated one it has exactly the same structure and file contents.

Contributor:

> every one of them passes through the step of saving a file list somewhere and reading the files from this file list to copy them afterwards

Great! That was my original motivation for the test: to make sure the file list is properly materialized to storage. 👍

@kevinjqliu (Contributor):

The following files had format violations:

CI failed on formatting

@NikitaMatskevich force-pushed the nmh/use-resolving-file-io-in-rewrite-table-path branch from 3b6d49a to 457c9be on July 8, 2025 15:22
@kevinjqliu (Contributor) left a comment:

LGTM!

@kevinjqliu (Contributor):

@szehon-ho since we made changes to Spark 3.4, 3.5, and 4.0, are there any special deployment steps we need to go through?

@szehon-ho (Member):

@kevinjqliu not as far as I know; I guess it will get bundled correctly in the appropriate spark-iceberg jar.

@NikitaMatskevich thanks it looks good to me

@szehon-ho merged commit 3f22677 into apache:main on Jul 9, 2025
27 checks passed
@szehon-ho (Member):

thanks @NikitaMatskevich , also @kevinjqliu , @singhpk234 for thoughtful reviews!
