Conversation

@NikitaMatskevich (Contributor) commented Jul 3, 2025

RewriteTablePath uses Iceberg's native table FileIO for everything except writing the "file list" file. For that step it currently relies on a Spark writer, which in turn uses Hadoop. This can become a limitation in some cloud environments: for instance, Hadoop does not support modern Azure authentication strategies, and thus forces unwanted auth paths even where the cloud provider itself does not require them. It also requires the user to manage two configurations, one for table IO and one for Spark IO. We suggest using the native table IO instead.

See the corresponding issue:
#13458

@szehon-ho (Member) left a comment:

the idea sounds fine, had a question


private FileIO fetchResolvingFileIO() {
  FileIO tableIO = table.io();
  ResolvingFileIO resolvingIO = (ResolvingFileIO) CatalogUtil.loadFileIO(
@szehon-ho (Member):

Why do we need this? The rest of the code just gets the IO from the table, and we can configure all of that from the catalog/table?

@NikitaMatskevich (Contributor, Author):

This is for the (rare) edge case where the stagingDir location argument is specified and points to another filesystem. For example, the table might be getting migrated from Azure to AWS, and stagingDir might be s3://... I know this is not the most straightforward use case for this action, but the old Hadoop code allowed it, so using table.io() directly would have been a potential breaking change.
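For context, ResolvingFileIO picks a concrete FileIO implementation per path based on its scheme, which is what would let a staging dir live on a different filesystem than the table. Below is a simplified, self-contained sketch of that routing idea; the real class instantiates and caches delegate FileIO instances, and while the class names below mirror Iceberg's actual implementations, treat the exact mapping as an assumption:

```java
import java.net.URI;

public class SchemeRouting {
    // Map a location's scheme to a FileIO implementation class name, the way
    // ResolvingFileIO conceptually chooses a delegate. Simplified stand-in,
    // not Iceberg's actual implementation.
    static String ioImplFor(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null) {
            return "org.apache.iceberg.hadoop.HadoopFileIO"; // fallback
        }
        switch (scheme) {
            case "s3":
            case "s3a":
            case "s3n":
                return "org.apache.iceberg.aws.s3.S3FileIO";
            case "gs":
                return "org.apache.iceberg.gcp.gcs.GCSFileIO";
            case "abfs":
            case "abfss":
                return "org.apache.iceberg.azure.adlsv2.ADLSFileIO";
            default:
                return "org.apache.iceberg.hadoop.HadoopFileIO";
        }
    }

    public static void main(String[] args) {
        // A staging dir on S3 while the table lives on Azure would still resolve.
        System.out.println(ioImplFor("s3://bucket/staging"));
        System.out.println(ioImplFor("abfss://container@account/warehouse"));
    }
}
```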

@NikitaMatskevich (Contributor, Author) commented Jul 3, 2025:

After looking at the code again, I realized that stagingDir couldn't have been used with another filesystem. I will simplify this part as you suggested. Thank you for the review!

@NikitaMatskevich changed the title from "Spark: Use ResolvingFileIO to save file list in RewriteTablePath" to "Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath" on Jul 3, 2025
-      .mode(SaveMode.Overwrite)
-      .format("csv")
-      .save(fileListPath);
+    OutputFile fileList = table.io().newOutputFile(fileListPath);
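The replaced Spark call wrote the copy plan as a headerless CSV of (sourcePath, targetPath) rows, so the new OutputFile path has to materialize the same layout through the table's FileIO. A pure-stdlib sketch of that serialization (class and method names are illustrative, not the PR's actual code, and a Map stands in for the PR's Set of Pairs):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileListCsv {
    // Render (source, target) pairs as one comma-separated row per file,
    // matching the two-column CSV layout the Spark writer produced. In the
    // PR, these bytes would be streamed to table.io().newOutputFile(path).
    static String toCsv(Map<String, String> filesToMove) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> pair : filesToMove.entrySet()) {
            sb.append(pair.getKey()).append(',').append(pair.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> plan = new LinkedHashMap<>();
        plan.put("s3://old-bucket/data/f1.parquet", "s3://new-bucket/data/f1.parquet");
        System.out.print(toCsv(plan));
    }
}
```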
Contributor:

[doubt] The staging location defaults to the table metadata path, but can be set to anything?

If that's the case:

  1. What if the table's FileIO didn't have the credentials to write to the staging directory but Spark did? Would this cause failures?
  2. What if we were using the local disk to write this, but the table's FileIO was pointing to an object store? Would those workloads now fail?

@NikitaMatskevich (Contributor, Author):

Hi, thank you for the review!

  1. As I checked after the review, this case is impossible: the action would fail earlier, because it already uses the table IO to create a copy of the metadata, and those metadata file copies reside in the staging dir as well.
  2. Impossible for the same reason.

You can find examples in the method rewriteVersionFile or similar.

@singhpk234 (Contributor) commented Jul 3, 2025:

Thanks for checking. It seems we might then need to update the docs (https://iceberg.apache.org/docs/nightly/spark-procedures/#rewrite_table_path) on the restriction of the staging location, though that's not part of this effort!

@kevinjqliu (Contributor) left a comment:

Thanks for adding this! Generally LGTM.

I added a few comments about testing and about making the same changes to the other versions of Spark.

return saveFileList(copyPlan);
}

private String saveFileList(Set<Pair<String, String>> filesToMove) {
Contributor:

This exists in 3.4 and 4.0 as well; should we also make changes to those files?
https://github.com/search?q=repo%3Aapache%2Ficeberg%20private%20String%20saveFileList&type=code

@szehon-ho (Member) commented Jul 7, 2025:

Good catch. Yes, we typically do the latest version first (Spark 4.0); then it's optional whether we do all the Spark versions in one PR or in separate ones.

Member:

@NikitaMatskevich can you take a look if it's possible?

@NikitaMatskevich (Contributor, Author) commented Jul 8, 2025:

Could you share your thoughts on the nature of the test you would like to see for this change, @kevinjqliu @szehon-ho?

If you want me to reproduce the real-life issue we had, I would need to implement a setup in the iceberg-spark test module where the native Iceberg FileIO has sufficient permissions to the filesystem and Hadoop does not. It would have to be something very close to the Azure setup with managed identities that we had, or perhaps another complex edge case I am not aware of. Do you think such an effort would be justified for this tiny PR? It does not introduce any important changes, and using the native table IO in this context cannot really be bug-prone by itself, because the modified class already uses exactly the same FileIO instance to run the other steps of the action.

If you are concerned with the test coverage of this method: TestRewriteTablePathsAction.java contains many good tests, and every one of them passes through the step of saving a file list somewhere and then reading the files from that list to copy them afterwards. I can see that the file list is created correctly on my machine, and when I compare it to the Spark-generated one it has exactly the same structure and file contents.

Contributor:

> every one of them passes through the step of saving a file list somewhere and reading the files from this file list to copy them afterwards

Great! That was my original motivation for the test: to make sure the file list is properly materialized to storage. 👍

@kevinjqliu (Contributor):

The following files had format violations:

CI failed on formatting

@NikitaMatskevich force-pushed the nmh/use-resolving-file-io-in-rewrite-table-path branch from 3b6d49a to 457c9be on July 8, 2025 15:22
@kevinjqliu (Contributor) left a comment:

LGTM!

@kevinjqliu (Contributor):

@szehon-ho since we made changes to Spark 3.4, 3.5, and 4.0, are there any special deployment steps we need to go through?

@szehon-ho (Member):

@kevinjqliu not as far as I know; I guess it will get bundled correctly in the appropriate spark-iceberg jar.

@NikitaMatskevich thanks it looks good to me

@szehon-ho merged commit 3f22677 into apache:main on Jul 9, 2025
27 checks passed
@szehon-ho (Member):

thanks @NikitaMatskevich , also @kevinjqliu , @singhpk234 for thoughtful reviews!
