Spark: Support parallel delete files when abort #4704
Conversation
// Isolation Level for DataFrame calls. Currently supported by overwritePartitions
public static final String ISOLATION_LEVEL = "isolation-level";

public static final String DELETE_FILES_PARALLEL_WHEN_ABORT = "delete-files-parallel-when-abort";
[minor] missing comment for this property
private void abort(WriterCommitMessage[] messages) {
  Map<String, String> props = table.properties();
  Tasks.foreach(files(messages))
      .executeWith(writeConf.deleteFilesParallelWhenAbort() ? ThreadPools.getWorkerPool() : null)
[question] Does it make sense to give it its own dedicated worker pool? Your thoughts?
I'd probably use the worker pool for now. No need to extend it until someone needs it.
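The pattern under discussion, `Tasks.foreach(...).executeWith(pool)`, runs each item on the supplied pool when one is given and falls back to serial execution on the calling thread when it is null. A minimal, self-contained sketch of that behavior (not Iceberg's actual `Tasks` API; `runEach` and `deleteFile` are illustrative names) could look like:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelAbortSketch {
  static final AtomicInteger deleted = new AtomicInteger();

  // Stand-in for table.io().deleteFile(path); here it only counts calls.
  static void deleteFile(String path) {
    deleted.incrementAndGet();
  }

  // When pool is null, delete serially; otherwise fan out and wait for all.
  static void runEach(List<String> files, ExecutorService pool) {
    if (pool == null) {
      files.forEach(ParallelAbortSketch::deleteFile);
    } else {
      CompletableFuture
          .allOf(files.stream()
              .map(f -> CompletableFuture.runAsync(() -> deleteFile(f), pool))
              .toArray(CompletableFuture[]::new))
          .join(); // block until every delete has finished
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    runEach(List.of("a.parquet", "b.parquet", "c.parquet"), pool);
    pool.shutdown();
    System.out.println("deleted=" + deleted.get()); // deleted=3
  }
}
```

Reusing a shared worker pool, as suggested here, avoids the cost and configuration burden of a dedicated pool until there is a concrete need for one.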
public static final String CHECK_ORDERING = "spark.sql.iceberg.check-ordering";
public static final boolean CHECK_ORDERING_DEFAULT = true;

// Control whether to parallel delete files when abort
[nit]
- // Control whether to parallel delete files when abort
+ // Controls whether to parallel delete files when abort
public static final boolean CHECK_ORDERING_DEFAULT = true;

// Control whether to parallel delete files when abort
public static final String DELETE_FILES_PARALLEL_WHEN_ABORT = "spark.sql.iceberg.delete-files-parallel-when-abort";
The name "delete-files-parallel-when-abort" is very long. How about "parallel-abort-enabled"?
Does this even need to be an option? Why would anyone turn it off?
@rdblue I removed the option. Previously I wanted the ability to keep the behavior as before.
Force-pushed from b3e062e to 9a79e00
Thanks, @singhpk234 @rdblue, for the review. The code has been updated; please take another look.
if (cleanupOnAbort) {
  Map<String, String> props = table.properties();
  Tasks.foreach(files(messages))
      .executeWith(ThreadPools.getWorkerPool())
@RussellSpitzer, what do you think about this change?
Seems fine to me, although I really wish we had a more discrete pool for things like this. I find that our worker pool is hard to configure for end users.
Should we add a Spark-managed pool? We have the ability to plug in the pool used in most places now, thanks to @yittg.
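The dedicated pool floated above could be a small, named daemon-thread pool that users size independently of the shared worker pool. A hedged sketch (the class, method, and sizing property names are hypothetical, not Iceberg or Spark APIs):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class AbortPools {
  // Builds a fixed-size pool whose threads carry a recognizable name prefix,
  // which helps attribute activity in thread dumps.
  private static ExecutorService newNamedPool(String prefix, int size) {
    AtomicInteger count = new AtomicInteger();
    ThreadFactory factory = runnable -> {
      Thread t = new Thread(runnable, prefix + "-" + count.getAndIncrement());
      t.setDaemon(true); // don't block JVM shutdown on stray cleanup tasks
      return t;
    };
    return Executors.newFixedThreadPool(size, factory);
  }

  // Hypothetical dedicated pool for abort cleanup; size could come from a
  // config property rather than the shared worker-pool setting.
  public static ExecutorService abortCleanupPool(int size) {
    return newNamedPool("iceberg-abort-cleanup", size);
  }
}
```

A separate pool like this would let abort cleanup be tuned without touching the sizing of the general-purpose worker pool, which is the configurability concern raised above.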
Thanks, @ConeyLiu!
Thanks all for the review.
This patch adds support for deleting files in parallel when a commit is aborted.