Spark 3.4: Add backup table name support for Migrate procedure #8227
| throw new UnsupportedOperationException("Dropping a backup is not supported"); | ||
| } | ||
|
|
||
| default MigrateTable withBackupTableName(String tableName) { |
I think the doc is required.
Thank you. Once the commit is merged, I will add the doc. Or if I should add the doc along with this commit please let me know.
I mean the document for this method.
Sure, let me add it to the method.
Added the doc in commit 3bd716a.
Can we add the new property to spark-procedures.md to document it?
Of course. I added a new row for the backup_table_name argument as a draft. If there's anything I need to add or change, please let me know.
| } | ||
|
|
||
| @Override | ||
| public MigrateTableSparkAction withBackupTableName(String tableName) { |
We could implement Spark 3.4 for this PR and do a backport for other versions.
Sure. I reverted the Spark 3.3 changes to their previous state in the latest commit.
|
|
||
| @Test | ||
| public void testMigrateWithBackupTableName() throws IOException { | ||
| Assume.assumeTrue(catalogName.equals("spark_catalog")); |
Are there any reasons to skip spark_catalog?
This part checks whether the catalog is spark_catalog. The test runs only with spark_catalog, because the migrate fails otherwise. Do you mean this part should be removed?
| private final StagingTableCatalog destCatalog; | ||
| private final Identifier destTableIdent; | ||
| private final Identifier backupIdent; | ||
| private Identifier backupIdent; |
Put those non-final fields together?
I believe you mean that the final field should be kept and a non-final field should be newly added.
This backupIdent field is referenced only within this class, specifically in doExecute and in the related methods it calls, such as rename, restore, and drop. Therefore, to keep the field final while making the backup table name flexible, I added a method parameter to each of those methods and process the table name in doExecute.
I think @jackye1995 @amogh-jahagirdar @singhpk234 have more knowledge about this.
| Assert.assertEquals("Should have added one file", 1L, result); | ||
|
|
||
| String dbName = tableName.split("\\.")[0]; | ||
| Assert.assertTrue(spark.catalog().tableExists(dbName + "." + backupTableName)); |
Please use AssertJ instead. You can refer to the contributing guide: https://iceberg.apache.org/contribute/#assertj
Thanks for the advice. I replaced the assertions in the test with AssertJ.
… in the migrate and replace tests with AssertJ based on the comments
Let me add my thoughts. The ability to specify the backup table name is necessary to make the procedure more flexible and to avoid name conflicts. That is why I submitted this PR.
| } | ||
|
|
||
| /** | ||
| * Sets a table name for the backup of the original table |
Minor: Missing . at the end of the sentence?
Sorry about that. I will add the . at the end of the sentence in the next commit.
|
|
||
| private boolean dropBackup = false; | ||
|
|
||
| private String backupTableName = ""; |
What about placing this var next to dropBackup? Both of them are non-final variables that can be overridden.
Also, what about just making Identifier backupIdent non-final but keeping the type and the initialization in the constructor? We can construct an identifier in withBackupTableName. That way, we should be able to reduce the amount of changes.
Thanks for the suggestion. It makes sense to me. Based on this comment, I updated the code as follows:
- reverted the backupIdent variable and its initialization, and made the variable non-final
- added the table name change logic to the withBackupTableName method that is newly added in this PR
- removed the backupTableName variable and the logic that updated it, as a consequence of the two changes above
|
|
||
| String backupName; | ||
| if (backupTableName.isEmpty()) { | ||
| backupName = this.destTableIdent.name() + BACKUP_SUFFIX; |
Minor: We usually don't use this. when accessing fields, only while setting.
Thanks for the advice. Will remove it.
…eName method and add typos and parameter-call
|
Thanks for the review. I pushed a commit that reflects the comments. I would appreciate it if you could review the new one, @aokolnychyi.
|
|
||
| @Override | ||
| public MigrateTableSparkAction withBackupTableName(String tableName) { | ||
| if (!tableName.isEmpty()) { |
This behavior for checking if tableName is empty seems a bit weird to me. If this method is called, I assume someone wants to override the backup table name. I think we should just go ahead and set backupIdent.
I know the dest and source identifiers are the same, but it would be more readable to use sourceTableIdent() from the parent class in this case. We are backing up the source table, not the destination.
Thanks for the suggestion. I agree with it. I updated the code as follows:
- removed the check for an empty table name, and set the backup table name directly on backupIdent
- used sourceTableIdent when updating backupIdent
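The approach agreed on in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Iceberg code: TableIdent and MigrateActionSketch are hypothetical stand-ins for Spark's Identifier and for MigrateTableSparkAction, and only the field layout and the override logic are modeled.

```java
import java.util.Objects;

// Hypothetical stand-in for a catalog identifier (namespace + table name).
final class TableIdent {
  private final String namespace;
  private final String name;

  TableIdent(String namespace, String name) {
    this.namespace = Objects.requireNonNull(namespace, "namespace");
    this.name = Objects.requireNonNull(name, "name");
  }

  String namespace() {
    return namespace;
  }

  String name() {
    return name;
  }

  @Override
  public String toString() {
    return namespace + "." + name;
  }
}

// Hypothetical, simplified action; it only models how the backup identifier is chosen.
final class MigrateActionSketch {
  private static final String BACKUP_SUFFIX = "_BACKUP_";

  private final TableIdent sourceTableIdent;

  // Non-final fields that callers may override, kept next to each other.
  private boolean dropBackup = false;
  private TableIdent backupIdent;

  MigrateActionSketch(TableIdent sourceTableIdent) {
    this.sourceTableIdent = sourceTableIdent;
    // The default is initialized in the constructor: <TABLE_NAME>_BACKUP_
    this.backupIdent =
        new TableIdent(sourceTableIdent.namespace(), sourceTableIdent.name() + BACKUP_SUFFIX);
  }

  MigrateActionSketch dropBackup() {
    this.dropBackup = true;
    return this;
  }

  // No empty-string check: calling this method means the caller wants an override.
  MigrateActionSketch withBackupTableName(String tableName) {
    this.backupIdent = new TableIdent(sourceTableIdent.namespace(), tableName);
    return this;
  }

  TableIdent backupIdent() {
    return backupIdent;
  }

  boolean shouldDropBackup() {
    return dropBackup;
  }
}

class BackupIdentDemo {
  public static void main(String[] args) {
    TableIdent source = new TableIdent("db", "sample");
    // Default backup identifier
    System.out.println(new MigrateActionSketch(source).backupIdent());
    // Overridden backup identifier, built from the source table's namespace
    System.out.println(
        new MigrateActionSketch(source).withBackupTableName("sample_backup").backupIdent());
  }
}
```

Building the override from the source identifier's namespace keeps the backup next to the table being backed up, which is the readability point raised in the review.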
| * @param tableName the table name for backup | ||
| * @return this for method chaining | ||
| */ | ||
| default MigrateTable withBackupTableName(String tableName) { |
I think we should drop the with prefix from the name, given that no other existing method has it.
| } | ||
|
|
||
| boolean dropBackup = args.isNullAt(2) ? false : args.getBoolean(2); | ||
| String backupTableName = args.isNullAt(3) ? "" : args.getString(3); |
Instead of using an empty string, I think we should use null and adapt the logic below.
MigrateTableSparkAction action =
actions().migrateTable(tableName).tableProperties(properties);
if (dropBackup) {
action.dropBackup();
}
if (backupTableName != null) {
action.backupTableName(backupTableName);
}
MigrateTable.Result result = action.execute();
Sure, thank you. Will update this part.
I believe that, to reflect the above code, the action variable needs to be reassigned in each if block. In the latest commit, I updated this part as below (because dropBackup() and my backupTableName return the MigrateTableAction type):
MigrateTableSparkAction migrateTableSparkAction =
SparkActions.get().migrateTable(tableName).tableProperties(properties);
if (dropBackup) {
migrateTableSparkAction = migrateTableSparkAction.dropBackup();
}
if (backupTableName != null) {
migrateTableSparkAction = migrateTableSparkAction.backupTableName(backupTableName);
}
MigrateTable.Result result = migrateTableSparkAction.execute();
If I misunderstood the comment, or if there are more recommendations, please let me know.
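The null-based handling of the optional argument discussed above can be illustrated with a small standalone sketch. FluentMigrate and ProcedureSketch are hypothetical names used only for this illustration and do not exist in Iceberg. The sketch assumes that a null argument means "not provided", so the setter is called only when the caller passed a value.

```java
// Hypothetical fluent action; each setter returns the same instance.
final class FluentMigrate {
  private boolean dropBackup = false;
  private String backupName = null; // null means "use the default backup name"

  FluentMigrate dropBackup() {
    this.dropBackup = true;
    return this;
  }

  FluentMigrate backupTableName(String name) {
    this.backupName = name;
    return this;
  }

  String describe() {
    return "dropBackup=" + dropBackup
        + ", backup=" + (backupName == null ? "<default>" : backupName);
  }
}

// Hypothetical procedure body: optional arguments are applied only when present.
final class ProcedureSketch {
  static String call(boolean dropBackup, String backupTableName) {
    FluentMigrate action = new FluentMigrate();

    if (dropBackup) {
      action = action.dropBackup();
    }

    // A null argument leaves the default untouched; no empty-string sentinel is needed.
    if (backupTableName != null) {
      action = action.backupTableName(backupTableName);
    }

    return action.describe();
  }
}

class ProcedureSketchDemo {
  public static void main(String[] args) {
    System.out.println(ProcedureSketch.call(false, null));
    System.out.println(ProcedureSketch.call(true, "sample_backup"));
  }
}
```

In this sketch the reassignment is harmless because every setter returns the same instance. Whether it is required in the real code depends on the declared return types of the action methods.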
|
Thanks for the review again! I updated as follows:
|
|
Thanks, @tomtongue! Thanks for reviewing, @ConeyLiu!
|
Thanks for kindly fixing and reviewing, @aokolnychyi @ConeyLiu!
Changes
Add support for configuring the backup table name in the Migrate procedure.
Details
Currently, the Iceberg migrate procedure keeps the table backup under the name <TABLE_NAME>_BACKUP_.
However, some catalogs, such as Glue Data Catalog, only accept lowercase table names, so this renaming operation can block the migrate from running.
This change lets users set a custom backup table name to avoid that restriction, while the default backup table name stays the same for backward compatibility.
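The conflict between the default backup name and a lowercase-only catalog can be illustrated with a small sketch. The isAllLowercase rule below is an assumption made for illustration and is not the actual validation performed by any specific catalog. BackupNameCheck is a hypothetical helper class.

```java
import java.util.Locale;

// Hypothetical helper that contrasts the default backup name with a custom one.
class BackupNameCheck {
  static final String BACKUP_SUFFIX = "_BACKUP_";

  // Default naming described in this PR: <TABLE_NAME>_BACKUP_
  static String defaultBackupName(String tableName) {
    return tableName + BACKUP_SUFFIX;
  }

  // Assumed rule for a catalog that accepts only lowercase table names.
  static boolean isAllLowercase(String name) {
    return name.equals(name.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    String defaultName = defaultBackupName("sample");
    String customName = "sample_backup";

    // The default name contains uppercase characters, so the assumed rule rejects it.
    System.out.println(defaultName + " accepted: " + isAllLowercase(defaultName));
    // A custom lowercase name passes the assumed rule.
    System.out.println(customName + " accepted: " + isAllLowercase(customName));
  }
}
```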