@tomtongue
Contributor

Changes

Add support for configuring the backup table name in the Migrate procedure.

Details

Currently, the Iceberg migrate procedure keeps the table backup as <TABLE_NAME>_BACKUP_.
However, some catalogs, such as Glue Data Catalog, only accept lowercase table names, so this renaming operation in the migrate procedure blocks the migration.

This change enables users to set a custom backup table name to avoid the restriction, while keeping the default table name backward compatible.
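
The naming behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class BackupNameSketch and both of its methods are hypothetical names, the _BACKUP_ suffix is taken from the description, and the lowercase check only models the Glue restriction as it is stated there.

```java
import java.util.Locale;

// Hypothetical illustration; none of these names come from the Iceberg code base.
final class BackupNameSketch {
  // Default suffix, as stated in the PR description.
  static final String BACKUP_SUFFIX = "_BACKUP_";

  // Returns the custom name when one is given, otherwise the default <table>_BACKUP_ name.
  static String resolveBackupName(String sourceTable, String customBackupName) {
    return customBackupName != null ? customBackupName : sourceTable + BACKUP_SUFFIX;
  }

  // Models a catalog that only accepts lowercase table names (an assumption for illustration).
  static boolean isAcceptedByLowercaseOnlyCatalog(String tableName) {
    return tableName.equals(tableName.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    String defaultName = resolveBackupName("sample", null);
    String customName = resolveBackupName("sample", "sample_backup");
    System.out.println(defaultName + " accepted: " + isAcceptedByLowercaseOnlyCatalog(defaultName));
    System.out.println(customName + " accepted: " + isAcceptedByLowercaseOnlyCatalog(customName));
  }
}
```

With the default suffix the backup name always contains uppercase characters, which is why a lowercase-only catalog would reject it regardless of the source table name.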

throw new UnsupportedOperationException("Dropping a backup is not supported");
}

default MigrateTable withBackupTableName(String tableName) {
Contributor

I think the doc is required.

Contributor Author

Thank you. Once the commit is merged, I will add the doc. Or, if I should add the doc along with this commit, please let me know.

Contributor

I mean the document for this method.

Contributor Author

Sure, let me add it to the method.

Contributor Author

Added the doc in this commit: 3bd716a

Contributor

Can we add the new property to spark-procedures.md to document it?

Contributor Author

@tomtongue tomtongue Aug 18, 2023

Of course. I added a new row for the backup_table_name argument as a draft. If there's anything I need to add or change, please let me know.

@tomtongue tomtongue requested a review from ConeyLiu August 9, 2023 03:07
}

@Override
public MigrateTableSparkAction withBackupTableName(String tableName) {
Contributor

We could implement Spark 3.4 for this PR and do a backport for other versions.

Contributor Author

Sure. I reverted the Spark 3.3 changes in the latest commit.


@Test
public void testMigrateWithBackupTableName() throws IOException {
Assume.assumeTrue(catalogName.equals("spark_catalog"));
Contributor

@ConeyLiu ConeyLiu Aug 9, 2023

Are there any reasons to skip spark_catalog?

Contributor Author

@tomtongue tomtongue Aug 9, 2023

This part checks whether the catalog is spark_catalog. The test runs only with spark_catalog (otherwise the migrate fails). Do you mean this part should be removed?

private final StagingTableCatalog destCatalog;
private final Identifier destTableIdent;
private final Identifier backupIdent;
private Identifier backupIdent;
Contributor

Put those un-final fields together?

Contributor Author

@tomtongue tomtongue Aug 9, 2023

I believe you mean that the final field should be kept and the non-final field should be newly added.
The backupIdent field is only referenced within this class, specifically in doExecute and in related methods such as rename, restore, and drop that are called from doExecute. Therefore, to keep the field final while making the backup table name flexible, I added a method parameter to each of those methods and resolve the table name in doExecute.

@ConeyLiu
Contributor

ConeyLiu commented Aug 9, 2023

However, some catalogs, such as Glue Data Catalog, only accept lowercase table names, so this renaming operation in the migrate procedure blocks the migration.

I think @jackye1995 @amogh-jahagirdar @singhpk234 have more knowledge about this.

Assert.assertEquals("Should have added one file", 1L, result);

String dbName = tableName.split("\\.")[0];
Assert.assertTrue(spark.catalog().tableExists(dbName + "." + backupTableName));
Contributor

@ConeyLiu ConeyLiu Aug 9, 2023

Please use AssertJ instead. You can refer to the contributing guide: https://iceberg.apache.org/contribute/#assertj

Contributor Author

Thanks for the advice. I replaced the assertions in the test with AssertJ.

@tomtongue tomtongue changed the title Spark 3.3, 3.4: Add backup table name support for Migrate procedure Spark 3.4: Add backup table name support for Migrate procedure Aug 9, 2023
@tomtongue
Contributor Author

However, some catalogs, such as Glue Data Catalog, only accept lowercase table names, so this renaming operation in the migrate procedure blocks the migration.

I think @jackye1995 @amogh-jahagirdar @singhpk234 have more knowledge about this.

Let me add my thoughts.
As described above, the current migrate keeps the source Spark table as a backup table named <src_table>_BACKUP_. This causes an Iceberg validation exception when the Glue Data Catalog implementation in Iceberg handles such a table.
Also, the backup table can be kept without being dropped, which can lead to table name conflicts.

The ability to specify the backup table name is needed to expand the capability and avoid name conflicts. Therefore, I submitted this PR.

}

/**
* Sets a table name for the backup of the original table
Contributor

Minor: Missing . at the end of the sentence?

Contributor Author

Sorry about this. I will add the . at the end of the sentence in the next commit.


private boolean dropBackup = false;

private String backupTableName = "";
Contributor

@aokolnychyi aokolnychyi Aug 15, 2023

What about placing this var next to dropBackup? Both of them are non-final variables that can be overridden.

Also, what about just making Identifier backupIdent non-final but keeping the type and the initialization in the constructor? We can construct an identifier in withBackupTableName. That way, we should be able to reduce the amount of changes.

Contributor Author

Thanks for the suggestion. It totally makes sense to me. Based on this comment, I updated the code as follows:

  • restore the backupIdent variable and its initialization in the constructor, making the variable non-final
  • add the logic that changes the backup table to the withBackupTableName method newly added in this PR
  • remove the backupTableName variable and the related update logic, as a consequence of the two changes above


String backupName;
if (backupTableName.isEmpty()) {
backupName = this.destTableIdent.name() + BACKUP_SUFFIX;
Contributor

Minor: We usually don't use this. when accessing fields, only while setting.

Contributor Author

Thanks for the advice. Will remove it.

…eName method and add typos and parameter-call
@tomtongue
Contributor Author

Thanks for the review. I pushed a commit that reflects the comments. I would be happy if you could review the new one. @aokolnychyi

@tomtongue tomtongue requested a review from aokolnychyi August 16, 2023 05:15

@Override
public MigrateTableSparkAction withBackupTableName(String tableName) {
if (!tableName.isEmpty()) {
Contributor

@aokolnychyi aokolnychyi Aug 17, 2023

This behavior for checking if tableName is empty seems a bit weird to me. If this method is called, I assume someone wants to override the backup table name. I think we should just go ahead and set backupIdent.

I know the dest and source identifiers are the same, but it would be more readable to use sourceTableIdent() from the parent class in this case. We are backing up the source table, not the destination.

Contributor Author

Thanks for the suggestion. It is correct and I totally agree with it. I updated the code as follows:

  • remove the check for an empty table name, and directly set the backup table name on backupIdent
  • use sourceTableIdent when updating backupIdent.
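
The design agreed on in this thread might look roughly like the following sketch. It is not the merged code: Ident is a hypothetical stand-in for Spark's Identifier, MigrateActionSketch stands in for MigrateTableSparkAction, and the accessors exist only so the example can be checked. The method keeps the name withBackupTableName that it has at this point in the review.

```java
// Hypothetical stand-in for Spark's Identifier (namespace plus table name).
final class Ident {
  final String namespace;
  final String name;

  Ident(String namespace, String name) {
    this.namespace = namespace;
    this.name = name;
  }

  @Override
  public String toString() {
    return namespace + "." + name;
  }
}

// Hypothetical stand-in for the migrate action discussed in this review.
final class MigrateActionSketch {
  private static final String BACKUP_SUFFIX = "_BACKUP_";

  private final Ident sourceTableIdent;

  // Non-final fields that callers can override are kept together.
  private boolean dropBackup = false;
  private Ident backupIdent;

  MigrateActionSketch(Ident sourceTableIdent) {
    this.sourceTableIdent = sourceTableIdent;
    // The default backup identifier is still built in the constructor.
    this.backupIdent = new Ident(sourceTableIdent.namespace, sourceTableIdent.name + BACKUP_SUFFIX);
  }

  // No empty-string check: calling this method always overrides the backup name.
  // The namespace comes from the source identifier, since the source table is what gets backed up.
  MigrateActionSketch withBackupTableName(String tableName) {
    this.backupIdent = new Ident(sourceTableIdent.namespace, tableName);
    return this;
  }

  MigrateActionSketch dropBackup() {
    this.dropBackup = true;
    return this;
  }

  Ident backupIdent() {
    return backupIdent;
  }

  boolean isDropBackup() {
    return dropBackup;
  }
}
```

Keeping the default in the constructor means callers that never set a backup name see the same <table>_BACKUP_ identifier as before.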

* @param tableName the table name for backup
* @return this for method chaining
*/
default MigrateTable withBackupTableName(String tableName) {
Contributor

I think we should drop the with prefix from the name, given that no other existing methods have it.

}

boolean dropBackup = args.isNullAt(2) ? false : args.getBoolean(2);
String backupTableName = args.isNullAt(3) ? "" : args.getString(3);
Contributor

Instead of using an empty string, I think we should use null and adapt the logic below.

MigrateTableSparkAction action =
    actions().migrateTable(tableName).tableProperties(properties);

if (dropBackup) {
  action.dropBackup();
}

if (backupTableName != null) {
  action.backupTableName(backupTableName);
}

MigrateTable.Result result = action.execute();

Contributor Author

Sure, thank you. Will update this part.

Contributor Author

I believe that, to reflect the above code, the action variable needs to be reassigned in each if block. In the latest commit, I updated this part as below (because dropBackup() and my backupTableName return the MigrateTableAction type):

    MigrateTableSparkAction migrateTableSparkAction =
        SparkActions.get().migrateTable(tableName).tableProperties(properties);

    if (dropBackup) {
      migrateTableSparkAction = migrateTableSparkAction.dropBackup();
    }

    if (backupTableName != null) {
      migrateTableSparkAction = migrateTableSparkAction.backupTableName(backupTableName);
    }

    MigrateTable.Result result = migrateTableSparkAction.execute();

If I misunderstood the comment, or if there are further recommendations, please let me know.
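
As a side note on the two variants above, here is a minimal sketch of how they relate, assuming the fluent setters return this, as the method Javadoc earlier in this review states ("@return this for method chaining"). FluentActionSketch and its methods are hypothetical stand-ins, not the Iceberg classes. Under that assumption, reassigning the variable and ignoring the return value configure the same object, and null serves as the "not provided" marker instead of an empty string.

```java
// Hypothetical stand-in for a fluent action whose setters return this.
final class FluentActionSketch {
  private boolean dropBackup = false;
  private String backupTableName = null; // null means the default name is used

  FluentActionSketch dropBackup() {
    this.dropBackup = true;
    return this;
  }

  FluentActionSketch backupTableName(String tableName) {
    this.backupTableName = tableName;
    return this;
  }

  String describe() {
    return "dropBackup=" + dropBackup + ", backupTableName=" + backupTableName;
  }

  // Variant 1: ignore the return values of the setters.
  static FluentActionSketch configureInPlace(boolean dropBackup, String backupTableName) {
    FluentActionSketch action = new FluentActionSketch();
    if (dropBackup) {
      action.dropBackup();
    }
    if (backupTableName != null) {
      action.backupTableName(backupTableName);
    }
    return action;
  }

  // Variant 2: reassign the variable after each setter call.
  static FluentActionSketch configureWithReassignment(boolean dropBackup, String backupTableName) {
    FluentActionSketch action = new FluentActionSketch();
    if (dropBackup) {
      action = action.dropBackup();
    }
    if (backupTableName != null) {
      action = action.backupTableName(backupTableName);
    }
    return action;
  }
}
```

If a setter returned a new object instead of this, only the reassigning variant would keep the configuration, so reassigning is the safer form when the return contract is not known.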

@github-actions github-actions bot added the docs label Aug 18, 2023
@tomtongue
Contributor Author

Thanks for the review again! I updated as follows:

  • Changing the method name to backupTableName (remove with)
  • Updating the logic of dropBackup and backupTableName
  • Adding the backup_table_name parameter description to spark-procedures.md

@tomtongue tomtongue requested a review from aokolnychyi August 18, 2023 12:03
@aokolnychyi aokolnychyi merged commit 87d2a92 into apache:master Aug 21, 2023
@aokolnychyi
Contributor

Thanks, @tomtongue! Thanks for reviewing, @ConeyLiu!

@tomtongue
Contributor Author

Thanks for kindly fixing and reviewing, @aokolnychyi @ConeyLiu !

@tomtongue tomtongue deleted the backup-table-name branch September 1, 2023 09:04