@aokolnychyi
Contributor

This PR prohibits removal of data files if GC is disabled. Fixes #2366.
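A minimal sketch of the kind of guard this describes, not the code in the PR itself. It assumes the table properties are available as a plain map and that `gc.enabled` defaults to true when unset; the class and method names are hypothetical.

```java
import java.util.Map;

class GcGuard {
  // Assumption: property name taken from this thread; the default of true is assumed.
  static final String GC_ENABLED = "gc.enabled";
  static final boolean GC_ENABLED_DEFAULT = true;

  // Returns true only when it is acceptable to remove data files on drop.
  static boolean mayDeleteDataFiles(Map<String, String> tableProperties) {
    String value = tableProperties.get(GC_ENABLED);
    return value == null ? GC_ENABLED_DEFAULT : Boolean.parseBoolean(value);
  }
}
```

A drop routine would consult `mayDeleteDataFiles` before touching data files and skip that step when it returns false.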

@RussellSpitzer
Member

How does a user delete files that are added to the snapshot but not part of the original table?

@aokolnychyi
Contributor Author

aokolnychyi commented Mar 24, 2021

Ugh, good question. We could delete files only under the data subfolder, but that won't be sufficient in the general case. Suppose someone creates a table, sets gc.enabled to false, and calls add_files. We need another approach.

Let me sleep on this issue. I don't have a good solution.

@pvary
Contributor

pvary commented Mar 24, 2021

Hive has its own parameters for dropping files, which we should keep consistent with gc.enabled, or at least consider in relation to it.
See:

// Allow purging table data if the table is created now and not set otherwise
if (hmsTable.getParameters().get(InputFormatConfig.EXTERNAL_TABLE_PURGE) == null) {
  hmsTable.getParameters().put(InputFormatConfig.EXTERNAL_TABLE_PURGE, "TRUE");
}

and:

public void commitDropTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, boolean deleteData) {
  if (deleteData && deleteIcebergTable) {
    try {
      if (!Catalogs.hiveCatalog(conf)) {
        LOG.info("Dropping with purge all the data for table {}.{}", hmsTable.getDbName(), hmsTable.getTableName());
        Catalogs.dropTable(conf, catalogProperties);
      } else {
        // do nothing if metadata folder has been deleted already (Hive 4 behaviour for purge=TRUE)
        if (deleteIo.newInputFile(deleteMetadata.location()).exists()) {
          CatalogUtil.dropTableData(deleteIo, deleteMetadata);
        }
      }
    } catch (Exception e) {

@pvary
Contributor

pvary commented Mar 24, 2021

CC: @marton-bod

@marton-bod
Collaborator

From a Hive user's perspective, I would expect my data files to be dropped purely on the basis of whether external.table.purge=TRUE is set, and not have that silently overridden by the Iceberg-specific gc.enabled property. There might be surprises for Hive users if purge=TRUE but gc.enabled=false, so that the data files are not actually removed.

I'm wondering if we should keep the two props in sync somehow via the HiveTableOperations, or at least provide better documentation on the possible Hive DROP TABLE behaviours?

@RussellSpitzer
Member

I think for snapshots you only want to delete files that were created in commits after the original import. That should be possible but complicated. Basically, just delete everything added after the first Iceberg snapshot.
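A rough sketch of that idea, assuming the commit history is reduced to a list of "files added per commit" ordered oldest first, with the first entry being the import. The class and method names are hypothetical, and this does not use Iceberg's snapshot API. It is only valid when every later commit was written through Iceberg.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SnapshotTableCleanup {
  // addedFilesPerCommit.get(0) is assumed to be the original import.
  static List<String> filesAddedAfterImport(List<List<String>> addedFilesPerCommit) {
    if (addedFilesPerCommit.isEmpty()) {
      return List.of();
    }
    // Files from the import belong to the original table and must never be deleted.
    Set<String> imported = new HashSet<>(addedFilesPerCommit.get(0));
    // LinkedHashSet keeps commit order and drops duplicates.
    Set<String> deletable = new LinkedHashSet<>();
    for (List<String> added : addedFilesPerCommit.subList(1, addedFilesPerCommit.size())) {
      for (String path : added) {
        if (!imported.contains(path)) {
          deletable.add(path);
        }
      }
    }
    return new ArrayList<>(deletable);
  }
}
```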

@aokolnychyi
Contributor Author

> I think for snapshots you only want to delete files that were created in commits after the original import. That should be possible but complicated. Basically, just delete everything added after the first Iceberg snapshot.

Yes and no. That's the case for snapshot tables, but what if someone creates an Iceberg table with GC disabled and uses it just to promote files that were not written through Iceberg? Then deleting data beyond the first snapshot is not safe either. Maybe we can take the snapshot table property into account.

@aokolnychyi
Contributor Author

Let me see what we do in the Hive codebase.

@aokolnychyi
Contributor Author

A few questions, @marton-bod @pvary.

  • How does Hive populate deleteData in commitDropTable?
  • Is it correct that external.table.purge is not specific to Iceberg and is used for other tables too?

@rdblue
Contributor

rdblue commented Mar 24, 2021

I think that this fix is a good starting point. For files that are added to a snapshot table after creation, it is better to leak those than to delete files from the original table. I think we should move forward with this PR and then try to catch more data files later.

As for the Hive property, I'm not sure what to do but it is a good thing to consider separately. What respects that property? I think Iceberg uses it somewhere, but that may just be in the Hive code. I think having a clear understanding of what it does and where it is currently used is the right place to start.

ImmutableList.of(row(1L, "a")),
sql("SELECT * FROM %s", tableName));

sql("DROP TABLE %s", tableName);
Contributor

Do we want to assert that the metadata files for the snapshot table are removed?

Contributor Author

It would fail for the session catalog due to #2374.

Contributor

@rdblue rdblue left a comment

Overall, this looks good and I would move ahead with it for 0.11.1.

@aokolnychyi aokolnychyi merged commit 90225d6 into apache:master Mar 25, 2021
@aokolnychyi
Contributor Author

Thanks, everyone! I think we should continue discussing the purge flag in Hive.

@marton-bod
Collaborator

@aokolnychyi Thanks for your questions, Anton.

> How does Hive populate deleteData in commitDropTable?

deleteData is set to true by default here: Hive.java. This ensures data is automatically deleted for managed tables.
However, before Hive 4, data is not deleted for external tables: HiveMetaStore.java.
Since Hive 4, deletion can be forced even for external tables with the external.table.purge=TRUE property: Docs

> Is it correct that external.table.purge is not specific to Iceberg and is used for other tables too?

Correct, it's used for Hive tables as well.

So at the moment, all Hive Iceberg tables are external, therefore their data is not dropped automatically by Hive2 and Hive3 (Hive4 does so with the purge flag). Currently our solution for that is to check for the external.table.purge property in the HiveIcebergMetaHook, and if set, call Catalogs.dropTable/CatalogUtil.dropTableData to delete the data ourselves in the postHook.
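To make the interaction concrete, here is a simplified model of what a DROP TABLE could end up doing once both flags are considered. It is a sketch, not the hook's actual code: the class, enum, and method names are hypothetical, both property sets are treated as plain maps, and the defaults (purge off when unset, gc.enabled on when unset) are assumptions.

```java
import java.util.Map;

class DropTableOutcome {
  enum Outcome { LEAVE_ALL_FILES, DELETE_METADATA_ONLY, DELETE_DATA_AND_METADATA }

  static Outcome onDropTable(Map<String, String> hmsParams, Map<String, String> icebergProps) {
    // Boolean.parseBoolean is case-insensitive, so "TRUE" and "true" both count.
    boolean purge = Boolean.parseBoolean(hmsParams.getOrDefault("external.table.purge", "false"));
    boolean gcEnabled = Boolean.parseBoolean(icebergProps.getOrDefault("gc.enabled", "true"));
    if (!purge) {
      // External table without purge: the hook does not delete anything.
      return Outcome.LEAVE_ALL_FILES;
    }
    // With purge requested, gc.enabled=false still protects the data files.
    return gcEnabled ? Outcome.DELETE_DATA_AND_METADATA : Outcome.DELETE_METADATA_ONLY;
  }
}
```

The case purge=TRUE with gc.enabled=false yields `DELETE_METADATA_ONLY`, which is the combination that may surprise a Hive user.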

A Hive user should be able to decide whether or not data deletion is desirable for a table by changing the table property external.table.purge. Going forward, some of the options I see to avoid surprises:

  • Synchronizing external.table.purge with gc.enabled. When one changes, the other is changed accordingly. This could already work when changing the gc.enabled (and then altering the purge prop in HMS accordingly), but not the other way around (since ALTER TABLE SET TBLPROPERTIES is not implemented yet to push down HMS prop changes to Iceberg - although this is on our roadmap already).
  • Adding the gc.enabled prop to the HMS table props as well, making it control the data deletion for Iceberg tables via the MetaHook (+ updating the docs to let users know of this), and letting users change this prop via ALTER TABLE as desired. I don't really like this one, because it introduces an inconsistency between Iceberg and Hive tables.
  • Leaving things as is, but updating the Hive docs so that users are aware that DROP TABLE for Iceberg tables might behave differently depending on Iceberg configuration, and improve logging. I'm no big fan of this either.

At first glance, I think I'd prefer the first option. Any thoughts on this?
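As a sketch of the first option in the direction that could already work (a gc.enabled change pushed into the HMS props), assuming both property sets are plain maps and that gc.enabled is on when unset. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class PurgeFlagSync {
  // Returns a copy of the HMS parameters with external.table.purge mirroring gc.enabled.
  static Map<String, String> withPurgeSynced(Map<String, String> hmsParams, Map<String, String> icebergProps) {
    boolean gcEnabled = Boolean.parseBoolean(icebergProps.getOrDefault("gc.enabled", "true"));
    Map<String, String> updated = new HashMap<>(hmsParams);
    updated.put("external.table.purge", gcEnabled ? "TRUE" : "FALSE");
    return updated;
  }
}
```

The reverse direction would need the ALTER TABLE SET TBLPROPERTIES pushdown, so it is not covered here.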

Development

Successfully merging this pull request may close these issues.

Respect gc.enabled property while dropping tables