Core: Don't delete data files on DROP if GC is disabled #2367
|
How does a user delete files that are added to the snapshot but not part of the original table? |
|
Ugh, good question. We could delete files only under the data subfolder but that won't be sufficient in a generic case. Suppose someone creates a table, sets Let me sleep on this issue. I don't have a good solution. |
|
Hive has its own parameters to drop files, which we should also keep consistent or at least consider with relation to the iceberg/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java Lines 121 to 124 in 469f6c3
and: iceberg/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java Lines 172 to 184 in 469f6c3
|
|
CC: @marton-bod |
|
From a Hive user perspective, I would expect my data files to be dropped purely on the basis of whether I'm wondering if we should keep the two props in sync somehow via the |
|
I think for snapshots you only want to delete files that were created in commits after the original import. That should be possible but complicated. Basically, just delete everything added after the first Iceberg snapshot. |
Yes and no. That's the case for snapshot, but what if someone creates an Iceberg table with GC disabled and uses it to just promote files that were not written through Iceberg? Then deleting data beyond the first snapshot is not safe either. Maybe, we can take |
|
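The "delete everything added after the first Iceberg snapshot" idea above can be sketched roughly as follows. This is only an illustration, not code from the PR: the class `SnapshotFileDiff`, the method name, and the representation of a table's history as a list of added-file paths per snapshot are all hypothetical. As the reply points out, the result is still not safe to delete when files written outside Iceberg are promoted in later commits.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Iceberg.
final class SnapshotFileDiff {

  // addedFilesPerSnapshot.get(i) holds the data file paths added by the i-th snapshot,
  // oldest first. Index 0 is assumed to be the original import.
  static Set<String> filesAddedAfterFirstSnapshot(List<List<String>> addedFilesPerSnapshot) {
    Set<String> candidates = new LinkedHashSet<>();
    if (addedFilesPerSnapshot.isEmpty()) {
      return candidates;
    }

    // Files referenced by the first snapshot belong to the source table and are never candidates.
    Set<String> imported = new LinkedHashSet<>(addedFilesPerSnapshot.get(0));

    for (List<String> added : addedFilesPerSnapshot.subList(1, addedFilesPerSnapshot.size())) {
      for (String path : added) {
        if (!imported.contains(path)) {
          candidates.add(path);
        }
      }
    }
    return candidates;
  }
}
```

The sketch errs on the side of leaking files: anything that appears in the first snapshot is excluded even if a later commit adds it again.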
Let me see what we do in the Hive codebase. |
|
A few questions, @marton-bod @pvary.
|
|
I think that this fix is a good starting point. For files that are added to a snapshot table after creation, it is better to leak those than to delete files from the original table. I think we should move forward with this PR and then try to catch more data files later. As for the Hive property, I'm not sure what to do but it is a good thing to consider separately. What respects that property? I think Iceberg uses it somewhere, but that may just be in the Hive code. I think having a clear understanding of what it does and where it is currently used is the right place to start. |
    ImmutableList.of(row(1L, "a")),
    sql("SELECT * FROM %s", tableName));

    sql("DROP TABLE %s", tableName);
Do we want to assert that the metadata files for the snapshot table are removed?
It would fail for the session catalog due to #2374.
rdblue
left a comment
Overall, this looks good and I would move ahead with it for 0.11.1.
|
Thanks, everyone! I think we should continue discussing the purge flag in Hive. |
|
@aokolnychyi Thanks for your questions, Anton.
Correct, it's used for Hive tables as well. So at the moment, all Hive Iceberg tables are external, therefore their data is not dropped automatically by Hive2 and Hive3 (Hive4 does so with the purge flag). Currently our solution for that is to check for the A Hive user should be able to decide whether or not data deletion is desirable for a table by changing the table property
At first glance, I think I'd prefer the first option. Any thoughts on this? |
This PR prohibits removal of data files if GC is disabled. Fixes #2366.
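A minimal sketch of the check described here, assuming the GC flag is the `gc.enabled` table property with a default of `true`. The class `DropTableGuard` and its method are hypothetical names used for illustration; this is not the code changed by the PR.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg.
final class DropTableGuard {

  // Assumed property name and default for the GC flag.
  static final String GC_ENABLED = "gc.enabled";
  static final boolean GC_ENABLED_DEFAULT = true;

  // Returns true only when GC is enabled, i.e. when the table is assumed to own its data files.
  static boolean shouldDeleteDataFiles(Map<String, String> tableProperties) {
    String value = tableProperties.get(GC_ENABLED);
    if (value == null) {
      return GC_ENABLED_DEFAULT;
    }
    return Boolean.parseBoolean(value);
  }
}
```

A drop path would consult this before removing data files, for example `if (DropTableGuard.shouldDeleteDataFiles(props)) { ... }`, and leave the files in place otherwise.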