Encryption for REST catalog #13225
// TODO(smaheshwar): This test is taken from https://github.com/apache/iceberg/pull/13066, with the
// exception of testCtas, but adapted for the REST catalog. When that merges, we can directly use
// those tests for the REST catalog as well by adding to the parameters method there, to have a
// single test class for table encryption.
Highlighting this - this file is largely taken from #13066
: Integer.parseInt(dekLength);
}

// Force re-creation of encryptingFileIO and encryptionManager
See #13066 (comment)
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
|
(Bump to remove staleness) |
d007aa2 to 889cbbf
 * @param dataKeyLength length of data encryption key (16/24/32 bytes)
 * @param kmsClient Client of KMS used to wrap/unwrap keys in envelope encryption
 */
public StandardEncryptionManager(
Could deprecate the old constructor
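For illustration, a minimal sketch of what deprecating the old constructor could look like. The class, the `KeyWrapper` interface, and the choice of `dataKeyLength` as the newly added parameter are all hypothetical stand-ins, not the actual `StandardEncryptionManager` signature; the default of 16 bytes is likewise an assumption.

```java
import java.util.Objects;

/** Hypothetical stand-in for a KMS client; not the Iceberg interface. */
interface KeyWrapper {
  byte[] wrap(byte[] key);
}

/** Hypothetical manager used only to illustrate constructor deprecation. */
final class SketchEncryptionManager {
  private final String tableKeyId;
  private final int dataKeyLength;
  private final KeyWrapper kmsClient;

  /**
   * Old constructor, kept for source compatibility and delegating to the new one.
   *
   * @deprecated use {@link #SketchEncryptionManager(String, int, KeyWrapper)} instead
   */
  @Deprecated
  SketchEncryptionManager(String tableKeyId, KeyWrapper kmsClient) {
    // Assumed default data key length of 16 bytes.
    this(tableKeyId, 16, kmsClient);
  }

  /** New constructor that takes every dependency explicitly. */
  SketchEncryptionManager(String tableKeyId, int dataKeyLength, KeyWrapper kmsClient) {
    if (dataKeyLength != 16 && dataKeyLength != 24 && dataKeyLength != 32) {
      throw new IllegalArgumentException("Invalid data key length: " + dataKeyLength);
    }
    this.tableKeyId = Objects.requireNonNull(tableKeyId, "tableKeyId");
    this.dataKeyLength = dataKeyLength;
    this.kmsClient = Objects.requireNonNull(kmsClient, "kmsClient");
  }

  int dataKeyLength() {
    return dataKeyLength;
  }
}
```

Existing callers keep compiling (with a deprecation warning), while new callers are steered to the full constructor.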
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.TestTemplate;

// TODO(smaheshwar-pltr): This test is taken from https://github.com/apache/iceberg/pull/13066, with
I added a CTAS test to the ones authored by @ggershinsky here. If this PR happens to merge first, we can drop this comment. There are also review comments on these tests in #13066 that are potentially relevant.
|
Given #7770 is merged, curious for thoughts on this PR. |
|
Could you elaborate on the api additions? I think it would help to have some more description on the general direction of this or |
|
@smaheshwar-pltr Could you please resolve the conflicts? |
|
@huaxingao @smaheshwar-pltr Our team has a person who works on encryption with the REST catalog. If @smaheshwar-pltr does not object, we can follow up on this patch. |
return encryptionManager;
}

private void encryptionPropsFromMetadata(TableMetadata metadata) {
Is this method applied on the TableMetadata that is fetched directly from the REST catalog, and not from the metadata.json file? Both are possible, but the former must override (and check) the latter, to protect against the key removal and other attacks.
Yes this method will always be applied on a metadata field of a LoadTableResponse object received directly from the REST catalog (its only usage within this class is as such, and you can check the constructor usages within RESTSessionCatalog to confirm that the metadata coming in from there is as such too.
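As a rough sketch of one way to read "override (and check)" from the question above: the value received from the catalog is authoritative, and a value found in the metadata file is only allowed to agree with it. The class and method names are hypothetical and the exact policy is an assumption, not what this PR implements.

```java
/** Hypothetical helper; the policy below is an assumption for illustration. */
final class KeyIdResolver {
  private KeyIdResolver() {}

  /**
   * Returns the key ID to use for a table.
   *
   * @param fromCatalog key ID received directly from the catalog (trusted), may be null
   * @param fromMetadataFile key ID read from metadata.json in storage (untrusted), may be null
   */
  static String resolveKeyId(String fromCatalog, String fromMetadataFile) {
    // Check: a key ID present in the file must match what the catalog says.
    if (fromMetadataFile != null && !fromMetadataFile.equals(fromCatalog)) {
      throw new IllegalStateException(
          "Key ID in metadata file does not match the key ID provided by the catalog");
    }
    // Override: the catalog value wins, so removing the key ID from the file has no effect.
    return fromCatalog;
  }
}
```

With this policy, a file that has had its key ID stripped still resolves to the catalog's key ID, and a file that names a different key is rejected.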
Question (not sure if there have been discussions here or if your team have thoughts): we want the key ID to come from the REST catalog service directly for security reasons.
It's typical for REST catalogs to provide metadata that corresponds to the metadata file in storage and not modify it apart from that. Given this, would it be preferable to have this field returned within the LoadTableResponse itself, to encourage catalogs to track it explicitly?
The concrete proposal here might be: ENCRYPTION_TABLE_KEY and ENCRYPTION_DEK_LENGTH become properties on the LoadTableResponse's config (mentioned in the REST spec here).
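To make the proposal concrete, here is a minimal sketch of a client reading both values from the response's config map. The property key strings and the default of 16 are assumptions (check `TableProperties` for the real constants), and the helper class is hypothetical. The 16/24/32 check mirrors the constructor javadoc quoted elsewhere in this PR.

```java
import java.util.Map;

/** Hypothetical helper for reading encryption settings from a config map. */
final class EncryptionConfigSketch {
  // Assumed property keys; the real constants live in Iceberg's TableProperties.
  static final String ENCRYPTION_TABLE_KEY = "encryption.key-id";
  static final String ENCRYPTION_DEK_LENGTH = "encryption.data-key-length";
  // Assumed default data key length in bytes.
  static final int DEK_LENGTH_DEFAULT = 16;

  private EncryptionConfigSketch() {}

  /** Returns the table key ID, or null if the table is not encrypted. */
  static String tableKeyId(Map<String, String> config) {
    return config.get(ENCRYPTION_TABLE_KEY);
  }

  /** Returns the data key length in bytes, falling back to the default. */
  static int dekLength(Map<String, String> config) {
    String dekLength = config.get(ENCRYPTION_DEK_LENGTH);
    int length = dekLength == null ? DEK_LENGTH_DEFAULT : Integer.parseInt(dekLength);
    if (length != 16 && length != 24 && length != 32) {
      throw new IllegalArgumentException("Invalid data key length: " + length);
    }
    return length;
  }
}
```

Because the values come from the response config rather than from table properties in the metadata file, a catalog has to track them explicitly to return them.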
@ggershinsky @huaxingao LMKWYT!
Well I see two scenarios when thinking about this:
1. metadata.json is something that both the server and the clients can read (although clients wouldn't need to, given they get the metadata with the LoadTableResponse)
2. metadata.json can only be accessed on the server side and clients are not given FS credentials (either vended or not) to reach it
For case (1) I totally agree: we can't rely on metadata.json alone to store these encryption properties. The catalog should store them separately too, and eventually populate them in the LoadTableResponse it creates (i.e. do the override logic referred to by @ggershinsky).
For case (2) I'm not 100% sure, but still leaning toward the catalog taking on this responsibility.
Either way, on the client side there's not much we can do other than recommend that clients consider only the metadata from LoadTableResponse. The rest (no pun intended) is for the server side to decide and will be implementation-specific. For the code snippet above, this is irrelevant IMHO.
Let me know your thoughts.
Afaik, HMS is not optimized for JSON storage. But maybe someone in the community will take on storing the full metadata object there, to improve table security. Having only the table properties is barely sufficient. I think we should recommend REST catalogs for encrypted tables.
It's not. But would storing a hash suffice? If so we could generate the hash of the whole JSON content and store it via an additional (Hive) table property. Then during table loading we can verify that the TableMetadata we just read in from a potentially untrusted storage (and yet metadata.json is not encrypted) is original or has been tampered with.
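A minimal sketch of the hash idea, assuming SHA-256 over the raw JSON text and a hex-encoded digest stored in a table property. The property name and helper class are hypothetical, and nothing here reflects an agreed design.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for detecting tampering with metadata.json content. */
final class MetadataHashSketch {
  // Hypothetical table property name under which the catalog would keep the hash.
  static final String METADATA_HASH_PROP = "metadata-hash-sha256";

  private MetadataHashSketch() {}

  /** Computes the lowercase hex SHA-256 digest of the metadata JSON content. */
  static String sha256Hex(String metadataJson) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(metadataJson.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is not available", e);
    }
  }

  /** Throws if the JSON read from storage does not match the hash kept by the catalog. */
  static void verify(String metadataJson, String expectedHex) {
    byte[] actual = sha256Hex(metadataJson).getBytes(StandardCharsets.US_ASCII);
    byte[] expected = expectedHex.getBytes(StandardCharsets.US_ASCII);
    // MessageDigest.isEqual avoids leaking the mismatch position through timing.
    if (!MessageDigest.isEqual(actual, expected)) {
      throw new IllegalStateException("metadata.json content does not match the stored hash");
    }
  }
}
```

On commit, the writer would store `sha256Hex(json)` under the property; on load, the reader would call `verify` before trusting the parsed TableMetadata. Hashing the raw bytes rather than a re-serialized object avoids false mismatches from JSON formatting differences.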
One aspect of having an encrypted metadata.json is when the table schema is also considered a sensitive piece of information. I haven't found this in the discussions but do you know if this has ever been considered @ggershinsky ?
would storing a hash suffice? If so we could generate the hash of the whole JSON content and store it via an additional (Hive) table property. Then during table loading we can verify that the TableMetadata we just read in from a potentially untrusted storage (and yet metadata.json is not encrypted) is original or has been tampered with.
I think it's a good idea
One aspect of having an encrypted metadata.json is when the table schema is also considered a sensitive piece of information. I haven't found this in the discussions but do you know if this has ever been considered
Not sure. Though, it should be possible to have a REST implementation that hides the metadata.json file from the storage.
I think it's a good idea
Sounds good, I can take this on and will produce a PR shortly.
Not sure. Though, it should be possible to have a REST implementation that hides the metadata.json file from the storage.
Yes, with REST that's true. I just meant it in a general sense, e.g. it's not currently possible to hide the schema of an encrypted table with the Hive catalog. It may just be one more thing to note/document as a limitation of encryption with the Hive catalog. I only wanted to highlight this.
|
Also, it would be good to refactor (if possible) the code common to this PR and to #13066, so that other catalogs will be able to re-use it.
ggershinsky
left a comment
Thanks @smaheshwar-pltr !
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
Not stale - I have a PR to separately fix the failing test this PR flagged that I'm working on |
|
@singhpk234 @huaxingao, may I get this PR reopened please? |
|
@smaheshwar-pltr I removed the stale label, but somehow the reopen button is dim for me. Can you reopen on your side? |
|
reopen |
|
@huaxingao apologies, I think some force pushing I did messed things up. I believe things should be good now; could you check again whether you can reopen?
 * @param mutationHeaderSupplier a supplier for additional HTTP headers to include in mutation
 *     requests (POST/DELETE)
 * @param fileIO the FileIO implementation for reading and writing table metadata and data files
 * @param kmsClient the {@link KeyManagementClient} for encrypted tables
// This tests that encryption keys are maintained on refreshing different metadata - if
// they are not, the table will be unreadable and this will fail.
(I've verified that this test fails without the encryption-key tracking changes described in #14752. In other words, if we instead always construct the encryption manager from the current metadata in the table operations class, which is wrong, this test fails, and it is the only one to fail. I've also documented the nuance here.)
|
@huaxingao @singhpk234, I believe that all comments and discussions are addressed, so this is ready for another review! It would be great to get this in soon if possible, to prevent conflicts from growing and to avoid possible breaks (#13225 (comment)).
645c75f to 1e42400
szlta
left a comment
+1 non-binding - thanks @smaheshwar-pltr, this looks good to me
This PR implements client-side support for REST catalog encryption. With it, clients interacting with a REST catalog can read and write encrypted data.
It is similar to #13066, which integrates encryption with the Hive catalog.
cc @rdblue @RussellSpitzer @ggershinsky