Security Issues in Querying Encrypted Data
Introduction
There has been considerable interest in the notion of a secure database service: a Database Management System that could manage a database without knowing the contents [11]. While the business model is compelling, it is important that such a system be provably secure. Existing proposals have problems in this respect; the security provided leaves room for information leaks. Any method for database encryption that does not meet rigorous cryptography-based standards of security must be used carefully. For example, methods that quantize or bin values [11] reveal data distributions. Methods that hide distribution, but preserve order [3], can also disclose information if used naïvely. While they may effectively hide values in isolation, using such techniques on multiple attributes in a tuple can pose dangers. We provide more discussion of the successes and potential pitfalls of related work in Section 6; we now give an example of how naïve use of prior proposals can disclose sensitive information.

Suppose a bank is trying to find who is responsible for missing money (e.g., fraud or embezzlement). They have gathered information on suspect employees and customers. Even though much of the information is publicly known (name, size of mortgage, age, postal code, ...), simply revealing who is being investigated is sensitive: the appearance that you are accusing a customer of fraud could well lead to a libel suit. Therefore they have encrypted each of the values using an order-preserving encryption scheme. Are they protected? The answer is probably not. Assume a newspaper wants to know if an individual, Chris, is being investigated. They obtain the encrypted database. They know that the name Chris would rank at about 15% of all names, so if it appears in the encrypted database, it will be roughly in that position (the range for a given sample size and probability can be calculated using order statistics). The newspaper can do the same with size of mortgage, age, and other known data about Chris and about the other employees/customers of the bank.
If there is a tuple in the database whose rank on all attributes is close to the corresponding rank of Chris (in the overall dataset), and there is no other tuple among the customers/employees whose ranks are similar, then the newspaper knows that, with high probability, Chris is under investigation. The key problem is that while encrypting a single value using order-preserving encryption or a binning scheme may reveal little information, supporting multiple index keys for each tuple reveals a surprising amount. To protect against such naïve misuse of order-preserving, homomorphic, or other such encryption techniques, we propose definitions for what it means for an encrypted database to be secure.
This paper presents a vision for how research enabling a secure database service should proceed: establish solid definitions of "secure", develop encryption and query processing techniques that meet those definitions, and demonstrate that such techniques have practical promise. We start with definitions of security from the cryptography community that have withstood the test of time. To what extent can we apply these definitions to a secure DBMS, enabling a proof of security? Section 3 gives security definitions for database and query indistinguishability based on the cryptographic concept of message indistinguishability. This leads to a troubling result: prior work in cryptography shows that a secure DBMS server meeting these definitions requires that the cost of every query be linear in the size of the database, making a secure DBMS impractical for real-world use.

Section 4 begins the real contribution of this paper: a slightly relaxed definition that gives probabilistic guarantees of security. For the data itself, security is equivalent to strong cryptographic definitions. An adversary tracing query execution could conceivably infer information over many queries, but the quality of the information decays exponentially: by the time enough queries have been seen to infer anything sensitive, the relationship between the early and later queries will have been broken, so the adversary will be unable to infer sensitive data. In Section 4.2 we show that for this definition, a secure DBMS server with reasonable performance could be constructed. The one caveat is that it requires the existence of a secure execution module: a way of running programs on the server that are hidden from the server.
We show how basic query processing operations (select, join, indexed search) can be implemented with a simple secure execution module supporting encryption, decryption, pseudo-random number generation, and comparison, and give sketches of how the operations could be proven to meet our definition of security. This paper addresses read-only queries (select-project-join); extension to insert/update is reasonable, but beyond the scope of the current work.

How do we get such a secure execution module? Program obfuscation, that is, providing a program whose execution reveals no information to the server, has been proven impossible for several classes of program [6]. We provide evidence that, if possible at all, it is unlikely to be efficient for a secure DBMS. Fortunately, there is another solution: implementing the secure execution module in tamper-proof hardware. Such hardware exists; one example meeting our requirements is IBM's [12]. Section 5 shows how to use such a module to implement a system safe from software-driven attacks; the specifications and evaluation of the hardware tell us the protection provided against electronic/physical attacks.
The cryptography community has developed solid and well-regarded definitions for securely encrypting a message. The result of this research is that secure encryption should hide any partial information about the data: one should not be able to distinguish between encryptions of two different messages of the same length. Consider the following scenario, where a stock trader sends buy or sell messages for a particular stock. An adversary knows messages are either buy or sell, but indistinguishability guarantees the adversary will have no clue which one was sent. A formal definition of indistinguishable private key encryption, from [8], is:

Definition 2.1 (Indistinguishability of encryptions [8]) An encryption scheme with efficient key generation algorithm $G$, efficient encryption algorithm $E$, and decryption algorithm $D$, where $D_k(E_k(x)) = x$ for secret key $k$, has indistinguishable private key encryption if for every polynomial-size circuit family $\{C_n\}$, every polynomial $p$, all sufficiently large $n$, and every $a, b \in \{0,1\}^{\mathrm{poly}(n)}$ of equal length:

$$\left| \Pr\{C_n(E_{G(1^n)}(a)) = 1\} - \Pr\{C_n(E_{G(1^n)}(b)) = 1\} \right| < \frac{1}{p(n)}$$
The probability in the above terms is taken over the internal coin tosses of algorithms $G$ and $E$. Intuitively, the definition states that any adversary with computing power comparable to the polynomial-size circuit family will not be able to predict whether a given ciphertext is the encrypted form of $a$ or $b$ significantly better than a random guess.

The same key is often used to encrypt many messages. The preceding definition ensures that no two messages can be distinguished, but it does allow the distribution of messages to be learned (e.g., an adversary could learn that 75% of orders were of one type and 25% were the other, even if it did not know which was buy and which was sell). For multiple messages, the cryptography community has a definition that ensures that even the distributions are not revealed. The formal definition from [8] is:

Definition 2.2 (Indistinguishability of encryptions: multiple-message case [8]) An encryption scheme with efficient key generation algorithm $G$, efficient encryption algorithm $E$, and decryption algorithm $D$, where $D_k(E_k(x)) = x$ for secret key $k$, has multi-message indistinguishable private key encryption if for every polynomial-size circuit family $\{C_n\}$, every polynomial $p$, all sufficiently large $n$, and every $\bar{a} = (a_1, a_2, \ldots, a_{t(n)})$, $\bar{b} = (b_1, b_2, \ldots, b_{t(n)})$ where each $a_i, b_i \in \{0,1\}^{\mathrm{poly}(n)}$, the following inequality holds:

$$\left| \Pr\{C_n(E_{G(1^n)}(\bar{a})) = 1\} - \Pr\{C_n(E_{G(1^n)}(\bar{b})) = 1\} \right| < \frac{1}{p(n)}$$

where $E_{G(1^n)}(\bar{a})$ is defined as $(E_{G(1^n)}(a_1), E_{G(1^n)}(a_2), \ldots, E_{G(1^n)}(a_{t(n)}))$.

Any encryption scheme that satisfies the above definition will hide the distribution of the transmitted encrypted messages. Our definition of securely encrypting a database is based on these definitions. The first captures general database encryption; the second is applicable to encryption of individual tuples or field values.

Fortunately, the cryptography community has a method for extending many ciphers providing message indistinguishability to provide multi-message indistinguishability. The counter-based CTR encryption mode given in Algorithm 1 enables a block cipher meeting Definition 2.1 to meet Definition 2.2. (It also supports messages longer than the block size of the cipher.) The idea is to choose a unique number (counter) for each block, and encrypt that unique number. The resulting encryption is then xor-ed with the actual message block. The encrypted message consists of the counter (in plain text) and the masked message blocks. Assuming the underlying block cipher is a pseudo-random permutation (DES is presumed to be such a permutation), this method satisfies Definition 2.2. The key idea is that two identical messages (or identical blocks in a message) will be xor-ed with different values, since the masking encryption will be from a different counter.
Algorithm 1 Counter-based CTR mode encryption

Encryption:
  Require: plain text $x$ and initial counter value $C$
  Divide $x$ into $n$ $k$-bit blocks $x_1, \ldots, x_n$
  for $i = 1$ to $n$ do
    $y_i = E_k(C + i) \oplus x_i$
  end for
  return $(C, y_1 y_2 \ldots y_n)$

Decryption:
  Require: ciphertext $y = (C, y_1 y_2 \ldots y_n)$
  for $i = 1$ to $n$ do
    $x_i = E_k(C + i) \oplus y_i$
  end for
  return $(x_1 x_2 \ldots x_n)$
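Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard library has no block cipher, so the hypothetical helper `prf_block` uses truncated HMAC-SHA256 as a stand-in for $E_k$ (CTR mode only ever evaluates $E_k$ in the forward direction), and the 16-byte block size and all function names are choices made for this sketch, not part of the original algorithm.

```python
import hashlib
import hmac

BLOCK_BYTES = 16  # block size k in bytes; an arbitrary choice for this sketch


def prf_block(key: bytes, counter: int) -> bytes:
    """Stand-in for E_k(counter): HMAC-SHA256 truncated to one block (assumption)."""
    msg = counter.to_bytes(16, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:BLOCK_BYTES]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_encrypt(key: bytes, counter: int, plaintext: bytes) -> tuple[int, bytes]:
    """Return (C, y_1 ... y_n) with y_i = E_k(C + i) XOR x_i, for i = 1..n."""
    if len(plaintext) % BLOCK_BYTES != 0:
        raise ValueError("plaintext must be a whole number of blocks")
    out = bytearray()
    for i in range(len(plaintext) // BLOCK_BYTES):
        block = plaintext[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES]
        # Block index i is zero-based here, so the counter used is C + (i + 1).
        out += xor_bytes(prf_block(key, counter + i + 1), block)
    return counter, bytes(out)


def ctr_decrypt(key: bytes, ciphertext: tuple[int, bytes]) -> bytes:
    """Recover x_i = E_k(C + i) XOR y_i."""
    counter, body = ciphertext
    # XOR is its own inverse, so decryption re-applies the same masks.
    return ctr_encrypt(key, counter, body)[1]
```

Encrypting a message made of two identical blocks yields two different ciphertext blocks, because each block is masked with the encryption of a different counter value. The caller is responsible for never reusing a counter value under the same key.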
Encryption schemes are defined to be secure if and only if the ciphertext reveals no information about the plaintext. We now use the security definitions from cryptography to define what it means to securely encrypt a database table and securely query the encrypted data.
3.1
As mentioned above, given any two pairs of ciphertexts and plaintexts of the same length, it must be infeasible to figure out which ciphertext goes with which plaintext. This means that any two database tables with the same schema and the same number of tuples must have indistinguishable encryptions. To be more precise, we now give a database-specific adaptation of the definitions stated in Section 2.

Definition 3.1 An encryption scheme $(G, E, D)$ for database tables, consisting of key generation scheme $G$, encryption function $E$, and decryption function $D$, has indistinguishable encryptions if for every polynomial-size circuit family $\{C_n\}$, every polynomial $p$, all sufficiently large $n$, and every pair of databases $R_1, R_2 \in \{0,1\}^{\mathrm{poly}(n)}$ with the same schema and the same number of tuples (i.e., $|R_1| = |R_2|$):

$$\left| \Pr\{C_n(E_{G(1^n)}(R_1)) = 1\} - \Pr\{C_n(E_{G(1^n)}(R_2)) = 1\} \right| < \frac{1}{p(n)}$$
The probability in the above terms is taken over the internal coin tosses of algorithms $G$ and $E$. This definition says that if we try to construct a polynomial circuit for distinguishing any given encrypted database table $R_1$ (i.e., the circuit will output one if the encrypted form belongs to $R_1$, else it will output zero), the circuit will have a success probability that is at most slightly better than a random guess. To clarify the meaning and impact of this definition, we give an example of a plausible but insecure encryption of a database table.

Example 3.1 Define an encryption scheme for a database as follows: $G$ randomly outputs a key for a particular block cipher, $E$ encrypts every field of each tuple with the same key using a block cipher algorithm (e.g., DES), and $D$ decrypts every field using the same cipher and key. Even though the given block cipher algorithm is secure, we can distinguish encryptions of the following two database tables with probability one. Assume that $R_1$ and $R_2$ have a schema $(a, b)$ where both $a$ and $b$ are $k$-bit numbers, and both have one tuple. If $R_1$ has a tuple $(x, x)$ and $R_2$ has a tuple $(y, z)$ with $y \neq z$, a simple (polynomial) circuit that compares the encrypted values of the first and second fields ($E(a) = E(b)$) will return true for the encrypted $R_1$ and false for the encrypted $R_2$, distinguishing the two.
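The distinguisher of Example 3.1 can be made concrete with a short Python sketch. The assumptions are explicit: `det_encrypt_field` is a hypothetical deterministic stand-in for a block cipher applied field by field (truncated HMAC-SHA256, which cannot be decrypted but is deterministic, and determinism is the only property the attack relies on); field values are taken to fit in 8 bytes.

```python
import hashlib
import hmac


def det_encrypt_field(key: bytes, value: int) -> bytes:
    """Deterministic stand-in for a block cipher on one k-bit field (assumption)."""
    return hmac.new(key, value.to_bytes(8, "big"), hashlib.sha256).digest()[:8]


def encrypt_table(key: bytes, table: list[tuple[int, int]]) -> list[tuple[bytes, bytes]]:
    """Encrypt every field of every tuple independently with the same key."""
    return [(det_encrypt_field(key, a), det_encrypt_field(key, b)) for a, b in table]


def distinguisher(encrypted_table: list[tuple[bytes, bytes]]) -> int:
    """The circuit C_n: output 1 iff E(a) == E(b) in the single tuple."""
    ea, eb = encrypted_table[0]
    return 1 if ea == eb else 0
```

Running `distinguisher` on the encryption of a table holding $(x, x)$ returns 1, while the encryption of a table holding $(y, z)$ with $y \neq z$ returns 0, so the two tables are told apart without knowledge of the key.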
3.2
While one solution to the problem of Example 3.1 is to simply encrypt the entire database as a single message, this would prevent any meaningful query processing (the entire encrypted database would have to be returned to the querier to enable decryption). Fortunately, we can use counter-based CTR mode to meet Definition 3.1 while still encrypting the individual fields in a tuple independently. The idea is that each tuple consists of a counter and encrypted fields, as described in Section 2. Figure 1 shows an example tuple encrypted in this manner. Since identical field values will now be xor-ed with different values, the fact that they are identical (or any other relationship between them) will be hidden, alleviating the problem of Example 3.1 and in fact meeting Definition 3.1.

[Figure 1: Encrypted tuple structure. Columns: CTR counter ($C$), Attribute 1 ($E(C)$), Attribute 2 ($E(C+1)$), ...]
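A minimal Python sketch of the tuple layout of Figure 1 follows. It is an illustration under assumptions, not the paper's implementation: the hypothetical `mask` function uses truncated HMAC-SHA256 in place of the block cipher $E$, fields are 8-byte integers, field $j$ (counting from zero) is masked with $E(C + j)$, and the caller is assumed to give each tuple a counter range that no other tuple uses.

```python
import hashlib
import hmac

FIELD_BYTES = 8  # field width; an arbitrary choice for this sketch


def mask(key: bytes, counter: int) -> bytes:
    """Stand-in for E(counter), truncated to the field width (assumption)."""
    digest = hmac.new(key, counter.to_bytes(16, "big"), hashlib.sha256).digest()
    return digest[:FIELD_BYTES]


def encrypt_tuple(key: bytes, counter: int, fields: list[int]) -> tuple[int, list[bytes]]:
    """Keep the counter in the clear; XOR field j with E(counter + j)."""
    enc = []
    for j, value in enumerate(fields):
        plain = value.to_bytes(FIELD_BYTES, "big")
        enc.append(bytes(p ^ m for p, m in zip(plain, mask(key, counter + j))))
    return counter, enc


def decrypt_tuple(key: bytes, encrypted: tuple[int, list[bytes]]) -> list[int]:
    """Undo the masking field by field using the stored counter."""
    counter, enc = encrypted
    return [
        int.from_bytes(bytes(c ^ m for c, m in zip(ct, mask(key, counter + j))), "big")
        for j, ct in enumerate(enc)
    ]
```

With this layout the tuple $(x, x)$ from Example 3.1 no longer produces two equal ciphertext fields, so the equality-testing distinguisher has nothing to compare.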
3.3
Much of database research has concentrated on efficient processing of queries. We would like to maintain this efficiency even if the data is encrypted. Prior proposals for querying encrypted data do not meet Definition 3.1 if an adversary is allowed to view data access patterns; this will be discussed in Section 6. This is not just a problem of poor use of encryption. What we really need to ensure is that not only is the encrypted database itself secure, but that the act of processing queries against the database does not reveal information. Unfortunately, achieving such security is at odds with efficient query processing. We now give a definition of secure database querying based on a model from the cryptography community, and show that the only way to meet this strict definition is to access the entire database for each query. In Section 4.1 we will build on these definitions to give a slightly weaker (but still semantically meaningful) definition supporting more efficient queries. In our current discussion, we assume that data resides on a single server and do not consider potential gains due to replicated data.
3.4
We now give an alternative definition based on comparison of queries. We still require that tuples be indistinguishable (Definition 3.1), and also require that two queries be indistinguishable (e.g., the queries are encrypted). The idea is that if we cannot tell tuples or queries apart, we do not really gain information from processing the queries. Unfortunately, this leads us to a result where a full table scan is required.

The definition comes from Private Information Retrieval (PIR), which protects the query from disclosure: the server knows the data, but should learn nothing about the query [7]. A PIR server must maintain query privacy, and ensure that the query issuer gets the correct result. Why do we want the privacy of the user query to be protected? The problem is that if the server knows the query, knowing just the size of the result reveals information about the database. For example, if the server knows that $\sigma_{R.a1=300}(R)$ returns three tuples, then the server learns the a1 fields of those tuples. One important thing to note is that we should only require query indistinguishability for queries that have the same result size. Otherwise we would need to set an upper bound on query result size (the entire database if we want to support full SQL) and transmit that much data for every query, since the actual result size would otherwise distinguish queries. We now formally define the correctness and privacy requirements described above.

Definition 3.2 (Correctness) Assume database $D$ is stored securely on a server w.r.t. Definition 3.1. Let $E(D)$ be the securely encrypted database and let $Q$ be a query issued on the database. A query execution is said to be correct if, given $(Q, E(D))$, an honest server provides a result enabling the query issuer to learn $Q(D)$.

The correctness definition implies that if the server follows the protocol, the query issuer will get the correct result.
Privacy must hold even for a dishonest server:

Definition 3.3 (Privacy) For every query pair $Q_i$, $Q_j$ that run on the same set of tables over $D$ and have the same size results, the messages $m_{Q_i}$, $m_{Q_j}$ sent for executing the queries are computationally indistinguishable if for every polynomial-size circuit family $\{C_n\}$, every polynomial $p$, all sufficiently large $n$, and $m_{Q_i}, m_{Q_j} \in \{0,1\}^{\mathrm{poly}(n)}$:

$$\left| \Pr\{C_n(m_{Q_i}) = 1\} - \Pr\{C_n(m_{Q_j}) = 1\} \right| < \frac{1}{p(n)}$$
The probability in the above terms is taken over the internal coin tosses of the query issuer and the server. This privacy definition implies that whatever the server tries to do, it will not be able to distinguish between two different queries run on the same set of tables and returning the same size results. For example, if $Q_1 = \sigma_{a1=300}(R)$ returns 100 tuples and $Q_2 = \sigma_{a1=100}(R)$ returns 100 tuples, there is no way for the server to predict which of the two is executed more effectively than a random guess. We can define a secure query execution as one that runs on securely encrypted data and satisfies Definitions 3.2 and 3.3.

We will show that even for queries that are running on a single table, we need to scan the entire table. We first prove that, given a set of queries on a particular table with result size $t$, if there exists a query that must access at least $v$ tuples, then we can distinguish it from a query that occasionally accesses fewer than $v$ tuples. Second, we show that for any admissible query result size $t$, there exists a query which requires a scan of the entire database.

Lemma 3.1 Let $S_t$ be the set of queries that run on table $R$ with result size $t$, and assume that there exists a query $Q_t^1 \in S_t$ that needs to access at least $v$ tuples for correct evaluation. Let $Q_t^2$ be an element of $S_t$ that needs to access at most $v - 1$ tuples with probability greater than $\frac{1}{p(n)}$. Then there exists a polynomial-size circuit family $\{C_n\}$ that can distinguish them with non-negligible probability.

Proof. We define $C_n$ as follows. Given the messages exchanged during the execution of the query, the circuit counts the number of tuples accessed. If it is at least $v$, $C_n$ outputs 1; otherwise it outputs 0. Note that $C_n$ only does simple counting, and is therefore polynomial in the input size. Now let us calculate the probability $P = \left| \Pr\{C_n(m_{Q_t^1}) = 1\} - \Pr\{C_n(m_{Q_t^2}) = 1\} \right|$:

$$\begin{aligned}
P &= \left| \Pr\{C_n(m_{Q_t^1}) = 1\} - \Pr\{C_n(m_{Q_t^2}) = 1\} \right| \\
  &= \left| 1 - \Pr\{C_n(m_{Q_t^2}) = 1\} \right| \\
  &= \left| 1 - \Pr\{\text{more than } v - 1 \text{ tuples accessed}\} \right| \\
  &> \left| 1 - \left(1 - \frac{1}{p(n)}\right) \right| \\
  &= \frac{1}{p(n)}
\end{aligned}$$

Again, note that the probability is taken over the internal coin tosses of the query issuer and the server; it does not depend on the database values. Since $P$ is greater than $\frac{1}{p(n)}$, we can conclude that $C_n$ distinguishes the above queries with non-negligible probability.

We now show that the queries needed by the above definition exist.

Lemma 3.2 For any given result size $t$, there exists a query that needs to access the entire table.

Proof. Since the result must be encrypted to preserve security (otherwise all queries would have to return the same result to avoid being distinguished), the result set size must be a multiple of the cipher block size $k$, up to the size of the table. Let $R$ have $n$ tuples with $a$ attributes blocked into $u$ blocks of size $k$ as defined in Section 5.1.1. Here, without loss of generality, we assume that each attribute is $k$ bits long, so $u$ is equal to $a$. Assume that the id field added to the database is also $k$ bits long. Then for each admissible size $t$, where $t$ is a multiple of $k$ and less than $k \cdot n \cdot a$, we can define a query that needs to access the entire database as follows.
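[Equation garbled in the source: the definition of the query $Q_t$. The surviving fragments are $Q_t$, $1$, $=$, a sum starting at $i = 1$, and the terms $t$ and $kn$.]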
The above query simply computes the average of a single attribute, to make sure that the query needs to access the entire table, and pads the result set to make sure that the result size is $t$. (Since we have not specified a value for $k$, this generalizes to any block size, including 1.)
Using the above lemmas, we can now prove the following:

Theorem 3.1 A query execution that is secure in the sense of Definitions 3.2 and 3.3, even for queries known to access a particular database table, must scan the entire database table non-negligibly often.

Proof. For the set $S_t$ of queries returning a result of size $t$, at least one must require full table access (a construction is given in Lemma 3.2); if not, then not all queries would satisfy the correctness Definition 3.2. We can now build a distinguisher for any query that requires less than full table access (formal proof in Lemma 3.1). Since at least one query in $S_t$ requires full table access, if any query requires less than full access a non-negligible portion of the time, the distinguisher will be able to tell the two apart. Such a distinguisher contradicts Definition 3.3.
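The counting distinguisher used in Lemma 3.1 and Theorem 3.1 can be simulated directly. The Python sketch below is illustrative only: the function names are hypothetical, the first query is modeled as always touching exactly $v$ tuples, and the second as touching $v - 1$ tuples with a fixed probability `q` (playing the role of the lemma's "greater than $1/p(n)$" probability). The estimated advantage should then be close to `q`.

```python
import random


def count_distinguisher(accessed: int, v: int) -> int:
    """C_n: output 1 iff at least v tuples were accessed."""
    return 1 if accessed >= v else 0


def estimate_advantage(v: int, q: float, trials: int, seed: int = 0) -> float:
    """Estimate |Pr{C_n(m_Q1) = 1} - Pr{C_n(m_Q2) = 1}| by simulation."""
    rng = random.Random(seed)
    ones_q1 = 0
    ones_q2 = 0
    for _ in range(trials):
        # Q1 must touch at least v tuples on every run.
        ones_q1 += count_distinguisher(v, v)
        # Q2 gets away with only v - 1 tuples with probability q.
        accessed = v - 1 if rng.random() < q else v
        ones_q2 += count_distinguisher(accessed, v)
    return abs(ones_q1 - ones_q2) / trials
```

For instance, `estimate_advantage(v=100, q=0.25, trials=20000)` comes out near 0.25, which is far from negligible; only `q` close to zero, meaning the second query almost always scans as many tuples as the first, defeats the distinguisher.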
3.5
More generally, the cryptography community has produced the concept of oblivious RAM [10]: a method to obscure the program even from someone watching the memory access patterns during execution. In their main result, they show that if a program and its input with total size $y$ uses memory of size $m$ and has running time $t$, then it can be simulated using $m(\log_2 m)^2$ memory in running time $O(t(\log_2 t)^3)$ without revealing the memory access patterns of the original program (assuming $t > y$). In other words, they provide a solution such that the distribution of memory accesses does not depend on the input. This implies that execution of queries can be made indistinguishable if they access the same number of tuples and have the same result size.

Unfortunately, even under this relaxation, we will not achieve much improvement in terms of efficiency. They show that the lower bound on the oblivious simulation cost is $\max\{y, \Omega(t \log t)\}$. In their model, the input $y$ includes everything to be protected, including the program and data. The database would be modeled as part of the program, so the size of the database and the program will be a lower bound on the number of memory accesses. This still implies a full table scan. At this point, we would like to stress that we are considering running a query in isolation; batching queries could improve throughput (a full scan for each batch), but would prevent effective ad-hoc or interactive querying.
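To get a feel for these bounds, the following Python sketch evaluates them for concrete sizes. It is a rough illustration only: the function names are hypothetical, and the constants hidden by the $O(\cdot)$ and $\Omega(\cdot)$ notation are simply taken to be 1.

```python
import math


def oblivious_ram_costs(m: int, t: int) -> tuple[float, float]:
    """Return (memory, running time) of the simulation, constants ignored."""
    memory = m * math.log2(m) ** 2
    running_time = t * math.log2(t) ** 3
    return memory, running_time


def simulation_lower_bound(y: int, t: int) -> float:
    """max{y, t log t} with the hidden constant taken as 1."""
    return max(y, t * math.log2(t))
```

With $m = t = 1024$ the simulation uses about $1024 \cdot 10^2$ memory cells and $1024 \cdot 10^3$ steps. More to the point, for a database of $y = 2^{20}$ items and a query that would otherwise take only $t = 16$ steps, the lower bound is still $2^{20}$: the cost is dominated by the size of the database, which is the full-scan behaviour described above.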
In the previous section, we showed that any strict security and privacy requirement forces us to scan entire database tables. The main problem with the previous definitions is that they try to preserve indistinguishability even if a server can look at tuple access patterns. What we need is a definition that allows revealing the access patterns for a tuple, enabling more efficient query processing.
4.1
Definition
If the data and queries are encrypted, and the encryption satisfies multiple-message indistinguishability (e.g., Definition 3.1), then the ability to distinguish between queries or tuples carries little information, especially if the ability to trace tuple access between queries is limited. Using this observation, we give a new definition that guarantees some level of privacy while allowing a higher degree of efficiency than the previous examples. First, we define a minimum set of support tuples for each query: the tuples that must be accessed to compute the query results. We then only apply query indistinguishability to queries that have the same support tuple set.

Definition 4.1 (Min support set) Let query $Q$ be defined on tables $R_1, R_2, \ldots, R_n$. Let $T$ be the set of elements in $R_1 \times R_2 \times \ldots \times R_n$. A set $S \subseteq (R_1 \times R_2 \times \ldots \times R_n)$ is a min support set for $Q$ if $Q(S) = Q(R_1 \times R_2 \times \ldots \times R_n)$, and $S$ is the smallest such set for which this is true.

Example 4.1 Assume we have two tables: $R_1(a1, a2) = \{(1, a), (2, b), (3, c)\}$ and $R_2(a1, a2) = \{(1, 2), (2, 3), (3, 4)\}$. Let $Q = \pi_{R_1.a1}(\sigma_{R_1.a1 = R_2.a1}(T))$. $Q(R_1 \times R_2)$ returns the same result as $Q(\{(1, a, 1, 2), (2, b, 2, 3), (3, c, 3, 4)\})$, and these three tuples are the smallest such set.

Using this, we can now give a definition that ensures nothing is disclosed by watching query processing except the size of the result and what tuples were processed in arriving at the result.
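Example 4.1 can be checked by brute force. The Python sketch below is an illustration under assumptions: the join condition is taken to be $R_1.a1 = R_2.a1$ (the reading under which the three listed tuples form the support set), tuples of the cross product are represented as concatenated Python tuples, and the exhaustive subset search is only feasible for toy tables. In general a min support set need not be unique; this sketch returns the first one it finds.

```python
from itertools import combinations, product

R1 = [(1, "a"), (2, "b"), (3, "c")]
R2 = [(1, 2), (2, 3), (3, 4)]


def query(tuples):
    """pi_{R1.a1}(sigma_{R1.a1 = R2.a1}(tuples)) over concatenated tuples."""
    # row = (R1.a1, R1.a2, R2.a1, R2.a2)
    return {row[0] for row in tuples if row[0] == row[2]}


def min_support_set(cross):
    """Smallest subset S of the cross product with query(S) == query(cross)."""
    target = query(cross)
    for size in range(len(cross) + 1):
        for subset in combinations(cross, size):
            if query(subset) == target:
                return set(subset)
    raise AssertionError("unreachable: the full cross product always qualifies")


cross_product = [r1 + r2 for r1, r2 in product(R1, R2)]
```

Calling `min_support_set(cross_product)` searches subsets in order of increasing size and stops at the three matching tuples, since no smaller subset can reproduce all three result values.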
Definition 4.2 (Query Indistinguishability) For every query pair $Q_i$, $Q_j$ on the same set of tables, with the same result size and min support set, the messages $m_{Q_i}$, $m_{Q_j}$ sent for executing the queries are computationally indistinguishable if for every polynomial-size circuit family $\{C_n\}$, every polynomial $p$, all sufficiently large $n$, and $m_{Q_i}, m_{Q_j} \in \{0,1\}^{\mathrm{poly}(n)}$:

$$\left| \Pr\{C_n(m_{Q_i}) = 1\} - \Pr\{C_n(m_{Q_j}) = 1\} \right| < \frac{1}{p(n)}$$
This, combined with Definition 3.1, guarantees that all an adversary can do is trace the tuples accessed during query execution, and possibly relate that to result size. As this could disclose information over the course of many queries, we also give the following definition, requiring that the confidence in tracing tuples drops over time:

Definition 4.3 (Three Card Monte Secure) A database is $c$-secure if, given a query $Q$ with min support set $T$, the probability that a server trying to track $t \in T$ can do so correctly is $< \frac{1}{c(k+1)} + \frac{|T|}{|DB|}$, where $k$ is the number of times $t$ has been accessed since completion of $Q$.

The key to this definition is that an adversary's confidence that they know which tuples $Q$ accessed will decrease over time. (Formal proof of the efficacy of this definition of security is beyond the scope of this paper.) With high probability, any useful information inferred from tracking tuple access will be incorrect.

Definition 4.4 We consider a database to support secure query processing if it meets Definitions 3.1, 3.2, 4.2, and 4.3.

We now describe how to construct a database server meeting these definitions.
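The bound in Definition 4.3 is easy to evaluate for concrete parameters. The Python sketch below is a worked illustration; the function name and the sample values are assumptions chosen for the example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def tracking_bound(c: int, k: int, support_size: int, db_size: int) -> Fraction:
    """Upper bound 1/(c(k+1)) + |T|/|DB| on the server's tracking probability."""
    return Fraction(1, c * (k + 1)) + Fraction(support_size, db_size)
```

For $c = 4$, $|T| = 1$, and $|DB| = 1000$, the bound is $251/1000$ immediately after the query ($k = 0$) and falls to $13/500$ after the tuple has been accessed nine more times ($k = 9$). As $k$ grows, the bound approaches the floor $|T|/|DB|$.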
4.2
Methods that allow equality tests on encrypted tuples, or on field values in the tuples, violate Definition 4.4 because tuples can be distinguished. The problem is that if the tuples are truly indistinguishable, the server will be unable to do any query processing beyond sending the entire table to the client: any meaningful query processing requires distinguishing between tuples. If the tuples can be distinguished, then they can be tracked over multiple queries, disclosing information in violation of Definition 4.3.

However, if we support a few simple operations that are hidden from the server, we can meet Definition 4.4. The key idea is that operations that must distinguish between tuples (e.g., comparing a tuple with a selection criterion) occur by decrypting and evaluating a tuple in a manner invisible to the server. The tuples accessed are then re-encrypted and written back to the database, but not necessarily in the same order. This prevents the server from reliably tracking the tuples accessed across multiple queries. We do not want to send tuples back to the querier to do this. However, assume the existence of a module capable of the following:

1. decrypt tuples,
2. perform functions on two tuples,
3. maintain simple (constant-size) history for performing aggregate functions,
4. generate a new tuple as a function of the inputs,
5. maintain a constant-size store of tuples, and
6. perform a counter-based CTR mode encryption of the new tuple.

The module may return an (encrypted) tuple to write back into the location most recently read from, but this is not necessarily the most recently read tuple (making tracking difficult). (Such swapping was proposed for PIR in [5]; here we amortize the cost as opposed to periodically shuffling offline.) It also optionally returns a tuple that becomes part of the result. The module also returns the address of the next tuple to be retrieved.

Assuming such a module can perform these operations while obscuring its actions and intermediate results from the server, we can construct a machine meeting Definition 4.4. The idea is that the database is encrypted as in Section 3.2. An encrypted catalog (in a known location) contains pointers to the first tuple in each table or index. The secure module decrypts the query, reads the catalog to get the location of the first tuple of the relevant tables/indexes, then begins processing.
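One possible reading of the write-back behaviour described above is sketched below in Python. Everything here is an assumption made for illustration: the class and method names are hypothetical, tuples are shown in plaintext where the real module would decrypt on read and re-encrypt under a fresh counter on write, `None` stands for a dummy tuple written while the store is still filling, and Python's `random` module stands in for the module's pseudo-random number generator.

```python
import random


class SwapModule:
    """Toy stand-in for the secure module's constant-size tuple store."""

    def __init__(self, store_size: int, seed: int = 0):
        self.store: list = []
        self.store_size = store_size
        self.rng = random.Random(seed)  # stands in for the module's PRNG

    def access(self, db: list, address: int) -> None:
        """Read db[address], then write back a tuple picked from the store."""
        incoming = db[address]  # would be decrypted inside the module
        self.store.append(incoming)
        if len(self.store) > self.store_size:
            # The evicted tuple may or may not be the one just read.
            evicted = self.store.pop(self.rng.randrange(len(self.store)))
        else:
            evicted = None  # dummy tuple until the store has filled
        db[address] = evicted  # would be re-encrypted under a fresh counter
```

Because the tuple written back is drawn at random from the hidden store, the server sees a read and a write at the same address but cannot tell which tuple now lives there. No tuple is lost: every value is either in the database or in the module's store.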
We first show how individual relational operations can be securely performed using the above module. We give a sketch of the proof of security of each using a simulation argument (as used in Secure Multiparty Computation [9]): the idea is that given the results (min support set and result size), the server is able to simulate the actions of the secure module. If it is able to do so, then all queries on that set and result size must be indistinguishable from the simulator, and thus indistinguishable from each other. (These are sketches; full details require probabilistic simulation proofs to meet Definition 4.3.) We will then discuss composing operations to perform complex queries.

Selection makes use of the fact that we have some memory hidden from the server (adversary). The secure module keeps the results until the local memory is partially filled. At this point, after each new tuple is read, one of the cached result tuples may be output to the server. This decision is a random choice, with the probability based on the estimated size of the results relative to the estimated number of tuples read. Formally, assume that the estimated number of tuples that need to be read to execute the query is $t$, the estimated result size is $r$, and the local memory size is $m$. The secure module reads the first $(t/r) \cdot (m/2)$ tuples, caching the results in local memory. At this point, for every tuple read, with probability $r/t$ one of the cached result tuples is given to the server. When the query is complete, the remaining cached tuples are given to the server for delivery to the client.

Theorem 4.1 Provided that tuples contributing to the result are (approximately) uniformly distributed across all tuples read, the above process meets Definition 4.4 for full table scan selections.

Proof Sketch. Using a simulation argument, we assume the simulator for the server is given $t$ and $r$ (since these will be known at the end of the query); $m$ is public knowledge. The simulator can thus compute $(t/r) \cdot (m/2)$. After this many tuples have been read, the simulator begins creating result tuples.
Since the tuples are encrypted using pseudo-random encryption, the simulator just uses a counter and an appropriate-length random string of bits to simulate a tuple. By arguments on the strength of encryption, the simulated output tuples and re-encrypted tuples are computationally indistinguishable from the real execution. After each tuple is read, a simulated result tuple is created with probability $r/t$. When all tuples have been read, the simulator creates the remaining result tuples (so the total is $r$). Since the result tuples can be simulated using this approach, and the simulator decides when to create the result tuple in exactly the same fashion as the real algorithm decides when to output a result tuple, the simulator is (computationally) indistinguishable from the actual selection. This shows that it meets Definition 4.2.

Definition 4.3 is more difficult. This relies on the assumption of approximately uniform distribution. Because of this, the a-priori probability that a given tuple is in the first $t/r$ tuples is high, so little information is revealed by disclosing that the first result occurs in the first $(t/r) \cdot (m/2)$ tuples.

This approach does fail when the tuples contributing to the result are unevenly distributed among the tuples read. For example, if none of the first $(t/r) \cdot (m/2)$ tuples cause a result tuple to be generated, the algorithm will be unable to begin outputting result tuples on schedule. Thus the server can make an improved estimate of the probability that a tuple contributes to the result. In the worst case (e.g., only the last $r$ tuples contribute to the result), this probability approaches 1. Queries that generate most results based only on the first tuples read are unlikely. Queries that generate results only after reading most or all of the tuples are more common: aggregation, indexed search. However, these queries will generally return a small number of results. If $r \le m/2$, the secure coprocessor will not be expected to produce results until all tuples are read, so Theorem 4.1 holds. Queries where the results are highly skewed should be processed using an indexed selection anyway (to efficiently access only the desired tuples).

Indexed Selection can be done using a method developed for oblivious access to XML trees [14]. Nodes are swapped, re-encrypted, and written back to the tree. The key idea is that each time a node is read, $c - 1$ additional nodes are read, one of which is known to be empty. All the nodes are re-encrypted and written, with the target written into the empty node. When the nodes are written, the original is written into the empty node.
This proceeds in levels: the first two levels are read, the location of the second-level empty node is determined, the parent is updated to point to the previous empty node, and the first level is written. Then the third level is read, the second-level parent is updated, and so on.

Theorem 4.2 The algorithm of [14] satisfies Definition 4.4.

Proof Sketch. Definition 4.2 is satisfied because queries with the same min support set will follow the same path to the same leaf. The random choices of the c − 1 additional nodes come from the same distribution, and are thus indistinguishable. Likewise, encryption and rewriting are indistinguishable by arguments based on the strength of encryption.

Definition 4.3 is satisfied because of the swapping. Each time a node is accessed, it is placed in a new location. However, since c locations have been read and written, and are indistinguishable to the server, the probability that the server can pick which of the c locations the node is in is 1/c. The next time the node is read, it is again placed in one of c locations, and the server does not know which. The best the server can now do is guess that it is in one of 2c locations. (Accesses to others of the original 2c locations may confuse the server, causing it to guess among more than 2c locations, but we are guaranteed at least 2c.) This continues, with each access to the tuple adding another c candidate locations to the server's best guess, giving our 1/(c(k + 1)) target. The only problem is that the randomly chosen set of masking locations may include locations previously used. This is inherent in a finite database: the best we can do is 1/|DB|. This is the reasoning behind the |T|/|DB| floor factor in Definition 4.3.
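The bound argued in the proof sketch above can be made concrete with a small calculation. The following Python sketch is illustrative only: the function name and the parameter `db_size` are hypothetical, and it assumes the single-tuple case, where the floor is 1/|DB| rather than the general |T|/|DB| factor of Definition 4.3.

```python
from fractions import Fraction


def best_guess_bound(c: int, k: int, db_size: int) -> Fraction:
    """Upper bound on the server's chance of locating a node.

    c       -- locations read and rewritten per access (target plus c - 1 masks)
    k       -- number of accesses to the node after the first one
    db_size -- total number of locations, |DB| (hypothetical parameter name)
    """
    if c < 1 or k < 0 or db_size < 1:
        raise ValueError("c and db_size must be positive, k non-negative")
    # Each access spreads the node over c more candidate locations ...
    candidates = c * (k + 1)
    # ... but a finite database can never offer more than |DB| candidates.
    return Fraction(1, min(candidates, db_size))
```

With c = 4, the bound starts at 1/4 after the first access and shrinks with every further access until it reaches the 1/|DB| floor.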
This analysis is based on a query returning a single tuple. Extension to range queries is straightforward.

Projection is straightforward. The comparison function simply returns the encrypted projection of the tuple, E(π(tuple)), rather than E(tuple). Knowing the length of a projection from the encrypted result, the simulator can randomly generate an equivalent-length string that is computationally indistinguishable from the real encrypted result. In particular, note that a null projection (e.g., select * from table) is indistinguishable from any other projection producing tuples of the same size; the only way to distinguish selection from projection is the fact that some tuples are removed by a selection.

Join can be either repeated full-table scan selection (nested loop join) or indexed selection (index join). To perform a join, the module first requests a tuple from one table, then from the second table. Both are decrypted, the join criterion is checked, and if it is met the joined tuple is stored for output. Assuming a reasonably uniform distribution of tuples meeting the join criterion, or a small number of tuples meeting it, the proof follows that of Theorem 4.1. A similar argument holds for an index join. Again, we need a reasonably uniform distribution of tuples meeting the join criterion. The swapping in the index search prevents too much tracking between tuples, and caching the results allows the resulting tuples to be output at a constant rate.

Set operations are straightforward, except for duplicate elimination. Union is simply two selections. Intersection is a join. Set difference is again similar to a join, but output occurs only if, after completion of a loop (or index search), a joining tuple is not found. Duplicate elimination could reveal equality of two tuples. This is more than simply "does it contribute to the result", and thus violates Definition 4.4. One solution is to replace duplicates with an encrypted dummy tuple. The client thus gets a correct result by ignoring the dummy tuples, at the cost of increased size of the result.
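The dummy-tuple idea for duplicate elimination can be illustrated with a minimal Python sketch. It works on plaintext values and uses a hypothetical sentinel `DUMMY` in place of an encrypted dummy tuple; in the setting described here every output would be re-encrypted, so the server could not tell dummies from real tuples. The function names are assumptions, as is the premise that the hidden environment has enough memory to remember the tuples already seen.

```python
DUMMY = object()  # hypothetical sentinel standing in for an encrypted dummy tuple


def eliminate_duplicates(tuples):
    """Hidden side: replace repeats with DUMMY, keeping the output length fixed."""
    seen = set()
    out = []
    for t in tuples:
        if t in seen:
            out.append(DUMMY)  # one output per input, so equality is not revealed
        else:
            seen.add(t)
            out.append(t)
    return out


def client_filter(result):
    """Client side: drop the dummy tuples after decryption."""
    return [t for t in result if t is not DUMMY]
```

The output always has as many entries as the input, which is the "increased size of the result" cost mentioned above.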
4.3 Discussion
Real query processing requires combining these methods to form a query tree/plan. A simple approach would reveal the query tree and plan to the server. At first glance, this seems excessive. However, simply given the access patterns it is often possible to make a good guess as to the query plan: if two tables are accessed (e.g., tuples of different sizes), it is probably a join; a table being accessed and returning fewer tuples is a selection; logarithmic access frequencies represent a tree-based index. Rather than trying to pretend such information is hidden, we suggest explicitly revealing it. This allows the server to perform rule-based query optimization, prefetching, and likely other types of performance enhancements. We believe that meaningful improvements in security either require a query processor that can hide substantial (and non-constant) intermediate state from the server, or run afoul of the problems of Section 3. Any method with constant-space hidden storage will either require a full table scan for all queries, or will reveal information equivalent to the above for some sequences of queries.
The implementation just described depends on the ability to execute a simple function on the server, without the server seeing the execution. This problem is known as program obfuscation, and has been the object of some study. The results are not encouraging. General program obfuscation has been shown to be impossible for large classes of functionalities [6]. While we have not proven that such an obfuscated program is impossible for our decrypt, compare, and re-encrypt function, if it is possible at all it is likely to pose a high computational cost.

Fortunately, there is an alternative: tamper-resistant hardware. This enables execution of a function while hiding the execution from the server, as needed to meet our definition. This has been proposed as a solution to Private Information Retrieval [5]; we show how it also applies to general database query processing. In addition, such hardware already exists. One example is the IBM 4758 Cryptographic Coprocessor [12]. This is a single-board computer consisting of a CPU, memory, and special-purpose cryptographic hardware contained in a tamper-resistant shell, certified to level 4 under FIPS PUB 140-1. When installed in the server, it is capable of performing local computations that are completely hidden from the server; tampering is detected and clears internal memory.

How does this solve our problem? Using public key cryptography, the client can verify that the server contains an approved tamper-proof coprocessor, and provide the coprocessor with the key for decrypting the tuples in the database. The client has a key for the coprocessor, allowing it to clear the coprocessor and load the key and program to be executed. The coprocessor can now perform the decrypt, compare, and re-encrypt function. Any attempt by the server to take control of (or tamper with) the coprocessor, either by software or physically, will clear the coprocessor, thus eliminating any decrypted view of the tuples. (The same holds true of another client executing a query; taking control of the coprocessor clears old information.) Further details on using a tamper-proof coprocessor to securely meet our requirements are given below.
5.1
Care must be taken to ensure that a tamper-proof coprocessor is in fact used in a secure way. The basic process of setup for running a query is as follows.

1. The client creates a query execution program that includes the database key, and encrypts it with the public key of the coprocessor. This query execution program is fixed, and may be stored at the server (use of a checksum, along with the fact that it is encrypted, prevents the server from tampering with the program).
2. The client creates a query (including a checksum to prevent tampering), encrypts it with the database key stored in the (encrypted) query execution program, and sends it to the server.
3. The server delivers the query execution program to the secure coprocessor and instructs it to reset; the coprocessor then decrypts the program with its private key and executes it. This particular private key can only be used as part of such a reset, decrypt, and load command, ensuring that the database key cannot be decrypted unless the accompanying program for query execution is run.
4. The server provides the query to the coprocessor, which decrypts it and verifies the checksum.
5. The coprocessor now begins requesting tuples from the server. As each is received, it is decrypted, compared with the appropriate query terms, and any result is re-encrypted before being returned to the server (as described in Section 4.2).
6. On query completion, the coprocessor sends a done message to the server and resets (clearing its memory).
7. The server returns the results to the client, which decrypts them with its database key.

The key to the security of this protocol is that the server never sees the key used to encrypt the database and queries, or any data that is not encrypted with that key. The database key is stored in two places: at the client¹, and at the server, but encrypted with the coprocessor's public key. The only way the server-stored key can be decrypted is after resetting the coprocessor, and while executing the (client-provided and verified) program that executes queries. Thus the database key is only accessible to code provided by the client, running in an environment that is not visible to the server. While the secure coprocessor is somewhat at the mercy of the server, the only way the server can modify the code executed by the coprocessor is by resetting and reloading the code, which clears the coprocessor (including the database key). Any attempt to modify the encrypted code will prevent validation of the code, and the coprocessor will refuse to run the code. Thus the coprocessor can be guaranteed to securely provide the functionality required in Section 4.2.

¹ The client is presumed to be trusted. In practice, key management could be handled with a secure subsystem such as the Trusted Computing Group chip [17].

Note that there is nothing to prevent multiple clients from sharing the same server; the only limitation is that only one may use the secure coprocessor at any given time. This poses some interesting challenges for concurrency control, but these are beyond the scope of this paper.

The tamper-proof hardware solution has other advantages. Since secure coprocessors are typically designed for use in settings where data is encrypted, they will typically contain special-purpose encryption/decryption hardware, providing improved performance. For example, the IBM 4758 includes hardware support for DES, modular math to support public-key encryption, and random number generation. Raw throughput can reach 23.5MB/sec for DES.

5.1.1 Encryption Optimizations

Table 1: Original database
a1  a2  a3
b1  b2  b3
c1  c2  c3
In practical terms, each encrypted value needs to be of the given block cipher size. We can easily combine, split, or pad attributes to achieve this, depending on our requirements. For example, if each attribute is 32 bits long and the cipher operates on 64-bit blocks, we can encrypt them pair by pair and pad the last block with zeros. Let u be the minimum number of blocks needed to encrypt each tuple. We encrypt the j-th block $B_{ij}$ of the i-th tuple as $E_K(i \cdot u + j) \oplus B_{ij}$, where K is the encryption key. Since tuples will be processed alone, we add a new field to store $i \cdot u$ in every tuple. For example, given the original database in Table 1, where each attribute is 32 bits, the encrypted table using a 64-bit block cipher is shown in Table 2. Here $\Vert$ denotes the concatenation operation and $0^{32}$ denotes a 32-bit string of all zeros. It is clear from the example that, given the key, we can decrypt any part of the table independently. Another advantage is that we can decrypt counters in advance: assigning the counters in order is not a problem (they just need to be unique), so it may be possible to guess (and decrypt) counters in parallel with a block of tuples being retrieved from disk.

It has been suggested that decrypting data as part of a query would pose an unreasonable cost [15]. Based on benchmarks published by IBM [13], simple decryption of a single tuple would take several milliseconds. In practice, a system would probably encrypt/decrypt at the page rather than the tuple level (assuming sufficient internal memory). The re-encryption/swapping would then also occur at the page level. In addition, the decryption of CTR counters can be done even before tuples are read, leaving only a simple xor operation for each tuple. Given this, we believe it is feasible to implement decryption/encryption of the CTR counters at speeds that approach the peak 23.5MB/sec DES throughput achieved by the IBM 4758, and the xor operation would not introduce a significant delay. This approaches disk speeds, allowing a secure database server without a significant performance penalty.
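The counter-based scheme described in this subsection can be sketched in Python as follows. This is a minimal illustration, not the implementation used on the coprocessor: the standard library has no DES, so a truncated HMAC-SHA256 is used as a stand-in for the block cipher $E_K$ (an assumption), and all function names are hypothetical. Attributes are taken to be 32-bit unsigned integers packed into 64-bit blocks, with zero padding in the last block.

```python
import hashlib
import hmac

BLOCK_BYTES = 8  # 64-bit blocks, as in the example in the text


def keystream_block(key: bytes, counter: int) -> bytes:
    """Stand-in for E_K(counter): HMAC-SHA256 truncated to one block (assumption)."""
    msg = counter.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:BLOCK_BYTES]


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise xor of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_blocks(attrs):
    """Concatenate 32-bit attributes and zero-pad to whole 64-bit blocks."""
    data = b"".join(a.to_bytes(4, "big") for a in attrs)
    data += b"\x00" * (-len(data) % BLOCK_BYTES)
    return [data[p:p + BLOCK_BYTES] for p in range(0, len(data), BLOCK_BYTES)]


def encrypt_tuple(key: bytes, i: int, attrs):
    """Return (i*u, [E_K(i*u + j) xor B_ij for each block j of tuple i])."""
    blocks = split_blocks(attrs)
    u = len(blocks)  # blocks per tuple; the same for every tuple of a fixed schema
    base = i * u     # stored with the tuple so it can be decrypted on its own
    return base, [xor(keystream_block(key, base + j), b)
                  for j, b in enumerate(blocks)]


def decrypt_block(key: bytes, base: int, j: int, cipher_block: bytes) -> bytes:
    """Decrypt block j of a tuple independently, using only the stored base counter."""
    return xor(keystream_block(key, base + j), cipher_block)
```

Because `keystream_block` depends only on the key and the counter, the keystream for a run of counters can be computed before the corresponding tuples arrive from disk, leaving only the xor per block.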
There have been several efforts to develop systems for managing encrypted data. In [4], Ahituv, Lapid, and Neumann addressed the problem of managing updates (but not queries) on encrypted data. They require that which tuple is to be updated be known, but the value in that tuple is secret. They developed a method for additive updates in two cases: one where the value to be added is known to the server, and one where both are unknown. Their approach is based on homomorphic encryption: E(a + b) = E(a) + E(b). The idea is that using homomorphic encryption, we do not need to decrypt values to add them.

The issue of indexing encrypted data is addressed in [11]. The key idea is that values are partitioned into buckets. A query is translated into operations on these buckets; the server returns all results from the appropriate buckets. One problem with this technique is that buckets must be of a fixed size to avoid violating Definition 3.1 even in the absence of queries. While feasible for an initial database, maintaining this after insertions is not feasible. Thus the relative size of buckets reveals information about the distribution of the data. A second problem is that relationships between fields in a tuple are revealed, as in Example 3.1. Only probabilistic relationships are learned, but as the bucket size decreases the probability of learning such a relationship increases. Large buckets are not a feasible solution, as an equi-join becomes a cross-product of buckets, with the result size (and client effort) growing rapidly with larger buckets (as shown in [11]).

Ozsoyoglu et al. [15] also addressed running queries on encrypted databases. They suggest heuristic encryption methods that preserve some relationships among the data, such as order or the difference between attributes. Such order-preserving encryption functions cannot satisfy database indistinguishability (Definition 3.1). Consider databases with two attributes and n tuples. The first database contains tuples ⟨1, 1⟩, ⟨2, 2⟩, ..., the second ⟨1, n⟩, ⟨2, n − 1⟩, .... If the encryption preserves order, we can always distinguish between the two with probability 1, as the tuples sort the same on both attributes in database 1, and sort in opposite order on the two attributes in database 2. Similar arguments can be used for encryption that preserves the difference between messages. While data values may be protected from direct inspection, a determined server with some additional knowledge (e.g., a history of queries) may be able to significantly compromise security. With a careful definition of the system environment, these methods can be secure.

A method for order-preserving encryption that hides distributions, and an architecture for its use, is presented in [3].
The key is in their assumptions: the database software is trusted (the adversary only has access to the encrypted database, not the running system), and only one attribute in each relation uses order-preserving encryption. While secure within the assumptions, it does not meet our goal of a secure database service.

A key distinction between our work and that described above is that we address what the server learns from processing queries. Statistical database work has shown that a sequence of queries can reveal information beyond that revealed by any single query [1]; this is just as true for queries on encrypted data. Private Information Retrieval has addressed this issue [7], but under the assumption that the data is known to the server and the query must be kept private. The results are discouraging; the data access lower bound for a single server is the entire database. However, by encrypting both data and queries, we can obtain reasonable levels of security and avoid the impossibility results of Private Information Retrieval.
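The distinguishing argument against order-preserving encryption given earlier in this section can be checked with a short Python sketch. The "encryption" here is a toy (a random strictly increasing map per attribute), used only to stand in for any order-preserving scheme; the function names are hypothetical. The distinguisher never sees plaintext: it only tests whether the two encrypted attributes sort in the same order.

```python
import random


def toy_ope(values, rng):
    """Toy order-preserving map: a random strictly increasing function (illustrative only)."""
    cipher, current = {}, 0
    for v in sorted(set(values)):
        current += rng.randint(1, 1000)  # strictly positive gap keeps the order
        cipher[v] = current
    return cipher


def encrypt_db(db, rng):
    """Encrypt each attribute (column) with its own order-preserving map."""
    enc_a = toy_ope([a for a, _ in db], rng)
    enc_b = toy_ope([b for _, b in db], rng)
    rows = [(enc_a[a], enc_b[b]) for a, b in db]
    rng.shuffle(rows)  # the stored row order carries no information
    return rows


def guess_database(enc_rows):
    """Return 1 if both attributes sort the same way, else 2."""
    by_first = sorted(enc_rows)
    second = [b for _, b in by_first]
    return 1 if second == sorted(second) else 2
```

For any n of at least 2, the guess is correct for every choice of the random map, which is the "probability 1" claim in the text.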
Conclusions
The idea of a database server operating on encrypted data is a nice one: it opens up new business models, protects against unauthorized access, allows remote database services, etc. Achieving this vision requires compromises between security and efficiency. We have shown that a server that would be considered secure by the cryptography community would be hopelessly inefficient by the standards of the database community. Efficient methods (e.g., operations on encrypted data) cannot meet cryptographic standards of security. We have given a definition of security that is the best that can be achieved while maintaining reasonable levels of performance. We have shown that this definition can be realized using commercially available special-purpose hardware.

This definition and approach raise many questions. The first is real-world performance: what happens if we implement such a system? We plan to pursue such an implementation. This will lead to many challenges: more efficient join and indexing strategies that meet security requirements, concurrency control that does not violate security, query optimization approaches, etc.

A second issue is when the security offered by Definition 4.4 is inadequate, and only (inefficient) approaches meeting Definitions 3.2-3.3 are adequate. While we have shown that our approach is the best we can (efficiently) do, the disclosures may be too much for some applications. Progress in this area will require better definitions of security; e.g., privacy definitions as rigorous as the security definitions that enabled progress in multilevel secure databases [16].

In spite of these questions, a database server managing encrypted data is feasible. Such a server will provide substantial benefit, such as easing enforcement of several of the ten principles proposed for a Hippocratic Database system [2]. With careful and rigorous work on ensuring that security is achieved, we can expect to see significant progress in this area. Key areas for future work are:

• Formal proof that Definition 4.3 adequately protects against inference from tracking query accesses,
• Formal proof that the Definitions given are complete and consistent,
• Methods for proving that query processing algorithms meet security definitions,
• Query optimization methods that provide provably secure means of combining the various component algorithms,
• Practical efficiency issues: query and data pipelining in conjunction with encryption (this will demand implementation on real hardware), and
• System issues: How does this play out in real-world applications?

We believe a new and important research direction is taking off; this paper shows how, with careful definition of what it means to be secure, such research can ensure the promised security while achieving practical efficiency.
References
[1] N. R. Adam and J. C. Wortmann, "Security-control methods for statistical databases: A comparative study," ACM Computing Surveys, vol. 21, no. 4, pp. 515–556, Dec. 1989. [Online]. Available: http://doi.acm.org/10.1145/76894.76895
[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, "Hippocratic databases," in Proceedings of the 28th International Conference on Very Large Databases, Hong Kong, Aug. 20-23 2002, pp. 143–154. [Online]. Available: http://www.vldb.org/conf/2002/S05P02.pdf
[3] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, "Order-preserving encryption for numeric data," in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18 2004.
[4] N. Ahituv, Y. Lapid, and S. Neumann, "Processing encrypted data," Communications of the ACM, vol. 30, no. 9, pp. 777–780, Sept. 1987. [Online]. Available: http://doi.acm.org/10.1145/30401.30404
[5] D. Asonov and J.-C. Freytag, "Almost optimal private information retrieval," in Second International Workshop on Privacy Enhancing Technologies PET 2002. San Francisco, CA, USA: Springer-Verlag, Apr. 14-15 2002, pp. 209–223. [Online]. Available: http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2482&spage=209
[6] B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. P. Vadhan, and K. Yang, "On the (im)possibility of obfuscating programs," in Proceedings of the 21st Annual International Cryptology Conference on Advances in Cryptology (CRYPTO '01), J. Kilian, Ed. Santa Barbara, California: Springer-Verlag, Aug. 19-23 2001, pp. 1–18. [Online]. Available: http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2139&spage=1
[7] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, "Private information retrieval," Journal of the ACM, vol. 45, no. 6, pp. 965–981, 1998. [Online]. Available: http://doi.acm.org/10.1145/293347.293350
[8] O. Goldreich, The Foundations of Cryptography. Cambridge University Press, 2004, vol. 2, ch. Encryption Schemes. [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PSBookFrag/enc.ps
[9] O. Goldreich, The Foundations of Cryptography. Cambridge University Press, 2004, vol. 2, ch. General Cryptographic Protocols. [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PSBookFrag/prot.ps
[10] O. Goldreich and R. Ostrovsky, "Software protection and simulation on oblivious RAMs," Journal of the ACM, vol. 43, no. 3, pp. 431–473, May 1996. [Online]. Available: http://doi.acm.org/10.1145/233551.233553
[11] H. Hacigumus, B. R. Iyer, C. Li, and S. Mehrotra, "Executing SQL over encrypted data in the database-service-provider model," in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 4-6 2002, pp. 216–227. [Online]. Available: http://doi.acm.org/10.1145/564691.564717
[12] "IBM PCI cryptographic coprocessor." [Online]. Available: http://www.ibm.com/security/cryptocards/html/pcicc.shtml
[13] "CCA API performance - IBM PCI cryptographic coprocessor." [Online]. Available: http://www.ibm.com/security/cryptocards/html/perfcca.shtml
[14] P. Lin and K. S. Candan, "Hiding traversal of tree structured data from untrusted data stores," in Proceedings of Intelligence and Security Informatics: First NSF/NIJ Symposium ISI 2003, Tucson, AZ, USA, June 2-3 2003, p. 385. [Online]. Available: http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2665&spage=385
[15] G. Ozsoyoglu, D. A. Singer, and S. S. Chung, "Anti-tamper databases: Querying encrypted databases," in Proceedings of the 17th Annual IFIP WG 11.3 Working Conference on Database and Applications Security, Estes Park, Colorado, Aug. 4-6 2003. [Online]. Available: http://art.cwru.edu/TOpapers/IFIP2003.Security.pdf
[16] B. Thuraisingham and W. Ford, "Security constraint processing in a multilevel secure distributed database management system," IEEE Trans. Knowledge Data Eng., vol. 7, no. 2, Apr. 1995.
[17] "TCG TPM specification version 1.2," Nov. 5 2003. [Online]. Available: https://www.trustedcomputinggroup.org