Information Sciences
journal homepage: www.elsevier.com/locate/ins
Article info

Article history:
Received 31 January 2013
Received in revised form 17 July 2013
Accepted 1 October 2013
Available online 16 October 2013

Keywords:
Data outsourcing
Query processing
Data privacy and security

Abstract

Data outsourcing or database as a service is a new paradigm for data management. The third party service provider hosts databases as a service. These parties provide efficient and cheap data management by obviating the need to purchase expensive hardware and software, deal with software upgrades and hire professionals for administrative and maintenance tasks. However, due to recent governmental legislations, competition among companies and database thefts, companies cannot use database service providers directly. They need secure and privacy preserving data management techniques to be able to use them in practice. Since data is remotely stored in a privacy preserving manner, there are efficiency related problems such as poor query response time. We propose a new framework that provides efficient and scalable query response times by reducing the computation and communication costs. Furthermore, the proposed technique uses several service providers to guarantee the availability of the services while detecting the dishonest or faulty service providers without introducing additional overhead on the query response time. The evaluations demonstrate that our data outsourcing framework is scalable and practical.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
Data outsourcing or database as a service is a new paradigm for data management in which a third party service provider
hosts a database as a service. The service provides data management for its customers and thus obviates the need for the service
user to purchase expensive hardware and software, deal with software upgrades and hire professionals for administrative
and maintenance tasks. Since using an external database service promises reliable data storage at a low cost by
eliminating the need for expensive in-house data-management infrastructure, it is very attractive for companies. However,
recent governmental legislation, competition among companies and database thefts have pushed companies to use secure
and privacy preserving data management techniques. Using an external database service is a straightforward server–client
application in an environment where service providers and clients are honest and clients do not hesitate to share their data
with database service providers. However, this is usually not the case and thus the research challenge here is to build a robust
and efficient service to manage data in a secure and privacy preserving manner.
Current research has focused only on how to index and query encrypted data [20,21,9]. Although one of the main
problems is querying the encrypted data efficiently, it is not the only problem in data outsourcing. Since thousands of clients
per database service provider are expected, the scalability of the proposed techniques and the availability of the services are
very important problems. However, current proposals do not consider these issues and assume a simple scenario consisting of
an always available database service provider and a simple service user. Furthermore, they assume both of the parties are
honest and trust each other. For example, the service provider may corrupt the data and it would be impossible to recover
0020-0255/$ - see front matter Ó 2013 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.ins.2013.10.006
F. Emekci et al. / Information Sciences 263 (2014) 198–210 199
it for the service user. To be able to use external database service providers in real life, there should be a mechanism to recover
the data and also to prove that data has been corrupted. Providing a trust mechanism to push both database service
providers and clients to behave honestly is another important problem.
We propose a new data outsourcing framework providing efficient and scalable query response times. In addition to this,
the proposed technique uses multiple service providers to guarantee the availability of the services and to be able to recover
from hardware failures. Furthermore, we propose a technique to identify the dishonest or faulty service providers.
Current proposals use encryption to hide the content from service providers [20,9]. However, the computational complexity
of encrypting and decrypting data to execute a query increases the query response time. Therefore, this complexity is one
of the bottlenecks in current solutions [3]. The proposed solution in this paper uses information theoretically secure techniques
similar to Shamir's secret sharing mechanism [29] instead of computationally secure techniques such as encryption.
Therefore, the computational complexity of our solution is much less than that of the current proposals using encryption.
Furthermore, current proposals use label-based filtration to execute range queries [20,22]. However, a data provider reveals
some information about the underlying data by labeling a row. Therefore, there is a privacy performance tradeoff in these
solutions. Our technique does not reveal any information about the content of the data and only the required data is
retrieved from the service providers.
In this paper, we use multiple service providers for fault tolerance. Fault tolerance in this context is the availability
of service providers and the ability to recover from data corruption. Data corruption may happen due to either disk failures
or malicious service providers. Our solution deals with both of these faults without incurring any additional overhead on the
query response time.
The rest of the paper is organized as follows: The model and the types of queries are introduced and related work is
reviewed in Section 2. The basic attempts to solve the problem are discussed in Section 3. Section 4 presents the data
distribution technique. The query processing methods for our data distribution technique are studied in Section 5. Section 6
discusses the fault tolerance of the proposed technique. The query response time of the technique is analyzed in Section 7.
The last section discusses the future work and concludes the paper.
In this section, we define the problem and introduce the model. Then, we briefly discuss our solution and finally we
review the related work.
Assume data source D wants to outsource its data to eliminate its database maintenance cost by using the database service
provided by database services DAS1, ..., DASn. D needs to store and access its data remotely without revealing the content
of the database to any of the database services. For the sake of this discussion, assume D has a single table
Employees(EID, name, lastname, department, salary) in its database and stores Employees using the services provided by
DAS1, ..., DASn. After storing Employees, D needs to query Employees without revealing any information about either the content of
the table or the queries. Basically, D can pose any of the following queries over time:
1. Exact match queries such as "Retrieve all employees whose name is 'John'".
2. Range queries such as "Retrieve all employees whose salary is between 10K and 40K".
3. Aggregate queries such as MIN/MAX, MEDIAN, SUM and AVERAGE (including aggregate queries over ranges).
There are several proposals addressing exact match queries and range queries [20,9,3]. However, these proposals are not
complete and do reveal some information about the underlying data (e.g. the range of salaries of employees). In this paper,
we propose a complete approach to execute exact match, range and aggregation queries in a privacy preserving manner.
Throughout the paper, we will assume that there are two kinds of attributes in tables, namely numeric attributes (e.g. salary)
and non-numeric attributes (e.g. name). The solution will first be presented for numeric attributes and then we will show how
to extend it to non-numeric attributes. Throughout this paper, we will develop the work in [20,21], referred to as data
encryption, in parallel with our proposed technique, referred to as secret dividing, so as to show the differences and compare
them.
In our solution, data is divided into n shares and each share is stored at a different service provider. When a query is
generated at a data source, it is rewritten, the relevant shares are retrieved from the service providers, and the query answer
is reconstructed at the data source. In order to answer queries, any k of the service providers must be available. n and
k are the system parameters and will be discussed later.
Hacigumus et al. [20], Hore et al. [21] and Aggarwal et al. [3] propose using third parties as database service providers.
The differences between our work and these works are discussed throughout the paper.
The authors of [19] provided an extension to the work in [3] by splitting the columns. However, instead of splitting the
information of each column among several DAS providers, they split the columns among the service providers. To preserve
the privacy, the data source enforces privacy constraints expressible as combinations of columns that have to be split among
multiple service providers. The goal of using privacy constraints is to reduce the extent of encrypted columns. On the other
hand, coming up with the privacy constraints is a problem of its own, and the constraints are not easily understandable by the
end users who need privacy guarantees on their personal data. In the most conservative case, the system degenerates to the
case where all the columns are encrypted. In addition, partitioning columns among servers and identifying which columns to
encrypt (in order to cater for the workload) is a provably intractable problem. Most importantly, the scheme does not handle
the case of data corruption or malicious/curious service providers.
Storing and querying public health information where privacy is an important aspect is studied in [24]. The work focuses
on the case where external users can issue queries on the health records without revealing the identity of the patients and
the owners of the data. The authors make use of multiple trusted authorities to achieve scalability as well as privacy for both
the queries' keywords and the results even in the existence of "honest-but-curious" service providers. The authors' main
focus is multi-dimensional authorized private keyword searches supporting a subset of conjunctive formulas with equality,
subset and a class of simple range queries. The authors enhance the efficiency of the query execution via hierarchical
attributes. Our work is different from this as we focus on relational data and propose a framework to execute SQL-like queries
including select, join, production and aggregate queries.
The authors of [23] provided a general encryption-based architecture for cloud storage for data owners to store data on a cloud
and share it with other users. They employ a searchable encryption scheme that provides a way to encrypt a search
index so that its contents are hidden (except to a party that is given appropriate tokens). Given a token for a keyword, one can
retrieve pointers to the encrypted files that contain the keyword. The approach employs the efficient asymmetric searchable
encryption in [1] to be able to support range queries, which makes the data vulnerable to dictionary attacks.
The authors of [26] proposed a distributed storage system called Secret Sharing Storage System (SSSS), which uses the (k, n)
secret sharing scheme, while also encrypting the file blocks. That is, in the SSSS, files are secret data, and shares of files are
stored on storage nodes that are distributed on an ultra-fast network. The authors focus on agents for implementing the
basic functions for realizing the distributed storage system. A client agent receives a user request, transfers it to a server agent,
and returns its result to the user. Client agents, which can communicate with arbitrary server agents and can switch server
agents according to server agent load conditions or the network state, provide a nonstop storage system to users. Whenever
there is a file fetch request from a client, a server agent collects a total of k shares to decrypt the file, performs the
decryption, and returns the file to the client.
The work in [27] employs Shamir's secret sharing to propose a multisecret sharing technique. The proposed technique
recursively constructs the shares in order to hide multiple secrets in the n shares, such that any k of the n shares suffice
to recreate the secrets. While the algorithm is very efficient in terms of communication cost, we could not find a
straightforward way to incorporate it and still choose the polynomials in a way that allows for range queries.
There are several other research topics in this area other than secure data outsourcing such as privacy preserving data
sharing and privacy preserving data mining. Agrawal et al. [5] and Emekci et al. [17,16] proposed techniques to share data
across private databases. Emekci et al. [17,16] used secret sharing schemes in privacy preserving data sharing. In addition to
this, Aggarwal et al. [4] show the challenges in finding the kth element in the union of more than two databases while
preserving privacy and propose an approximate solution. Furthermore, several high level design efforts and requirement
specifications have been made to support the privacy of individual information while still supporting some degree of sharing
[2,6–8,12]. Although related, our work is orthogonal or complementary to privacy preserving data management in data
mining and information retrieval. In data mining, several efforts have been made to either preserve the privacy of individuals
using randomized techniques [10,11,18,28] or to preserve the privacy of the database while running data mining algorithms
over multiple databases [15,25] using cryptographic techniques such as secure multi-party computation and encryption. On
the other hand in privacy preserving information retrieval, the privacy of the query poser is preserved by hiding the record
he/she queried from the data source [13,14].
Data source D divides the numeric value in the numeric attribute into n shares and stores them at service providers DAS1,
DAS2, ..., DASn (one share for each of the service providers). The goal here is to divide a secret value into n shares to be stored
at n service providers such that they cannot figure out the secret even if they combine their shares. The solution is based on,
but slightly different from, Shamir's secret sharing method [29].
Our scheme allows data source D to distribute a secret value $v_s$ among n data service providers {DAS1, DAS2, ..., DASn}, such
that knowledge of any k ($k \le n$) service providers is required to reconstruct the secret in addition to some secret information,
X, known only by data source D. Since even complete knowledge of $k - 1$ peers cannot reveal any information about the
secret, even if they know secret information X, this method is information theoretically secure [29]. Data source D
chooses a random polynomial $q(x)$ of degree $k - 1$ where the constant term is the secret value, $v_s$, and secret information
X, which is a set of n random points. Then, data source D computes the share of each service provider as $q(x_i)$ and sends it
to data service provider DASi. The method is summarized in Algorithm 1.
Algorithm 1
Input:
  $v_s$: Secret value;
  D: Data source of secret $v_s$;
  DAS: Set of service providers $DAS_1, \ldots, DAS_n$ to distribute secret;
Output:
  $\mathrm{share}_1, \ldots, \mathrm{share}_n$: Shares of secret, $v_s$, for each service provider $DAS_i$;
Procedure:
  D creates a random polynomial $q(x) = a_{k-1}x^{k-1} + \cdots + a_1 x + a_0$ with degree $k - 1$ and a constant term $a_0 = v_s$.
  D chooses secret information X, which is n random points, $x_1, \ldots, x_n$, such that $x_i \neq 0$.
  D computes the share coming from $v_s$ for each service provider $DAS_i$, $\mathrm{share}(v_s, i)$, where $\mathrm{share}(v_s, i) = q(x_i)$.
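The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `split_secret`, the integer coefficient range, and the requirement that the secret points be distinct are assumptions introduced here.

```python
import secrets

def split_secret(v_s, xs, k):
    """Return one share q(x_i) per secret point x_i in xs, where q is a
    random polynomial of degree k - 1 whose constant term is v_s."""
    if not 1 <= k <= len(xs):
        raise ValueError("need 1 <= k <= n")
    if any(x == 0 for x in xs) or len(set(xs)) != len(xs):
        raise ValueError("secret points must be distinct and non-zero")
    # coeffs[j] multiplies x**j, so coeffs[0] is the secret value.
    # The range [1, 1000] for the random coefficients is an illustrative assumption.
    coeffs = [v_s] + [secrets.randbelow(1000) + 1 for _ in range(k - 1)]
    return [sum(c * x**j for j, c in enumerate(coeffs)) for x in xs]

# Example: n = 3 providers, k = 2, secret points X = {2, 4, 1}.
shares = split_secret(10, [2, 4, 1], k=2)
```

The list `xs` plays the role of the secret information X and stays at the data source; only the returned shares are sent out, one per provider.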
Data source D divides each secret value in its table using Algorithm 1 and stores them in different data service providers.
Since service providers do not know each other and secret information X, they cannot find out the secret values (even if they
combine their shares). In order to reconstruct the secret value vs, any set of k peers will need to share the information they
have received and they need to know the set of secret points, X, used by D. Since only data source D knows X, only it can
reconstruct the secret after getting at least k shares from any k of the service providers. The shares coming from service
providers can be rewritten as follows at the data source:
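\[
\begin{aligned}
\mathrm{share}(v_s, 1) &= q(x_1) = a x_1^{k-1} + b x_1^{k-2} + \cdots + v_s \\
\mathrm{share}(v_s, 2) &= q(x_2) = a x_2^{k-1} + b x_2^{k-2} + \cdots + v_s \\
&\;\;\vdots \\
\mathrm{share}(v_s, n) &= q(x_n) = a x_n^{k-1} + b x_n^{k-2} + \cdots + v_s
\end{aligned}
\]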
The secret value can be reconstructed using any k of the above equations since there are k unknowns including the secret
value $v_s$. The key observation is that at least k points and the corresponding shares are required in order to determine a
unique polynomial $q(x)$ of degree $k - 1$ along with secret information X.
After storing data with this method, when a query is posed, data source D collects all relevant shares from all service
providers, and then it calculates the corresponding secret values. Then, it executes the query using these secret values.
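Reconstruction at the data source can be sketched as below. The sketch uses Lagrange interpolation evaluated at $x = 0$, which gives the same constant term as solving the k linear equations above; the function name `reconstruct_secret` and the use of exact rational arithmetic are choices made here for illustration.

```python
from fractions import Fraction

def reconstruct_secret(points):
    """points: k pairs (x_i, share_i) with distinct x_i.
    Returns q(0) for the unique polynomial of degree k - 1 through them."""
    secret = Fraction(0)
    for j, (xj, yj) in enumerate(points):
        weight = Fraction(1)
        for m, (xm, _) in enumerate(points):
            if m != j:
                # Lagrange basis polynomial for point j, evaluated at x = 0.
                weight *= Fraction(-xm, xj - xm)
        secret += yj * weight
    return secret

# q(x) = 100x + 10 evaluated at the secret points 2 and 1 gives shares 210 and 110.
value = reconstruct_secret([(2, 210), (1, 110)])
```

Because the points $x_i$ are needed as inputs, a party holding only the shares cannot run this computation, which matches the role of X described above.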
Example 1. Assume that data source D needs to outsource the salary attribute of the Employees table using 3 data service
providers, DAS1, DAS2 and DAS3. In order to do this, it chooses 5 random polynomials of degree 1, one for each salary in the table,
whose constant term is the salary (n = 3 and k = 2). In addition, secret information X, X = {x1 = 2, x2 = 4, x3 = 1}, is also chosen,
with one point for each data service provider. Therefore, the polynomials would be q10(x) = 100x + 10, q20(x) = 5x + 20, q40(x) = x + 40,
q60(x) = 2x + 60 and q80(x) = 4x + 80 for salaries {10, 20, 40, 60, 80} respectively. Then, it sends {q10(xi), q20(xi), q40(xi), q60(xi),
q80(xi)} to service provider DASi to store them. This is summarized in Fig. 1. Note that neither the polynomials nor the salaries
are stored at the service provider; Fig. 1 shows them only for the sake of illustration. The service providers, on the other hand,
store the shares coming from the salaries. When a query comes, the data source needs to retrieve all shares from all service providers, i.e.,
{q10(xi), q20(xi), q40(xi), q60(xi), q80(xi)} from DASi. After this, it needs to find the coefficients of each polynomial q and thus
all secret salaries (note that receiving any k shares is enough for this since the polynomials are of degree k - 1). In our example,
data source D needs to receive shares from any 2 of the service providers and computes the coefficients of polynomials q10,
q20, q40, q60 and q80 and thus all salaries, 10, 20, 40, 60 and 80, to answer a query asking for salaries more than 40.
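The numbers in Example 1 can be checked with a short script. The polynomials and secret points are those given in the example; the helper names `shares_at` and `recover` are introduced here for illustration only.

```python
# Secret points and polynomials taken from Example 1 (n = 3, k = 2).
X = {1: 2, 2: 4, 3: 1}
# salary -> (slope, constant term) of its degree-1 polynomial
POLYS = {10: (100, 10), 20: (5, 20), 40: (1, 40), 60: (2, 60), 80: (4, 80)}

def shares_at(provider):
    """Map each salary to the share q_salary(x_i) stored at DAS_i."""
    x = X[provider]
    return {salary: slope * x + const for salary, (slope, const) in POLYS.items()}

def recover(share_i, x_i, share_j, x_j):
    """Constant term of the line through (x_i, share_i) and (x_j, share_j)."""
    return (share_i * x_j - share_j * x_i) // (x_j - x_i)

das1, das3 = shares_at(1), shares_at(3)
# Any two providers suffice to recover every salary.
salaries = [recover(das1[s], X[1], das3[s], X[3]) for s in POLYS]
# Salaries sorted by the share DAS_1 sees: the share order differs from the salary order.
order_at_das1 = sorted(POLYS, key=lambda s: das1[s])
```

At DAS1 the stored shares are 210, 30, 42, 64 and 88, so salary 10 carries the largest share. A provider therefore cannot filter by value, and the data source has to fetch every share to answer the query.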
The solutions proposed in Section 3 are impractical since the data source needs to retrieve all the information from the
service providers to execute a query. The communication and computation cost paid for query processing makes them
impractical. In this section, we will extend the techniques in Section 3 to be able to retrieve only the required data from
service providers.
The key observation to achieve this is that the order of the values in the domain DOM = {v1, v2, ..., vn} needs to remain the
same in the shares of the service providers. In other words, if data source D needs to outsource secret values from domain
DOM and v1 < v2 < ... < vn, the shares of a service provider DASi, share(v1, i), share(v2, i), ..., share(vn, i), derived from v1, v2, ..., vn
respectively, need to preserve the order (i.e., share(v1, i) < share(v2, i) < ... < share(vn, i)). Since the order of the shares at the
service provider is not preserved in the solution in Section 3, data service providers cannot filter data. However, if we had a
mechanism to construct the polynomials used in Section 3 so that shares are calculated in an order preserving manner for a specific
domain, then data source D could retrieve only the required tuples instead of a superset to answer a query. In this section,
we propose an order preserving polynomial building technique to achieve this goal. For the sake of this discussion, without
loss of generality, we will assume that polynomials are of degree 3 and in the following form: $ax^3 + bx^2 + cx + d$ (i.e., k = 4).
Given any two secret values $v_1$ and $v_2$ from a domain DOM, we need to construct two polynomials
$p_{v_1}(x) = a_1x^3 + b_1x^2 + c_1x + v_1$ and $p_{v_2}(x) = a_2x^3 + b_2x^2 + c_2x + v_2$ for these values such that $p_{v_1}(x) < p_{v_2}(x)$ at all points x
if $v_1 < v_2$. The key observation for our solution is that $p_{v_1}(x) < p_{v_2}(x)$ for all positive x values if $a_1 < a_2$, $b_1 < b_2$, $c_1 < c_2$ and
$v_1 < v_2$. We first present a simple approach to construct a set of order preserving polynomials and show why it is not secure
in Section 4.1. Then, we will present a secure way of constructing order preserving polynomials (Section 4.2).
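The key observation follows directly from subtracting the two polynomials. For $a_1 < a_2$, $b_1 < b_2$, $c_1 < c_2$, $v_1 < v_2$ and any $x > 0$:

\[
p_{v_2}(x) - p_{v_1}(x) = (a_2 - a_1)x^3 + (b_2 - b_1)x^2 + (c_2 - c_1)x + (v_2 - v_1) > 0,
\]

since every coefficient difference is positive and every power of a positive x is positive. The restriction to positive x matters: for negative x the odd-power terms change sign, so the inequality is no longer guaranteed.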
A straightforward method to form a set of order preserving polynomials for a specific domain is to use monotonically
increasing functions of the secret values to determine the coefficients of the polynomials. In this scheme, we need three
monotonically increasing functions $f_a$, $f_b$ and $f_c$ to find the coefficients of the polynomial $p_{v_s}(x) = ax^3 + bx^2 + cx + v_s$, which is used
to divide the secret value $v_s$. The coefficients of the polynomial $p_{v_s}$ are the values of the monotonically increasing functions of the
secret value $v_s$, where $a = f_a(v_s)$, $b = f_b(v_s)$ and $c = f_c(v_s)$. Therefore, for two secret values $v_1$ and $v_2$ ($v_1 < v_2$) and their respective
polynomials $p_{v_1}(x) = f_a(v_1)x^3 + f_b(v_1)x^2 + f_c(v_1)x + v_1$ and $p_{v_2}(x) = f_a(v_2)x^3 + f_b(v_2)x^2 + f_c(v_2)x + v_2$, the value of $p_{v_1}(x)$ is
always less than the value of polynomial $p_{v_2}(x)$ for all positive x values. Since any service provider $DAS_i$ gets the value of the
polynomials at point $x_i$, the share coming from secret value $v_1$, $\mathrm{share}(v_1, i)$, would always be less than the share coming from the
secret value $v_2$, $\mathrm{share}(v_2, i)$ (i.e., $p_{v_1}(x_i) < p_{v_2}(x_i)$).
However, this solution is not secure enough to protect secret values from the service providers. For example, assume the
following monotonic functions are used: $f_a(v_s) = 3v_s + 10$, $f_b(v_s) = v_s + 27$ and $f_c(v_s) = 5v_s + 1$. Then, the share of service provider $DAS_i$
from secret value $v_1$ would be $p_1(x_i) = (3v_1 + 10)x_i^3 + (v_1 + 27)x_i^2 + (5v_1 + 1)x_i + v_1$, which is
$p_1(x_i) = (3x_i^3 + x_i^2 + 5x_i + 1)v_1 + (10x_i^3 + 27x_i^2 + x_i)$. Basically, every secret is multiplied by the same constant and another fixed
constant is added to compute the share of a service provider. Therefore, a service provider breaking this method for only one secret item can
figure out all of the secret values and thus this method is easy to break. Instead of simple monotonic functions, more complex
monotonic functions can be used. However, again an adversary can figure out all the secret items by breaking the method for a single secret.
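The weakness can be made concrete with the three example functions above. The sketch below assumes, hypothetically, that the adversary has already learned the two constants of the affine relation for its point $x_i$; the function names are introduced here for illustration.

```python
def share(v, x):
    """Share of secret v at point x under f_a = 3v + 10, f_b = v + 27, f_c = 5v + 1."""
    return (3 * v + 10) * x**3 + (v + 27) * x**2 + (5 * v + 1) * x + v

def affine_constants(x):
    """Constants (A, B) such that share(v, x) == A * v + B for every v."""
    return 3 * x**3 + x**2 + 5 * x + 1, 10 * x**3 + 27 * x**2 + x

def recover_all(shares, x):
    """Invert the affine map for every stored share at once."""
    a, b = affine_constants(x)
    return [(s - b) // a for s in shares]

# Shares of the salaries 10, 20, 40, 60, 80 as stored at a provider with x_i = 2.
stored = [share(v, 2) for v in (10, 20, 40, 60, 80)]
leaked = recover_all(stored, 2)
```

Since the same pair (A, B) applies to every row, a single successful break exposes the whole column, which is why Section 4.2 selects the coefficients independently for each secret value.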
Since the method used in Section 4.1 to construct an order preserving polynomial is not secure enough, we will propose
another scheme to build order preserving polynomials for values from a specific domain.
In particular, we propose a secure method using different coefficients for each secret value so that service providers
cannot know the relation between secret values except the order.
In polynomial construction, the coefficients a, b and c are chosen from the domains DOMa, DOMb and DOMc. Since the
coefficients can be real numbers, the sizes of the coefficient domains are independent from the data domain size. For finite
domain DOM = {v1, v2, ..., vn}, the domains DOMa, DOMb and DOMc are divided into n equal sections. For example, DOMa is divided
into n slots: $\left[1, \frac{|Dom_a|}{n}\right]$ for $v_1$, $\left[\frac{|Dom_a|}{n} + 1, 2\frac{|Dom_a|}{n}\right]$ for $v_2$, ..., $\left[(n-1)\frac{|Dom_a|}{n} + 1, |Dom_a|\right]$ for $v_n$. After this, coefficient $a_{v_i}$ for value $v_i$ is
selected from the slot $\left[(i-1)\frac{|Dom_a|}{n} + 1, i\frac{|Dom_a|}{n}\right]$ with the help of hash function $h_a$, which maps $v_i$ to a value from
$\left[(i-1)\frac{|Dom_a|}{n} + 1, i\frac{|Dom_a|}{n}\right]$. The other coefficients $b_{v_i}$ and $c_{v_i}$ are computed similarly with the hash functions from domains $Dom_b$
and $Dom_c$. Finally, the polynomial used to divide the secret value $v_i$ into shares would be $p_{v_i}(x) = a_{v_i}x^3 + b_{v_i}x^2 + c_{v_i}x + v_i$.
Example 2. Assume the data domain is DOM = {1, 2, 3, 4, 5}, and we want to construct order preserving polynomials of degree 3
for this domain. In order to do this, we need to find 3 coefficients a, b and c for each value in DOM. Furthermore, assume
coefficients a, b and c are chosen from the domains Doma = [1, 25], Domb = [1, 15], and Domc = [1, 50] respectively. The domain
of each coefficient is divided into 5 equal pieces since we have 5 elements in domain DOM. For example, Doma is divided into
5 pieces: [1, 5], [6, 10], [11, 15], [16, 20] and [21, 25]. The other domains are divided into similar slots as shown in Fig. 2.
Coefficients a, b and c for secret item 3 are selected from the third slots in domains Doma, Domb and Domc respectively with
the help of the hash functions ha, hb and hc. Assume for the sake of this example that hash function ha maps secret value 3 to 13,
which is in the third slot of Doma, hb maps it to 7 in the third slot of Domb, and hc maps it to 23 in the third slot of Domc. Then,
the resulting polynomial for secret value 3 would be $p_3(x) = 13x^3 + 7x^2 + 23x + 3$. Similarly, the polynomial for secret value 5,
$p_5(x) = 24x^3 + 14x^2 + 44x + 5$, can be constructed with the same method using the values from the 5th slot of each domain.
The main observation here is that the value of polynomial $p_5(x)$ is always greater than the value of polynomial $p_3(x)$ for all
positive x values, since the values in the 5th slot are bigger than the values in the 3rd slot.
After constructing the polynomial $p_{v_i}$ for the secret value $v_i$, data source D divides the secret value $v_i$ into n pieces to be
sent to each of the service providers. In other words, D stores $p_{v_i}(x_1)$ at $DAS_1$, $p_{v_i}(x_2)$ at $DAS_2$, ..., $p_{v_i}(x_n)$ at $DAS_n$. The secret
value $v_i$ is reconstructed as described in Section 3 after getting these shares from the service providers. The service provider
$DAS_i$ storing $p_{v_i}(x_i)$ for the secret value $v_i$ cannot know the secret value $v_i$, because it does not know $x_i$ or anything about the
domains of the coefficients $Dom_a$, $Dom_b$ and $Dom_c$.
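The slot-based construction can be sketched as follows. The paper only states that hash functions $h_a$, $h_b$ and $h_c$ map a value into its slot; using a keyed HMAC-SHA256 for that purpose, the `key` parameter, and the function names are assumptions made for this illustration. The sketch also assumes that n divides each coefficient domain size, as in Example 2.

```python
import hashlib
import hmac

def slot_coefficient(key, label, value, rank, n, dom_size):
    """Pick a coefficient for the rank-th domain value (1-based) from the slot
    [(rank - 1) * dom_size / n + 1, rank * dom_size / n] of [1, dom_size]."""
    width = dom_size // n  # assumes n divides dom_size
    digest = hmac.new(key, f"{label}:{value}".encode(), hashlib.sha256).digest()
    offset = int.from_bytes(digest, "big") % width
    return (rank - 1) * width + 1 + offset

def build_polynomial(key, value, rank, n, dom_sizes):
    """Coefficients (a, b, c, d) of p(x) = a x^3 + b x^2 + c x + d with d = value."""
    a, b, c = (slot_coefficient(key, label, value, rank, n, size)
               for label, size in zip("abc", dom_sizes))
    return a, b, c, value

def evaluate(poly, x):
    a, b, c, d = poly
    return a * x**3 + b * x**2 + c * x + d

# Setting of Example 2: DOM = {1, ..., 5}, |Dom_a| = 25, |Dom_b| = 15, |Dom_c| = 50.
DOM = [1, 2, 3, 4, 5]
key = b"held only by the data source"  # hypothetical secret key
polys = [build_polynomial(key, v, rank, len(DOM), (25, 15, 50))
         for rank, v in enumerate(DOM, start=1)]
shares_at_x2 = [evaluate(p, 2) for p in polys]  # strictly increasing in the secret value
```

Because consecutive slots do not overlap, each coefficient is strictly larger for a larger domain value, so the shares at any positive point keep the order of the secrets while the exact position inside a slot depends on the hash.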
We now will discuss the security of the proposed polynomial construction technique. Basically, we will discuss what a
service provider can infer from the stored data and then show that it cannot know the content of the data with the inferred
information.
From the stored data, service provider $DAS_i$ can know an upper bound on the sum of the domain sizes (i.e.,
$|DOM| + |Dom_a| + |Dom_b| + |Dom_c|$). This can only happen when it stores the last secret value from DOM and the coefficients are mapped to
the last slots of the domains for the last secret value $v_n$ in the domain. Let us assume this worst case happened for now. Then,
the polynomial for secret value $v_n$ would be $P_{v_n}(x) = |Dom_a|x^3 + |Dom_b|x^2 + |Dom_c|x + v_n$ and the share of $DAS_i$ would be
$\mathrm{share}(v_n, i) = P_{v_n}(x_i) = |Dom_a|x_i^3 + |Dom_b|x_i^2 + |Dom_c|x_i + v_n$. From this share, $DAS_i$ can only know an upper bound on the
sum of the sizes of the domains and that upper bound is too loose to infer anything about the content of the data. Therefore,
we can derive the following lemma.
Lemma 1. Data service providers can only know an upper bound on the sum of the domains, the data domain and the coefficient
domains, from the stored information.
Furthermore, data service provider $DAS_i$ cannot know each domain size or the exact value of the sum of the coefficient
domain sizes even if it knows the secret point $x_i$ in the worst case scenario described above, because there are 4 unknowns,
$|Dom_a|$, $|Dom_b|$, $|Dom_c|$ and $v_n$, in the share of $DAS_i$, $\mathrm{share}(v_n, i) = P_{v_n}(x_i) = |Dom_a|x_i^3 + |Dom_b|x_i^2 + |Dom_c|x_i + v_n$ (assuming $x_i$ is
known). Thus, these unknowns cannot be found.
After the worst case scenario, we now discuss the general case. Even if data service provider $DAS_i$ knows the secret point $x_i$ and
the sum of the coefficient domain sizes $|Dom_a| + |Dom_b| + |Dom_c|$, it cannot infer anything about the secret items (even if
simple hash functions mapping the secret values to the first values in the slots are used). The coefficients of polynomial
$p_{v_i}(x)$ with these simple hash functions would be $a = v_i\frac{|Dom_a|}{n}$, $b = v_i\frac{|Dom_b|}{n}$ and $c = v_i\frac{|Dom_c|}{n}$ (hash functions $h_a$, $h_b$ and $h_c$ always map
secret values to the first value in each slot). Then, the share of $DAS_i$ would be:
\[
p_{v_i}(x_i) = v_i\frac{|Dom_a|}{n}x_i^3 + v_i\frac{|Dom_b|}{n}x_i^2 + v_i\frac{|Dom_c|}{n}x_i + v_i
\]
In addition to its share, if $DAS_i$ knows the sum of the sizes of the domains, which is $|Dom_a| + |Dom_b| + |Dom_c|$, there are 5
unknowns ($|Dom_a|$, $|Dom_b|$, $|Dom_c|$, $|DOM|$ and $v_i$) and 2 equations. Therefore, the unknowns and thus the secret value are not
revealed to data service provider $DAS_i$ even with these simple hash functions. In our scheme, a service provider can only
derive an upper bound on the sum of the domains $|Dom_a| + |Dom_b| + |Dom_c| + |DOM|$ but not the secret point $x_i$. In addition,
the hash functions map secret values to any value in the slot, not only the first value. The following lemma can be concluded
from this discussion.
Lemma 2. Data service provider DASi cannot know the secret value vi even if it knows the secret point xi.
The service provider $DAS_i$ storing $\{p_{v_1}(x_i), p_{v_2}(x_i), \ldots, p_{v_n}(x_i)\}$ cannot know the secret values $\{v_1, v_2, \ldots, v_n\}$. From this
information, service provider $DAS_i$ may learn an upper bound value (not tight) for the sum of the sizes of the domains
(i.e., $|DOM| + |Dom_a| + |Dom_b| + |Dom_c|$). In order to find the secret values, $DAS_i$ needs more information such as $x_i$ and the size
of each domain (the sizes of domains DOM, $Dom_a$, $Dom_b$ and $Dom_c$). Thus, the following lemma concludes the security of
the proposed scheme.
Lemma 3. Data service provider $DAS_i$ storing $p_{v_1}(x_i), p_{v_2}(x_i), \ldots, p_{v_n}(x_i)$ cannot know the secret values $v_1, v_2, \ldots, v_n$.
In addition to the security guarantees, for two secret values $v_i$ and $v_j$ from the same domain, service provider $DAS_i$ will get its
shares $\mathrm{share}(v_i, i) = p_{v_i}(x_i)$ (share of $DAS_i$ from $v_i$) and $\mathrm{share}(v_j, i) = p_{v_j}(x_i)$. If $v_i < v_j$ then $\mathrm{share}(v_i, i) < \mathrm{share}(v_j, i)$. The reason for
this is how the polynomials are constructed. The following lemma formalizes this discussion.
Lemma 4. For any two secret values $v_i$ and $v_j$ from the same domain, the shares of service provider $DAS_i$, $\mathrm{share}(v_i, i) = p_{v_i}(x_i)$ (share of
$DAS_i$ from $v_i$) and $\mathrm{share}(v_j, i) = p_{v_j}(x_i)$, preserve the order (i.e., if $v_i < v_j$ then $\mathrm{share}(v_i, i) < \mathrm{share}(v_j, i)$).
5. Query processing
In this section, we will discuss how to process queries in the Encryption with Labeling (EL) [20,21] and Secret Dividing (SD)
techniques discussed in Section 4. The queries are Exact Match Queries, Range Queries and Aggregation Queries.
We consider Sum/Average and Min/Max/Median aggregation queries and how to process them in the EL and SD methods. We
classify aggregation queries into two classes: (1) aggregations over exact matches and (2) aggregations over ranges. We will present
aggregation query processing techniques with the following example queries:
QUERY-I: Sum/Average of the salaries of the employees whose name is ‘John’ (Sum/Average over Exact Match).
QUERY-II: Sum/Average of the salaries of the employees whose salary is between 20 and 40 (Sum/Average over
Ranges).
QUERY-III: Min/Max/Median of the salaries of the employees whose name is ‘John’ (Min/Max/Median over Exact
Match).
QUERY-IV: Min/Max/Median of all the salaries of the employees whose salary is between 20 and 40 (Min/Max/Median
over Ranges).
In the SD method, the data source rewrites these queries for each service provider DASi by replacing the constants with the
corresponding shares:
QUERY-I: Sum/Average of the salaries of the employees whose name is share(‘John’, i).
QUERY-II: Sum/Average of the salaries of the employees whose salary is between share(20, i) and share(40, i).
QUERY-III: Min/Max/Median of the salaries of the employees whose name is share(‘John’, i).
QUERY-IV: Min/Max/Median of all the salaries of the employees whose salary is between share(20, i) and
share(40, i).
Then, DASi finds the tuples needed to answer these queries and performs an intermediate computation over them, which
will be discussed later. These intermediate results are then sent to the data source D. After getting all of these intermediate
results, data source D computes the final answer. In this scheme, only the intermediate results need to be sent by the
service providers, whereas a superset of the required tuples needs to be sent in the EL method. The communication cost
is therefore small (e.g., a single value representing the shared sum), and the query response time is correspondingly lower in this
scheme.
Assume data source D has secret values (e.g., salaries) $V = \{v_1, v_2, \ldots, v_n\}$. Recall that, in order to store them, the data source
D constructs a set of order preserving polynomials ($a_{v_j} x^{k-1} + b_{v_j} x^{k-2} + \ldots + v_j$ to hide each secret value $v_j$). After generating
these polynomials, it computes the share of DASi as $a_{v_j} x_i^{k-1} + b_{v_j} x_i^{k-2} + \ldots + v_j$
for each secret $v_j$ and stores these shares at DASi.
The Execution of QUERY-I and QUERY-II: To answer QUERY-I asking for the sum of the l secret values $\{v_1, v_2, \ldots, v_l\}$ from
V, DASi computes the intermediate result $INTRES_i = \sum_{m=1}^{l} share(v_m, i)$. Hence $INTRES_i$ can be written as follows:
$$
\begin{aligned}
& a_1 x_i^{k-1} + b_1 x_i^{k-2} + \ldots + v_1 \; + \\
& a_2 x_i^{k-1} + b_2 x_i^{k-2} + \ldots + v_2 \; + \\
& \qquad \vdots \\
& a_l x_i^{k-1} + b_l x_i^{k-2} + \ldots + v_l \\
& = (a_1 + \ldots + a_l) x_i^{k-1} + (b_1 + \ldots + b_l) x_i^{k-2} + \ldots + SUM
\end{aligned}
$$
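Data source D receives n intermediate results from the service providers and writes the following equations for the
intermediate results:
$$
\begin{aligned}
INTRES_1 &= (a_1 + \ldots + a_l) x_1^{k-1} + (b_1 + \ldots + b_l) x_1^{k-2} + \ldots + SUM \\
INTRES_2 &= (a_1 + \ldots + a_l) x_2^{k-1} + (b_1 + \ldots + b_l) x_2^{k-2} + \ldots + SUM \\
& \;\;\vdots \\
INTRES_n &= (a_1 + \ldots + a_l) x_n^{k-1} + (b_1 + \ldots + b_l) x_n^{k-2} + \ldots + SUM
\end{aligned}
$$
Since $X = \{x_1, x_2, \ldots, x_n\}$ is known by the data source, there are a total of k unknown coefficients including SUM and $n \geq k$
equations. Therefore, SUM can be found by solving any k of the above equations.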
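The sum computation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`make_polynomial`, `recover_constant`), the evaluation points, and the random coefficients are all assumptions made for the example, and the random coefficients are not order preserving (that property is not needed for sums). Exact fractions are used so that the linear system is solved without rounding error.

```python
from fractions import Fraction
import random

def make_polynomial(secret, k, rng):
    """Coefficients [c_{k-1}, ..., c_1, secret] of a degree k-1 polynomial (hypothetical helper)."""
    return [rng.randint(1, 1000) for _ in range(k - 1)] + [secret]

def evaluate(coeffs, x):
    """Horner evaluation of the polynomial at x."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result

def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination over exact fractions."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def recover_constant(xs, intermediate_results):
    """Solve k equations INTRES_i = c_{k-1} x_i^{k-1} + ... + c_0 and return c_0."""
    k = len(xs)
    vandermonde = [[x ** p for p in range(k - 1, -1, -1)] for x in xs]
    return solve_linear_system(vandermonde, intermediate_results)[-1]

rng = random.Random(7)
k, n = 3, 5
X = [2, 3, 5, 7, 11]          # assumed evaluation points x_1..x_n, known only to D
salaries = [20, 35, 40, 55]   # the l secret values selected by the query
polys = [make_polynomial(v, k, rng) for v in salaries]

# Each DAS_i adds up the shares it stores; it never sees the salaries.
intres = [sum(evaluate(p, x) for p in polys) for x in X]

# D picks any k of the n results and solves for the constant term.
recovered_sum = recover_constant(X[:k], intres[:k])
recovered_avg = recovered_sum / len(salaries)
```

Because any k equations determine the summed polynomial, a different subset such as `X[2:]` with `intres[2:]` yields the same sum, which is what makes the scheme tolerant to unavailable providers.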
For the average query, DASi sends $INTRES_i = \frac{1}{l} \sum_{m=1}^{l} share(v_m, i)$. Then the data source writes the equation
$INTRES_i = \frac{(a_1 + a_2 + \ldots + a_l)}{l} x_i^{k-1} + \ldots + AVG$, where $AVG = \frac{v_1 + v_2 + \ldots + v_l}{l}$. Therefore, data source D receives n results from the service
providers:
$$
\begin{aligned}
INTRES_1 &= \frac{(a_1 + a_2 + \ldots + a_l)}{l} x_1^{k-1} + \ldots + AVG \\
INTRES_2 &= \frac{(a_1 + a_2 + \ldots + a_l)}{l} x_2^{k-1} + \ldots + AVG \\
& \;\;\vdots \\
INTRES_n &= \frac{(a_1 + a_2 + \ldots + a_l)}{l} x_n^{k-1} + \ldots + AVG
\end{aligned}
$$
Again, since $X = \{x_1, x_2, \ldots, x_n\}$ is known by the data source, there are k unknown coefficients including AVG and $n \geq k$
equations. Therefore, AVG can be found by using any k of the above equations.
In order to answer QUERY-II, service provider DASi first finds the shares it stores between share(20, i) and share(40, i). Since
the polynomials are order preserving, DASi can find those shares in this range. Then, the operation performed for QUERY-I is
performed for these tuples to compute the sum of them.
The Execution of QUERY-III and QUERY-IV: The key observation for computing the answers to this set of queries is that
if $v_1 < v_2 < \ldots < v_n$, the shares of service provider DASi, share(v1, i), share(v2, i), . . . , share(vn, i), coming from v1, v2, . . . , vn
respectively, preserve the order ($share(v_1, i) < share(v_2, i) < \ldots < share(v_n, i)$). This result follows from the fact that order preserving
polynomials are used to compute the shares.
Assume l of the secret values, v1, v2, . . . , vl, satisfy the condition of QUERY-III (employees whose name is John). Depending
on the query, service provider DASi returns the minimum/maximum/median of its shares, share(v1, i), share(v2, i), . . . , share(vl, i).
Without loss of generality, we can assume that the query asks for the minimum. Then, service providers send the minimum of
their shares to the data source D.
After the service providers send their results back, data source D computes the value of the minimum by using the
results of the service providers. The intermediate result of the service provider DASi is in the following form:
$$INTRES_i = a x_i^{k-1} + \ldots + MIN.$$
Thus, MIN can be found in the same way as for sum/average queries. Data source D receives n intermediate results from the service
providers:
$$
\begin{aligned}
INTRES_1 &= a x_1^{k-1} + \ldots + MIN \\
INTRES_2 &= a x_2^{k-1} + \ldots + MIN \\
& \;\;\vdots \\
INTRES_n &= a x_n^{k-1} + \ldots + MIN
\end{aligned}
$$
Since $X = \{x_1, x_2, \ldots, x_n\}$ is known by the data source D, the minimum value can be computed by solving any k of the above
equations.
In order to answer QUERY-IV, service provider DASi first finds the shares between share(20, i) and share(40, i). Since the
polynomials are order preserving, DASi can find those shares in this range. Then, the operation performed for QUERY-III is
applied for these tuples to compute the answer.
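As a rough illustration of QUERY-IV, the sketch below filters shares by a rewritten range and returns the smallest one, after which the data source reconstructs the minimum. The coefficient functions inside `share` are a hypothetical stand-in for the order preserving construction of Section 4: they are merely increasing in the secret, which is enough to preserve order for positive evaluation points but is not meant to be secure. The constant term is recovered here with Lagrange interpolation at zero, which gives the same result as solving the k linear equations.

```python
from fractions import Fraction

K = 3  # number of shares needed; polynomials have degree K - 1

def share(v, x):
    """Share of secret v at point x, using hypothetical monotone coefficients."""
    a, b = 3 * v + 7, 5 * v + 1   # illustrative only: both increase with v
    return a * x ** 2 + b * x + v

def provider_min_in_range(stored_shares, low_share, high_share):
    """DAS_i: keep the shares inside the rewritten range, return the smallest."""
    return min(s for s in stored_shares if low_share <= s <= high_share)

def interpolate_at_zero(points):
    """Lagrange interpolation of the constant term from k (x, y) pairs."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        weight = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if i != j:
                weight *= Fraction(-xj, xi - xj)
        total += yi * weight
    return total

X = [2, 3, 5]                     # assumed evaluation points, known only to D
salaries = [55, 25, 40, 10, 32]   # secret values outsourced by D

results = []
for x in X:
    stored = [share(v, x) for v in salaries]       # what DAS_i holds
    low, high = share(20, x), share(40, x)         # rewritten range bounds
    results.append((x, provider_min_in_range(stored, low, high)))

minimum = interpolate_at_zero(results[:K])
```

Every provider returns the share of the same secret value because the ordering of shares matches the ordering of the secrets, so the k returned shares lie on one polynomial and can be interpolated.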
5.4. Discussion
We have considered only numeric attributes so far, and the proposed technique is designed for them. In order to apply
our scheme to non-numeric attributes, we need to convert them to numeric attributes. This conversion is straightforward.
For example, a name attribute with a length of 5 characters (i.e., VARCHAR(5)) can be represented as a numeric attribute
although it is in fact a non-numeric attribute. For the sake of this discussion, assume the characters in names can be one
of the letters in the English alphabet and that names can be shorter than 5 characters. Thus, the regular expression for this attribute
is $(A|B|\ldots|Z|*)^5$, where * represents a blank. The name attribute consists of a combination of 27 possible characters, which are
enumerated (* = 0, A = 1, B = 2, C = 3, . . . , Z = 26), and thus each name can be read as a number in a number system of base 27.
For example, the name ‘‘ABC**’’ can be rewritten as $(12300)_{27}$, which is equal to 572994 in decimal. With this simple
enumeration technique, non-numeric attributes can be converted into numeric attributes and then the proposed outsourcing
technique can be applied directly. With this enumeration, widely used queries over non-numeric attributes can be handled
easily. For example, a query asking for employees whose name starts with ‘‘AB’’ or a query
asking for employees whose name is between ‘‘Albert’’ and ‘‘Jack’’ can be converted into range queries and executed with the
range query processing technique in this paper.
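The enumeration can be sketched as follows, assuming the 26-letter English alphabet plus a blank written as `*` (27 symbols in total) and a fixed width of 5 characters. The function names `encode` and `prefix_range` are illustrative, not part of the paper.

```python
ALPHABET = "*ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # '*' is the blank: '*' = 0, 'A' = 1, ..., 'Z' = 26
BASE = len(ALPHABET)                        # 27 symbols
WIDTH = 5                                   # VARCHAR(5)

def encode(name):
    """Pad the name with blanks to WIDTH and read it as a base-27 number."""
    padded = name.upper().ljust(WIDTH, "*")
    if len(padded) > WIDTH:
        raise ValueError("name longer than the attribute width")
    value = 0
    for ch in padded:
        value = value * BASE + ALPHABET.index(ch)
    return value

def prefix_range(prefix):
    """Numeric bounds (inclusive) of all names that start with the prefix."""
    low = encode(prefix.ljust(WIDTH, "*"))
    high = encode(prefix.ljust(WIDTH, "Z"))
    return low, high

abc_code = encode("ABC")            # "ABC**" read in base 27
ab_low, ab_high = prefix_range("AB")
```

A query for names starting with "AB" then becomes a numeric range query over `[ab_low, ab_high]`, whose bounds the data source can replace with shares as in the range query technique.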
Moreover, we assumed data sources have only one table for the sake of the presentation and thus did not consider join of
tables. If they had more than one table in their schemas, they may need to join these tables. Our technique can be applied if
these tables are related to each other through referential keys and join is based on these keys. Consider a simple schema
consisting of two tables:
Employees (EID, Name, Lastname, Department, Salary).
Managers (EID, ManagerID, ManagerUserName, Password).
A possible query may ask for the salaries of all managers. To execute this query, these two tables should be joined using
the attribute EID. Our scheme can be directly applied to execute this query since the join is based on two attributes which are
from the same domain and our polynomials are constructed for each domain, not for each attribute. Therefore, this join can
be done by the service provider at the service provider site. However, if a join is based on two attributes from different
domains, such as Name and ManagerUserName, then the approach in this paper cannot be used for this kind of join. Thus, the
query asking for the salaries of the managers whose name is the same as the ManagerUserName cannot be answered with
the proposed scheme.
Finally, if we need to compare two attributes from different domains to execute a query, the proposed technique cannot
be applied. For example, a query asking for employees whose salary is 10 times their age cannot be answered efficiently
with our technique. On the other hand, a query asking for the employees whose salary is more than the salary of their
managers can be executed efficiently. Furthermore, a query asking for the employees whose salary is 2 times the salary of their
managers can be executed efficiently too. The execution of these queries is straightforward with the basic methods in this
section. In order to answer all kinds of queries efficiently with the proposed technique, we need to represent all attributes
with a universal domain. If we had such a domain, we could compare all attributes with each other and join tables based on any
subset of the attributes. We leave the construction of a universal domain as future work.
6. Fault tolerance
There are two issues related to the fault tolerance: (1) Service availability and (2) Malicious service providers. Both of
these issues are very important in using database services.
Data sources always need to answer their queries. In our scheme, a polynomial of degree k − 1 is used to divide the secret
and thus k shares and parties are needed to compute the secret. Therefore, in the secret dividing scheme, if k of the n service
providers are available, the queries can be answered using the shares coming from these service providers.
Another important problem is dealing with malicious service providers. These malicious service providers may corrupt
the shares they store (intentionally or unintentionally). Therefore, there must be a mechanism to detect malicious behavior
and to execute queries correctly in spite of it. In this section, we explore the fault tolerance of the proposed
data outsourcing technique. In response to a query, each of the n service providers sends its shares or intermediate
results to the data source. Then the data source solves the linear system and computes the secret values, which are the
answer to the query. Since the results retrieved from service providers are k-consistent (i.e., any k of the n equations give the same
value for the answer of the query), solving the linear system for any k of them is sufficient if all service providers are honest.
However, some of the service providers may be malicious and may send incorrect values; thus, solving one linear system of k
equations may not be sufficient in this case. There are $\binom{n}{k}$ different possible linear systems that could be used to
find the answer. If there are n − t honest service providers, then the solutions of $\binom{n-t}{k}$ linear systems would give the same
value, which is the answer of the query. However, any other linear system with at least one malicious third party would, with high
probability, give a different value. This follows from Lemma 5.
Lemma 5. Two different equation sets (i.e., two different linear systems) each with at least one malicious service provider produce
the same value as the result with a very low probability.
Proof. Two different linear systems $Ax_1 = a$ and $Bx_2 = b$ produce the same solution, i.e., $x_1 = x_2$, if $b = BA^{-1}a$. The
probability that the solutions of the two linear systems are the same is equal to the probability of receiving exactly this b when
the values are chosen at random, which is $1/|Dom|^k$. This is because providers do not know the matrices A and B. The probability of
two sets of results coming from service providers giving the same incorrect value is, therefore, negligibly small for large
domains. □
Therefore, if k is chosen to be smaller than the number of honest service providers, n − t, then $\binom{n-t}{k}$ results would be the
same while the rest of the results could be different if at least one malicious provider is involved. Having such a mechanism
would push service providers to behave honestly since a data source can prove the dishonesty of the malicious service
providers.
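One possible reading of this detection mechanism is sketched below: the data source solves every k-subset of the n results, takes the value on which most subsets agree, and flags any provider that never appears in an agreeing subset. The majority vote, the function names, and the sample polynomial are assumptions made for illustration; the paper itself only states that the $\binom{n-t}{k}$ honest subsets agree. Lagrange interpolation at zero stands in for solving each linear system.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def interpolate_at_zero(points):
    """Constant term of the degree k-1 polynomial through k (x, y) pairs."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        weight = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if i != j:
                weight *= Fraction(-xj, xi - xj)
        total += yi * weight
    return total

def robust_answer(points, k):
    """Solve every k-subset; return the most frequent answer and the suspected providers."""
    answers = {
        subset: interpolate_at_zero([points[i] for i in subset])
        for subset in combinations(range(len(points)), k)
    }
    value, _ = Counter(answers.values()).most_common(1)[0]
    trusted = {i for subset, ans in answers.items() if ans == value for i in subset}
    suspects = sorted(set(range(len(points))) - trusted)
    return value, suspects

def honest_result(x):
    """Hypothetical summed polynomial whose constant term (150) is the query answer."""
    return 4 * x ** 2 + 9 * x + 150

X = [2, 3, 5, 7, 11]                      # n = 5 providers, k = 3
points = [(x, honest_result(x)) for x in X]
points[3] = (7, honest_result(7) + 1000)  # provider with index 3 returns a corrupted value

value, suspects = robust_answer(points, k=3)
```

With n = 5, k = 3 and one faulty provider, the four subsets made only of honest providers agree on the answer, while each subset containing the corrupted result lands on a different value.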
It is possible to optimize this scheme using the result of Lemma 5. Finding the same value from two different linear systems
is sufficient to conclude that the value is correct. Throughout query processing, the data source can determine the
possibly trustworthy service providers and can use two such sets to execute the queries. Whenever there is a conflict
between the two sets, the data source would use the other sets to compute the final result.
7. Evaluation
In this section, we will compute the query response time of the two techniques EL and SD for exact match, range and
aggregation queries such as sum and average.
Let $C_d$ be the cost of decryption, B be the bandwidth, T be the number of tuples required to answer the query, and S be the
selectivity of the filtration. To answer the query in the EL method, data source D retrieves $S \cdot T$ tuples and decrypts all of them. If
the size of each tuple is b, then the communication cost would be $\frac{S \cdot T \cdot b}{B}$, and the cost of computation would be $S \cdot T \cdot C_d$.
Hence the query response time for all queries is
$$\frac{S \cdot T \cdot b}{B} + S \cdot T \cdot C_d.$$
The selectivity ratio S is equal to 1 in exact match queries and thus the query response time for exact match queries is:
$$\frac{T \cdot b}{B} + T \cdot C_d.$$
For the SD method, data source D needs to retrieve T tuples (i.e., shares) from n service providers for the exact match and
range queries. Let $C_p$ be the cost of computation of coefficients of the polynomial. Then the query response time for exact
and range queries in the SD method is
$$\frac{n \cdot T \cdot b}{B} + T \cdot C_p,$$
where $\frac{n \cdot T \cdot b}{B}$ is the communication cost and $T \cdot C_p$ is the computation cost. The query response time for aggregation queries is
quite different in the SD method. For these queries, service providers perform an intermediate computation, which consists of T
additions. If the cost of an addition is $C_a$ and the size of the intermediate result is b, the query response time for aggregation
queries is
$$\frac{n \cdot b}{B} + T \cdot C_a + C_p,$$
where $\frac{n \cdot b}{B}$ is the cost of retrieving intermediate results and $C_p$ is the cost of the final computation to find the query result from
intermediate results.
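The three cost formulas translate directly into code. The sketch below uses the tuple size and bandwidth stated in the evaluation (b = 1024 bits, B = 60 Kb/s, taken here as 60,000 bits/s), while the per-tuple costs `Cd`, `Cp` and `Ca` are hypothetical placeholders, since the text does not report their values.

```python
def el_response_time(T, S, b, B, Cd):
    """EL: transfer and decrypt S*T tuples of b bits over bandwidth B (bits/s)."""
    return S * T * b / B + S * T * Cd

def sd_response_time(T, n, b, B, Cp):
    """SD, exact match and range queries: T shares from each of n providers."""
    return n * T * b / B + T * Cp

def sd_aggregation_response_time(T, n, b, B, Ca, Cp):
    """SD, aggregation queries: one intermediate result per provider, T additions."""
    return n * b / B + T * Ca + Cp

# Parameters from the evaluation; the cost constants are assumed values.
b, B, n = 1024, 60_000, 3
Cd, Cp, Ca = 0.02, 0.001, 1e-6   # hypothetical per-operation costs in seconds

for T in (100, 500, 1000):
    el = el_response_time(T, S=1.5, b=b, B=B, Cd=Cd)
    sd = sd_response_time(T, n=n, b=b, B=B, Cp=Cp)
    agg = sd_aggregation_response_time(T, n=n, b=b, B=B, Ca=Ca, Cp=Cp)
    print(f"T={T}: EL={el:.2f}s  SD={sd:.2f}s  SD aggregation={agg:.4f}s")
```

The aggregation formula has no T in its communication term, which is why its response time stays nearly flat as the number of aggregated tuples grows.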
We simulate the two systems and compare the query response times with the following parameters: b = 1024 bits and
B = 60 Kb/s. For the SD method, we used 3 service providers and polynomials of degree 3, and symmetric encryption
is used for the EL method. Fig. 3 shows the query response time for exact match queries. The query response time varies with the
number of tuples in the query answer. We varied the number of tuples from 100 to 1000. The query response times of the two
methods are close to each other, as shown in Fig. 3.
[Fig. 3 plot: the query response time (y-axis, 5 to 55) against the number of tuples (x-axis, 100 to 1000) for the EL and SD methods.]
Fig. 3. The query response time (in s) for exact match queries.
[Figs. 4 and 5 plots: the query response time (y-axis, 0 to 120) against the number of tuples (x-axis, 100 to 1000) for EL with S = 1.25, 1.5, 1.75 and 2, and for SD.]
The query response times for range queries are shown in Fig. 4. We varied the number of tuples in the query result from
100 to 1000 and the selectivity ratio S from 1.25 to 2 for the EL method. Again, the
query response times of the EL method are very close to those of the SD method. However, the privacy leakage is greater in the EL method
due to labeling.
The query response times for aggregate queries are shown in Fig. 5. We varied the number of aggregated tuples in the
query from 100 to 1000 and the selectivity ratio S from 1.25 to 2 for the EL method. The query response time of the SD method
is much lower than that of the EL method since the cost of communication is less in the SD
method (only the intermediate results are sent instead of all tuples).
The query response time of our proposal, SD, is comparable to the EL method for exact queries and slightly better than the
EL method for range queries, while preserving more privacy than the EL method. Note that we assume a slow
communication link between the service providers and the data source (B = 60 Kbits/s). Since our proposal is less computation
intensive, increasing the bandwidth improves its query response time more than that of the EL method. The SD method
gives very efficient query response times for the aggregation queries compared to the EL method since the cost of computation
is almost zero.
8. Conclusion
We proposed a novel privacy preserving data outsourcing framework in this paper. The proposed data outsourcing frame-
work provides efficient and scalable query response times by introducing new efficient methods to store data at several
service providers and also query them in a privacy preserving manner. Since the proposed technique uses several service
providers, it guarantees the availability of the services. Furthermore, the dishonest or faulty service providers can be de-
tected without overhead on the query response time. However, there are several issues left as future work such as dealing
with infinite data domains, forming a universal data domain and transaction management.
References
[1] Advances in cryptology – crypto 2007, in: A. Menezes, (Ed.), 27th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 19–23,
2007, Proceedings, CRYPTO, Volume 4622 of Lecture Notes in Computer Science, Springer, 2007.
[2] G. Aggarwal, M. Bawa, P. Ganesan, H. Garcia-Molina, K. Kenthapadi, N. Mishra, R. Motwani, U. Srivastava, D. Thomas, J. Widom, Y. Xu, Enabling privacy
for the paranoids, in: Proc. of the 30th Int’l Conference on Very Large Databases VLDB, August 2004, pp. 708–719.
[3] G. Aggarwal, M. Bawa, P. Ganesan, H. Garcia-Molina, K. Kenthapadi, R. Motwani, U. Srivastava, D. Thomas, Y. Xu, Two can keep a secret: a distributed
architecture for secure database services, in: CIDR, 2005, pp. 186–199.
[4] G. Aggarwal, N. Mishra, B. Pinkas, Privacy-preserving computation of the k’th-ranked element, in: Proc. of IACR Eurocrypt, 2004, pp. 40–55.
[5] R. Agrawal, A. Evfimievski, R. Srikant, Information sharing across private databases, in: Proc. of the 2003 ACM SIGMOD International Conference on
Management of Data, 2003, pp. 86–97.
[6] R. Agrawal, P.J. Haas, J. Kiernan, A system for watermarking relational databases, in: Proc. of the 2003 ACM SIGMOD International Conference on
Management of Data, ACM Press, 2003, p. 674.
[7] R. Agrawal, J. Kiernan, R. Srikant, Y. Xu, Hippocratic databases, in: 28th Int’l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
[8] R. Agrawal, J. Kiernan, R. Srikant, Y. Xu, Implementing p3p using database technology, in: Proc. of the 19th Int’l Conference on Data Engineering,
Bangalore, India, March 2003.
[9] R. Agrawal, J. Kiernan, R. Srikant, Y. Xu, Order preserving encryption for numeric data, in: SIGMOD ’04: Proceedings of the 2004 ACM SIGMOD
International Conference on Management of Data, ACM Press, New York, NY, USA, 2004, pp. 563–574.
[10] R. Agrawal, R. Srikant, Privacy-preserving data mining, in: Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, ACM
Press, 2000, pp. 439–450.
[11] S. Agrawal, J.R. Haritsa, A framework for high-accuracy privacy-preserving mining, in: ICDE, 2005, pp. 193–204.
[12] E. Bertino, B.C. Ooi, Y. Yang, R.H. Deng, Privacy and ownership preserving of outsourced medical data, in: ICDE, 2005.
[13] C. Cachin, S. Micali, M. Stadler, Computationally private information retrieval with polylogarithmic communication, Lecture Notes in Computer Science
1592 (1999) 402–414.
[14] B. Chor, N. Gilboa, Computationally private information retrieval (extended abstract), in: Proc. of the Twenty-Ninth Annual ACM Symposium on Theory
of Computing, ACM Press, 1997, pp. 304–313.
[15] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M.Y. Zhu, Tools for privacy preserving distributed data mining, SIGKDD Exploration Newsletter 4 (2) (2002)
28–34.
[16] F. Emekci, D. Agrawal, A.E. Abbadi, Abacus: A distributed middleware for privacy preserving data sharing across private data warehouses, in: ACM/IFIP/
USENIX 6th International Middleware Conference, 2005.
[17] F. Emekci, D. Agrawal, A.E. Abbadi, A. Gulbeden, Privacy preserving query processing using third parties, in: ICDE, 2006.
[18] A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, Privacy preserving mining of association rules, in: Proc. of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, ACM Press, 2002, pp. 217–228.
[19] V. Ganapathy, D. Thomas, T. Feder, H. Garcia-Molina, R. Motwani, Distributing data for secure database services, Transactions on Data Privacy 5 (1)
(2012) 253–272.
[20] H. Hacigumus, B.R. Iyer, C. Li, S. Mehrotra, Executing SQL over encrypted data in the database service provider model, in: SIGMOD Conference, 2002.
[21] B. Hore, S. Mehrotra, G. Tsudik, A privacy-preserving index for range queries, in: Proc. of the 30th Int’l Conference on Very Large Databases VLDB, 2004,
pp. 720–731.
[22] B. Hore, S. Mehrotra, G. Tsudik, A privacy-preserving index for range queries, in: VLDB, 2004, pp. 720–731.
[23] S. Kamara, K. Lauter, Cryptographic cloud storage, in: Financial Cryptography Workshops, 2010, pp. 136–149.
[24] M. Li, S. Yu, N. Cao, W. Lou, Authorized private keyword search over encrypted data in cloud computing, in: ICDCS, 2011, pp. 383–392.
[25] Y. Lindell, B. Pinkas, Privacy preserving data mining, in: Proc. of the 20th Annual International Cryptology Conference on Advances in Cryptology,
Springer-Verlag, 2000, pp. 36–54.
[26] T. Miyamoto, S. Doi, H. Nogawa, S. Kumagai, Autonomous distributed secret sharing storage system, Systems and Computers in Japan 37 (6) (2006) 55–
63.
[27] A. Parakh, S. Kak, Recursive secret sharing for distributed storage and information hiding, CoRR, abs/1001.3331 (2010).
[28] S. Rizvi, J.R. Haritsa, Maintaining data privacy in association rule mining, in: Proc. of the 28th Int’l Conference on Very Large Databases, August 2002,
pp. 682–693.
[29] A. Shamir, How to share a secret, Communications of the ACM 22 (11) (1979) 612–613.