Chaotic Searchable Encryption For Mobile Cloud Storage


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2015.2511747, IEEE Transactions on Cloud Computing

IEEE TRANSACTIONS ON CLOUD COMPUTING 1

Chaotic Searchable Encryption for Mobile Cloud Storage
Abir Awad, Adrian Matthews, Yuansong Qiao, Brian Lee

Abstract—This paper considers the security problem of outsourcing storage from user devices to the cloud. A secure searchable encryption scheme is presented to enable searching of encrypted user data in the cloud. The scheme simultaneously supports fuzzy keyword searching and matched results ranking, which are two important factors in facilitating practical searchable encryption. A chaotic fuzzy transformation method is proposed to support secure fuzzy keyword indexing, storage and query. A secure posting list is also created to rank the matched results while maintaining the privacy and confidentiality of the user data, and saving the resources of the user mobile devices. Comprehensive tests have been performed and the experimental results show that the proposed scheme is efficient and suitable for a secure searchable cloud storage system.

Index Terms—Cloud, Security, Searchable encryption, Chaos, Locality sensitive hashing.

The paper was submitted in October 2014.
This work was carried out within the framework of the Irish Centre for Cloud Computing and Commerce (IC4), which is an Irish government and Enterprise Ireland supported technology research centre established in 2012.
A. Awad, B. Lee, A. Matthews and Y. Qiao are with the Irish Centre for Cloud Computing and Commerce Software Research, Athlone Institute of Technology, Software Research Institute, Athlone, Ireland.
e-mail: [email protected], [email protected], [email protected], [email protected]

I. INTRODUCTION

Cloud computing is a model to enable convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, and services) [1]. In the current Internet, people can easily access their data stored in the cloud with their mobile devices from anywhere, e.g., check emails, read the history of online chatting applications, view previously saved photos, videos or other kinds of documents. To provide security in all such scenarios, it is essential to store and access the outsourced data in a secure and efficient manner. For the protection of data privacy and control, data is usually encrypted before outsourcing, which makes its effective utilization a challenge. In particular, indexing and searching the outsourced encrypted data becomes problematic.

Searchable encryption (SE) allows searching over encrypted data in the cloud and returns to the user the data that correspond to the given keywords, without having to reveal the keywords. It is thus a critical enabler for securing outsourced data. Traditional searchable encryption schemes [2]-[7] allow a user to securely search over encrypted data through keywords but only support 1) exact keyword matching, which is not a practical requirement for current mobile phone input methods and 2) boolean search without capturing the relevance of data files. The system usability can be greatly enhanced by the use of fuzzy keyword search [1], [8]-[10] instead of traditional searchable encryption. Fuzzy, or error tolerant, searchable encryption returns to the user the files that match not only the exact predefined keywords but also the closest possible matched files based on keyword similarity semantics. Similarly, system usability is greatly enhanced by ranked search [11], [12], which returns the matched files in a ranked order determined by appropriate relevance criteria. This paper investigates the problem of supporting both ranked and fuzzy keyword search in a single scheme to achieve effective utilization of remotely stored encrypted data in mobile cloud computing applications.

Many approaches have been proposed to enable fuzzy search. Researchers in [8] consider the use of wildcards to enlarge the range of possible similar keywords searched, but this technique only covers part of the possible close keywords. A wildcard only permits capturing of errors provided we know where they are located in the keyword [1]. In [9], the authors proposed a new cryptographic primitive called Public Key Error Tolerant Searchable Encryption (PKETS), which is based on public key encryption with keyword search proposed in [2]. This algorithm was applied to biometric data in [13]. Acceptable erroneous keywords did not have to be specified in advance in their algorithm. However, this approach was designed for a special type of data, i.e. iris code. This technology is useful at airports as a replacement for passports but it is not designed for text documents. The authors in [14] proposed to embed edit distance (Levenshtein distance) into Hamming distance to obtain a fuzzy keyword search suitable for strings and then text files. This method uses existing locality sensitive hashing (LSH) to enable the fuzziness in the search method and has a very low distortion. However, this method is mainly theoretical and the proposed embedding technique introduces a lot of redundancy, which increases the dimension of the stored data, and is not suitable for the case of mobile usage because of the small amount of memory available. Another method, proposed in [15], uses Bloom filters and Jaccard similarity to perform the translation and the LSH. It also introduces ranking of the retrieved encrypted data. However, the ranking has to be performed by the user himself and not automatically by the server, which can add an unwanted burden for a mobile user's device.
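As a rough illustration of the keyword similarity notions surveyed above, the following Python sketch compares a keyword with a misspelled variant using the edit (Levenshtein) distance mentioned for [14] and the Jaccard similarity of character bigrams in the spirit of [15]. The function names, the bigram size and the sample words are assumptions made for this example; they are not taken from the cited schemes, and no encryption is involved.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams of a word (bigrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# A keyword and a variant with two transposed letters (hypothetical input).
print(edit_distance("network", "netwrok"))               # 2
print(jaccard(ngrams("network"), ngrams("netwrok")))     # 3 shared bigrams out of 9
```

Both measures agree that the two spellings are close, which is the property a fuzzy search scheme tries to preserve once the keywords are hidden from the server.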

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Actually, very few searchable encryption schemes support the ranking of matched items, though this problem has recently attracted the attention of some researchers [11], [12], [16]. Fuzziness and ranking are currently two different research axes and very few researchers have considered combining them [15], [17]. However, these methods are either not practical for mobile usage, as is the case in [15], or they suffer from security problems, as is the case in [17].

In this paper, we propose a new fuzzy transformation by introducing chaos and enhance the fuzziness through amplification of the LSH, which significantly improves both the security and the efficiency of the fuzzy searching process compared to the existing solutions. Furthermore, comprehensive tests on different LSH methods are performed in order to select the best one to be used in our algorithm. Chaotic systems are widely used in the cryptography domain and have attracted the attention of many researchers [21]-[23] due to the interesting characteristics of chaos. However, to the best of our knowledge, this is the first paper proposing to use chaos in searchable encryption schemes. Our proposed system is, in addition, designed to support fuzzy and ranking mechanisms and is proven to be practical for mobile usage.

The paper is organized as follows: Sections II and III provide background information and related works. Section IV presents the proposed and tested chaos based locality sensitive hashing methods. Section V describes the proposed chaotic searchable encryption solution. The simulation results and security analysis of the proposed algorithm are given in Section VI. Finally, we summarize our conclusions in Section VII.

II. USEFUL TOOLS

A. Locality sensitive hashing
LSH functions reduce, with high probability, the differences occurring between similar data. Similar results are therefore obtained for data in close proximity, whereas distant data remain remote.

Let $B$ be a metric space, $\lambda_1, \lambda_2 \in \mathbb{R}$ with $\lambda_1 < \lambda_2$ and $\epsilon_1, \epsilon_2 \in [0,1]$ with $\epsilon_1 > \epsilon_2$. A family $H = \{h_1, \dots, h_\mu\}$ is an LSH family if for all $x, x' \in B$:

$$\Pr[h(x) = h(x')] > \epsilon_1, \quad \text{if } d(x, x') < \lambda_1 \tag{1}$$

$$\Pr[h(x) = h(x')] < \epsilon_2, \quad \text{if } d(x, x') > \lambda_2 \tag{2}$$

where $d$ is the distance metric utilized (e.g. Hamming distance). The choice of a suitable LSH depends on the data types (binary, Euclidean space, biometric...). A survey of existing LSH families can be found in [24].

The Jaccard coefficient is usually used to measure the similarity between two sets ($A$ and $B$) containing words from two documents [25]. It is defined in (3) as follows:

$$s(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{3}$$

The distance between these sets can be obtained by (4) as follows:

$$d(A, B) = 1 - s(A, B) \tag{4}$$

For this measure, some researchers [26] proposed an LSH family called min-hash. If $\pi$ is a random permutation, the hash value is defined by (5) as follows:

$$h_\pi(A) = \min\{\pi(a) \mid a \in A\} \tag{5}$$

and the probability that the two hashed values are equal is equal to the Jaccard similarity (6):

$$\Pr[h_\pi(A) = h_\pi(B)] = s(A, B) \tag{6}$$

In our proposal, min-hash is used to support the fuzzy transformation applied to the keywords indexing the outsourced files.

B. Bloom filters
A Bloom filter is a data structure used for answering set membership queries.

Bloom filter with storage
A Bloom filter with storage is an extension of the Bloom filter [9], [27]. It gives not only the result of the set membership test but also an index associated with the element. It has an array of subsets called buckets $T_1, \dots, T_m$ which are initially empty. For each element $y$ to be indexed, we add to the bucket $T_\alpha$ all the tags associated with $y$ as follows: $T_\alpha \leftarrow T_\alpha \cup \psi(y)$, where $\psi$ is the tagging function and $\alpha = h_j(y)$, $j \in [1, v]$ (where $v$ is the number of hash functions). The set of tags associated with $y$ is obtained by computing the intersection between the corresponding buckets $\bigcap_{j=1}^{v} T_{h_j(y)}$.

Bloom filter encoding
Bloom filter encoding is described by the authors in [15]. A Bloom filter is a bit array that is associated with some hash functions. Each hash function maps an element to a bit location with a uniform probability. The Bloom filter in this case is used to embed a string S into the filter in order to obtain an array of numbers which can be used as an input for the minhash method. Each n-gram of a keyword is subject to each hash function and the corresponding bit locations are set to 1. The indices of the "1" values in the Bloom filter provide the array of numbers which can then be used as an input for the minwise permutation to obtain the minhash value.

C. Order Preserving Symmetric Encryption
The OPSE [28] is a deterministic encryption scheme in which the numerical order of the plaintexts is preserved by the encryption function and a comparison operation can then be performed without revealing the plaintext values. In our proposal, we use OPSE to encrypt the relevance score of each


keyword in the related files. These values need to be stored in the cloud in order to perform ranking in the search phase and must be secured as they can reveal information about the keywords and the files. Traditional methods such as AES are not appropriate for this case as the relevance score ranking needs to be achieved on the encrypted values. In this situation, encryption schemes like OPSE that preserve the numerical ordering should be used.

D. PWLCM Chaotic Map
Chaos has certain distinct characteristics, e.g. good pseudo-randomness and sensitivity to its control parameters, that can be directly linked to the properties of confusion and diffusion in cryptography. In addition, these systems are deterministic, meaning that their future behavior is fully determined by their parameters, with no random elements involved. However, the chaotic signal is pseudo-random and may appear as noise for unauthorized users. Chaotic values are often generated with simple iterations, which makes chaos suitable for designing strong and high speed systems.

PWLCM (Piece Wise Linear Chaotic Map) [21], [23] is one of the simplest chaotic systems and has good properties [29]. A piecewise linear chaotic map is a map composed of multiple linear segments and can be described in (7) as follows:

$$x(n) = F[x(n-1)] = \begin{cases} x(n-1) \times \dfrac{1}{p} & \text{if } 0 \le x(n-1) < p \\ [x(n-1) - p] \times \dfrac{1}{0.5 - p} & \text{if } p \le x(n-1) < 0.5 \\ F[1 - x(n-1)] & \text{if } 0.5 \le x(n-1) < 1 \end{cases} \tag{7}$$

where the positive control parameter and the initial condition are respectively $p \in (0; 0.5)$ and $x(i) \in (0; 1)$.

III. RELATED WORK

In this section, we briefly explain some existing searchable encryption methods. We classify these methods into three groups: fuzzy SE methods, ranking based SE methods and combined fuzziness and ranking based SE methods.

A. Fuzzy SE methods
In their papers [9], [13], Bringer et al. proposed a new scheme permitting search over encrypted data with an approximation of a keyword. An application in the biometric domain is also proposed. A biometric identification scheme arises from this construction; it permits identification of a person using his biometrics in an encrypted way. A specific difficulty concerning biometrics is their fuzziness. It is nearly impossible for a sensor to obtain the same image from biometric data twice. The classical way to solve this problem is to use a matching function, which basically tells if two measures represent the same biometric data or not, but these methods do not meet the privacy requirements that someone can expect from such an identification scheme. The Bringer et al. algorithm resolves this issue and provides the privacy missing in the existing algorithms. This method uses a combination of an LSH method specific to iris codes (beacon indexes) to enable the fuzziness and a Bloom filter with storage to accelerate the search on the encrypted data.

In [14], the authors modified the above mentioned algorithm to allow its usage for text messages. The changes entail applying embedding and sketching methods on the message, which enables the application of the above mentioned algorithm in [9], [13] that was previously used for biometric information. However, the algorithm is still theoretical and no implementation or test is provided.

The authors in [1] proposed an Effective Error-Tolerant Keyword Search for Secure Cloud Computing. They propose a scheme based on a fuzzy extractor. Their method is able to transform the servers' search for error-tolerant keywords on cipher texts to the search for exact keywords on plaintexts using an index table. Their method is tested on the Digital Bibliography & Library Project (DBLP) dataset, which was developed and is maintained by a team from Trier University in Germany. The algorithm seems promising but it does not take the ranking problem into consideration.

B. Ranking based SE method
In [11], the authors are the first to propose a ranked keyword search over encrypted cloud data that enables effective utilization of remotely stored encrypted data in the cloud. They embed weight information (relevance score) of each file during the establishment of a searchable index before outsourcing the encrypted file collection. They also used Order Preserving Symmetric Encryption (OPSE) to protect this sensitive information. Experimental evaluation is conducted on the Request For Comments (RFC) database [30]. This scheme allows the ranking of the searched files but does not take into account the fuzziness of the keyword.

C. Combined fuzziness and ranking based SE methods
In [15], the authors proposed a symmetric scheme for similarity search over encrypted data and their algorithm allows a fuzzy keyword search over text documents. First, a translation is used to embed strings into a Bloom filter. In this case, each keyword is represented by a set of substrings of length n, or n-grams. Then, each substring is hashed and the corresponding bit locations are set to one. The other buckets of the Bloom filter are null. The encoding, $R$, of the keyword is an array of the bit locations in the Bloom filter. Let $\Delta$ be the domain of all possible elements of the encoding set $R$ and $P$ a random permutation on $\Delta$; $P[i]$ is the element in the $i$-th position of $P$ and $\min$ is a function that returns the minimum of a set of numbers. Then, the minhash of a keyword under $P$ is as follows in (8):

$$\mathrm{minhash}_P(R) = \min(\{\, i \mid 1 \le i \le |\Delta| \wedge P[i] \in R \,\}) \tag{8}$$

This method also permits the user to perform the ranking by means of the encrypted bit vectors returned by the server as an


answer to the user's query. Once the ranking is performed, the user sends the identifiers of the top-scoring data items to the server which, in turn, returns the encrypted items corresponding to the provided identifiers. The user decrypts these to obtain their plaintexts. The authors provide an implementation and test of the method on the Enron dataset [31]. As can be seen, this method combines ranking of the returned results and fuzziness of the search. However, this method requires additional work, i.e. the ranking must be performed by the user, which is not practical for a mobile device. In our proposal, the user is relieved from this task and the ranking is calculated automatically by the server while maintaining the privacy and the confidentiality of the user.

In [17], we proposed a new solution enabling fuzzy search and ranking together on encrypted data in the cloud. This method uses a murmur hash function to enable the fuzziness. The LSH used takes a word as an input and then hashes each letter of this word. The locality sensitive hash value of this word is then the minimum of the letters' hashes. In Fig. 1, we give an example for the word "minhash".

Letters: m i n h a s h
Hashes: Lsh(m) Lsh(i) Lsh(n) Lsh(h) Lsh(a) Lsh(s) Lsh(h)
min: Lsh(n)
Fig. 1. A simple minhash example

Even though this method is quite fast and gives good results for the fuzzy search, it suffers from security issues because of the simple fuzzy transformation used.

IV. PROPOSED LOCALITY SENSITIVE HASHING METHODS

In this section, we describe the locality sensitive hashing methods proposed in this paper. In subsection A, we describe the two proposed minhash methods, the Grp and Omflip minhashes, which will be compared to the Kuzu minhash [15] later in the paper. In subsection B, we describe the amplification that we apply on the proposed minhashes to obtain the amplified Grp minhash and the amplified Omflip minhash, which are also compared to the amplified Kuzu minhash in section VI.

The modification of all the mentioned minhashes with the chaos that we propose in this paper is explained in subsection C, and the obtained chaotic minhashes are introduced: chaotic Kuzu minhash, chaotic Grp minhash, chaotic Omflip minhash, amplified chaotic Kuzu minhash and amplified chaotic Grp and Omflip minhashes.

A. Minhash methods
1) Grp minhash
In this method, the same scheme of Kuzu is used but the random permutation is replaced by the Grp permutation method [18], [19]. The GRP permutation is defined in (9) as follows:

$$R3 = GRP(R1, R2) \tag{9}$$

R1 is the source array, R2 is the configuration array which is generated by a pseudo random generator and R3 is the destination array for the permuted values.

The basic idea of the Grp is to divide the values from the source R1 into two groups according to the values in R2. For each bit in R1, we check the corresponding bit in R2. If the bit in R2 is 0, we move this bit from R1 into the first group. Otherwise, we put this bit into the second group (see Fig. 2).

R2: 1 0 0 1 1 0 1 0
R1: a b c d e f g h
R3: b c f h a d e g
Fig. 2. GRP permutation method

2) Omflip minhash
This method is also a variation of Kuzu but the random permutation is replaced this time by the Omflip permutation method [23]. The OMFLIP (OMega-FLIP) permutation is basically a concatenation of two permutation stages: an Omega stage and a Flip stage.

R2: 1 0 0 1 1 0 1 0
R1: a b c d e f g h
1st stage control: 1 - 0 - 0 - 1 -
1st stage result: e a b f c g h d
2nd stage control: 1 0 1 0 - - - -
R3: a b g h e f c d
Fig. 3. OMFLIP permutation method

In Fig. 3, R1 is the source array that contains the values to be permuted. R2 is the configuration array that is generated randomly and which should be of the same length as R1. R3 is the resultant permuted sequence. The values are permuted in 2 stages. In the first stage, the values of R1 are placed as shown in Fig. 3. The first $n/2$ values of the control sequence R2 (where $n$ is the length of R1) control the permuted array. If the value of the box $i$ of R2 is 1, the two adjacent cells $i$ and $i + 1$ are permuted. If not, nothing is done. In a second stage, the bits of the array resulting from


the first stage are placed as shown in Fig. 3. The last $n/2$ values of the control sequence R2 control the resulting array. If the value of R2 is 1, the two cells $i$ and $i + n/2$ are permuted. If not, nothing is done. At the end, the permuted OMFLIP array R3 is obtained.

B. Amplified minhash methods
To amplify a locality-sensitive hashing family, an AND-OR construction can be used [15].

The AND construction is formed with $k$ random functions from $H$: $g = (h_{i_1} \wedge h_{i_2} \wedge \dots \wedge h_{i_k})$. In this context, $g(x) = g(y)$ if and only if $\forall j\,(h_{i_j}(x) = h_{i_j}(y))$ where $1 \le j \le k$. The OR construction is formed with $\lambda$ different AND constructions such that $g(x) = g(y)$ if and only if $\exists i\,(g_i(x) = g_i(y))$ where $1 \le i \le \lambda$. With such a construction, we can turn an $(r_1, r_2, p_1, p_2)$-sensitive family into an $(r_1, r_2, p'_1, p'_2)$-sensitive family where $p'_1 = 1 - (1 - p_1^k)^\lambda$ and $p'_2 = 1 - (1 - p_2^k)^\lambda$.

We applied the AND-OR construction on the three explained minhash methods in order to obtain the amplified Kuzu minhash, amplified Grp minhash and amplified Omflip minhash that we compare in section VI of this paper.

C. Chaotic minhash methods
The idea of taking advantage of digital chaotic systems and of constructing chaotic cryptosystems has been extensively investigated and has attracted many researchers [18]-[23] but, to the best of our knowledge, it has not been previously considered for searchable encryption methods. In this paper, we propose new minhash methods based on the Piece Wise Linear Chaotic Map (PWLCM) presented in section II. In these methods, the translation, i.e. the encoding of the keyword, is performed by the chaotic map instead of the Bloom filter used by Kuzu et al. [15]. PWLCM is then used to transform the keyword to a set of numbers that will be used as input for the minwise permutation method in order to finally obtain the minhash value.

A 1-gram shingling is applied on each keyword and the ASCII code of each letter is mapped to the interval [0,1] and then encoded by the chaotic map. For each shingle, a number of iterations are performed and the obtained chaotic values are then mapped to integers in the interval [0,m], where m is a secret parameter for the minhash. Finally, the keyword is represented by an array of values that are used as an input for the minhash method. The usage of chaos instead of a Bloom filter in the translation phase in the above mentioned minhashes gives the following chaotic minhashes: chaotic Kuzu minhash, chaotic Grp and Omflip minhashes. When the amplification method, i.e. the AND-OR construction, is also applied on each one in addition to chaos, the amplified chaotic minhashes are obtained: amplified chaotic Kuzu minhash and amplified chaotic Grp and Omflip minhashes. A comparison is performed on these locality sensitive hashing methods in order to determine the best one to use in our searchable encryption method.

V. THE PROPOSED ALGORITHM
In this section, we propose a new chaotic searchable encryption algorithm. The proposed approach allows searching over encrypted data stored in the cloud and returns the relevant files to the queries in a ranked order. This scheme permits search not only with the exact keyword used during the storage process, but also with an approximate keyword. Fig. 4 introduces the considered scenario for the proposed algorithm.

Fig. 4. Usage scenario of the proposed algorithm (each stored file is associated with its keywords: Keyword 1, Keyword 2, ...)

It consists of two different parts: the sending process and the searching process. In the sending part, the user's computer encrypts the files that he wants to store in the cloud. It creates the meta-data necessary for the cloud to search these files later. In the searching process, the user queries the cloud through his mobile phone. The cloud receives the hashed query, performs the search, retrieves and returns the required documents in a ranked order. We give below the detailed description for both processes.

A. Sending a message x
This phase consists, basically, of two transformations: the fuzzy and the ranking calculation. The obtained keyword index (from the fuzzy calculation) and the ranking score, sc, are stored in the cloud in addition to the encrypted file itself. Assume that N files need to be added to the cloud. The following steps of the algorithm will be performed:
1- The cloud attributes to each file $F_j$ $(j = 1 \dots N)$ a unique identifier $F_{id_j}$ and sends it to the user device.
2- The user device sends the encrypted file $Enc(F_j)$ to the cloud, which stores it in a storage cell that depends on $F_{id_j}$.
3- The user device adds $F_{id_j}$ (the ID of the file) to the posting list $I_w$ of the corresponding keyword $w$ and also adds the relevance score sc of this keyword for the added file. A posting list (also referred to as inverted index) is an index data structure storing a mapping from content, such as words, to the location of a file in a set of documents. The purpose of a posting list is to allow fast searches over the database.
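The posting list described above can be pictured with a minimal Python sketch. It assumes the relevance scores have already been protected by an order-preserving scheme, so the stored integers can be compared directly; the function names, file identifiers and score values are hypothetical and only illustrate the data structure, not the authors' implementation.

```python
from collections import defaultdict


def add_posting(index, keyword, file_id, protected_score):
    """Append (file_id, protected_score) to the posting list of `keyword`."""
    index[keyword].append((file_id, protected_score))


def ranked_file_ids(index, keyword, top=None):
    """Return the file ids listed for `keyword`, highest protected score first."""
    entries = sorted(index.get(keyword, []), key=lambda entry: entry[1], reverse=True)
    ids = [file_id for file_id, _ in entries]
    return ids if top is None else ids[:top]


# keyword -> posting list; scores are assumed to be order-preserving ciphertexts
index = defaultdict(list)
add_posting(index, "cloud", "F1", 1730)
add_posting(index, "cloud", "F2", 4210)
add_posting(index, "cloud", "F3", 905)
add_posting(index, "chaos", "F2", 512)

print(ranked_file_ids(index, "cloud"))          # ['F2', 'F1', 'F3']
print(ranked_file_ids(index, "cloud", top=1))   # ['F2']
```

Because only the order of the stored values is used, the party holding the index can rank files without learning the underlying scores, which is the reason an order-preserving scheme is needed for this field.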


example of the posting list for a keyword yz is given in Fig. 5 and then retrieve the most relevant files for this
[17]. keyword/query.
We explain below the algorithm in steps:
}~c }~e }~• 1. Retrieve the items that contain a keyword yz . The user
File ID
...
}~
Relevance applies the amplified locality sensitive hashing on this
–L—˜ …T{byz , }~c gŠ –L—˜ …T{byz , }~e gŠ –L—˜ …T{byz , }~• gŠ
score keyword in order to construct the query of `”“ hash values.
Fig. 5. Posting list Ivw of the keyword w
Then the cloud uses this array of hashes as follows:
Each ` hashes are used to find the corresponding posting
Note that the size of the posting list varies from one list(s) in the hash map. This process is done for the “ sets.
keyword to another one depending on the number of files Then, the most frequently found posting list is considered as
where this keyword occurs. The definition of the relevance the most similar to the queried keyword and will be put in the
score sc is given in (10) as follows: first rank. The ranking is continued in descending order of the
frequency of occurrence of retrieved posting lists until
Œ
T{ |, }~• ! = ∑ˆ‰ ∈‹ . …1 + †Q‡~•,ˆ‰ Š . †Q i1 + j (10) completed.
•‚ƒ• • •މ
2. The cloud uses the retrieved and ranked posting lists in
|denotes the set of the keywords in each index.
order to find the required file IDs corresponding to the query
$f_{id,w_i}$ is the term frequency (TF) of the term $w_i$ in the file identified by $F_{id}$.
$|F_{id}|$ is the length of the file having the identifier $F_{id}$. It is obtained by counting the number of indexed terms.
$f_{w_i}$ is the number of files that contain the term $w_i$.
$N$ is the total number of files in the collection.

In our algorithm, we consider the case of a single keyword search. In this case, (11) could be used instead of (10) for the ranking purpose:

$$Score(w_i, F_{id}) = \frac{1}{|F_{id}|} \cdot \left(1 + \ln f_{id,w_i}\right) \qquad (11)$$

As directly outsourcing relevance scores (without encryption) will leak sensitive frequency information, thereby weakening the keyword privacy, order preserving encryption (OPE) is used to protect sensitive weight information (scores).
4- The locality sensitive hashing function (minhash), which is effective for the Jaccard measure, is applied on each keyword.
5- The user sends the minhash value(s) $lsh(w_i)$ and the posting list $P_{w_i}$ to the cloud where they are added to a hash map. Note that the AND-OR construction is applied on the locality sensitive hashing method. Then, we have k × λ hash values that will be used as pointers for the posting list of the corresponding keyword as follows:
Each array of k hashes will be inserted in a hash map which is a variant of the Bloom filter with storage. Each of the λ arrays of size k (k hashes) can then point to this posting list. This is how the AND-OR construction is performed. The same approach is also used in the retrieval process below.

B. Retrieving a message x
During this process, the client mobile phone needs to perform the fuzzy transformation on the keyword and query the cloud with it. In its turn, the cloud uses this index (hashed keyword) to find the corresponding posting (encrypted) list starting with the posting list with the highest rank.
3. The cloud returns the requested number of matched files in a ranked order based on the (encrypted) keyword scores in the files of each posting list i.e. the desired encrypted data items are returned to the user starting with the most relevant one (starting with the file having the highest score to the file having the lowest score). This comparison between the encrypted scores is possible thanks to the property of the OPSE encryption which conserves the numerical order of the scores' values. The files of the posting list of higher rank are first returned to the user device in order. Then, following the number of required files, the posting lists having lower ranks are used to return more files to the user device.
4. Once the encrypted items corresponding to the search request are retrieved, the user device decrypts them to obtain the plain versions of the requested files.

VI. EXPERIMENTAL RESULTS AND SECURITY ANALYSIS
The system architecture consists of three components that are implemented as software modules: The Client PC, Cloud Manager and Smartphone App. The Client PC software sends encrypted files and keyword indices to the cloud. The Cloud manager stores the encrypted files. The Smartphone App searches for files from the cloud. While a file can have many keywords as an index, the Smartphone will only be able to search with one keyword each time. It is assumed that there are secure communications between the different components of the system.
The hardware required for the implementation comprises:
- A PC acting as a cloud. The used PC is a Dell XPS 8500, Intel(R) Core(TM) i7-3770 CPU, 3.40 GHz x4, memory 12 GB.
- A PC representing the Client PC in the storage process. The used PC is a Dell, Intel Core i7-3770 CPU, 3.40 GHz x8, memory 11.8 GB.
- An Android phone representing the Smartphone client in the search process. The used phone is a Samsung Galaxy S3.
The test is performed over WiFi (IEEE 802.11g) network.
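The AND-OR construction used in the storage and retrieval steps of this scheme can be pictured with a small sketch. It is only an illustration under stated assumptions: keywords are encoded as character bigrams, and a seeded SHA-256 hash stands in for the secret chaotic minhash of the paper. The names `bigrams`, `minhash`, `and_or_signature` and `FuzzyIndex` are hypothetical and do not come from the authors' implementation.

```python
import hashlib
from collections import Counter, defaultdict


def bigrams(word):
    """Encode a keyword (length >= 2 assumed) as its set of character bigrams."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def minhash(shingles, seed):
    """One minhash value: the smallest seeded SHA-256 hash over the shingles.

    Assumption: plain SHA-256 replaces the paper's keyed chaotic minhash.
    """
    return min(
        int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode()).digest()[:8], "big")
        for s in shingles
    )


def and_or_signature(word, k=3, lam=37):
    """Return lam arrays of k minhashes: AND inside an array, OR across arrays."""
    shingles = bigrams(word)
    return [
        tuple(minhash(shingles, band * k + j) for j in range(k))
        for band in range(lam)
    ]


class FuzzyIndex:
    """Hash map in which each array of k hashes points to posting lists."""

    def __init__(self, k=3, lam=37):
        self.k, self.lam = k, lam
        self.buckets = defaultdict(set)  # (array position, array) -> posting ids
        self.postings = []               # posting id -> posting list

    def add(self, keyword, posting_list):
        """Store a posting list and make all lam arrays point to it."""
        posting_id = len(self.postings)
        self.postings.append(posting_list)
        for band, array in enumerate(and_or_signature(keyword, self.k, self.lam)):
            self.buckets[(band, array)].add(posting_id)
        return posting_id

    def query(self, word):
        """Count, per posting list, how many of the lam arrays match the query."""
        hits = Counter()
        for band, array in enumerate(and_or_signature(word, self.k, self.lam)):
            for posting_id in self.buckets.get((band, array), ()):
                hits[posting_id] += 1
        return hits
```

A posting list is a candidate as soon as one array matches (the OR step), and an array matches only if all k of its hashes agree (the AND step). The number of matching arrays gives a natural way to order candidate posting lists by similarity to the query.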
The client PC Software and the Cloud Manager are
implemented in Java. OPSE is C based and the Smartphone app is Android based. All the tests are performed on selected files from the Request For Comments database (RFC) [30].

A. Choice of the LSH
In this section, a comparison between a number of locality sensitive hashing methods with, and without, the amplification method, i.e. the AND-OR construction, is conducted in order to choose the best method to be used in our storage and search scheme.
To perform this test, we applied the locality sensitive hashing methods on 1000 selected keywords from the RFC database. In order to calculate the failure rate, we inserted random misspelling errors on these keywords (1 error/keyword), after which we applied the locality sensitive hashing methods and compared the resulting hashes.
The parameters of the AND-OR construction are k = 3, λ = 37. The secret parameter for the chaotic minhash is m = 70, the number of chaotic iterations performed on each shingle of the keyword is equal to the length of the keyword and the chaotic control parameter is p = 0.3.
It is assumed that failures are the number of times the locality sensitive hashing method does not lead to a similarity between the original and the misspelt word (erroneous word).
Table I shows the failure ratio of the following locality sensitive hashing methods: Kuzu minhash, Grp minhash, and Omflip minhash and their amplified versions.

TABLE I
FAILURE RATE FOR LSH METHODS WITH AND WITHOUT AMPLIFICATION
            | Kuzu minhash | Amplified Kuzu minhash | Omflip minhash | Amplified Omflip minhash | Grp minhash | Amplified Grp minhash
Failure (‰) | 570          | 250                    | 534            | 365                      | 540         | 385

As we can see, the failure is more than 500‰ for the locality sensitive hashing methods Kuzu, Omflip and Grp without amplification. Omflip minhash gives slightly better results (534‰) compared to Grp minhash (540‰) and Kuzu minhash (570‰), which gives the biggest failure in this case. However, applying the AND-OR construction, i.e. the amplification method with k = 3 and λ = 37, reduces the failure rate by around 30% (for the Omflip and Grp LSH methods), and better results are obtained for the amplified Kuzu method, where the failure rate is reduced by more than 56% compared to Kuzu LSH without amplification. Amplified Kuzu is the best compared to the amplified Omflip and Grp minhashes as it has the lowest failure rate among the three amplified minhashes. The failure rate can obviously be reduced further by increasing the number of generated minhashes for each keyword, i.e. by increasing the AND-OR parameters' values. However, increasing these parameters will add more complexity and require more computation time to calculate the LSH of each keyword, which will in turn affect the efficiency of the whole storage and search scheme. The effect of each of the amplification parameters is shown later on in the paper.
Table II shows the failure ratio of the chaotic locality sensitive hashing methods: chaotic Kuzu and Omflip minhashes and their amplified versions.

TABLE II
FAILURE RATE FOR CHAOTIC LSH METHODS WITH AND WITHOUT AMPLIFICATION
            | Chaotic Kuzu minhash | Amplified Chaotic Kuzu minhash | Chaotic Omflip minhash | Amplified Chaotic Omflip minhash
Failure (‰) | 139                  | 0                              | 117                    | 18

The chaotic locality sensitive hashing methods shown in Table II give better failure rates than the values shown in Table I. As we can see, the introduction of the chaos reduced the failure rate for Kuzu LSH (75%) and Omflip LSH (78%) compared with the original Kuzu and Omflip LSH methods and also gives better results than the amplification of the original LSH methods. However, the amplification of the chaotic methods is also more effective e.g. the failure rate is reduced from 25% to 0% by introducing chaos into the amplified Kuzu minhash.
The following tests are performed on the storage and search schemes using the amplified chaotic Kuzu locality sensitive hashing method as a fuzzy transformation and the same chaotic parameters values of this section are used.

B. Time of index construction
To allow the fuzzy ranked search, an index object (or posting list) is created for each indexed keyword on the client PC. This posting list contains the hashed keyword, the file IDs and the encrypted scores (encrypted with OPSE).
We built the indexes using 100 indexed keywords and 100 selected files from the RFCs. The used AND-OR construction parameters are k = 3 and λ = 37.

1) Effect of the indexed and stored files
Table III shows the effect of the indexed and stored files number on the index construction time. As we can see, the average index construction time increases with the number of stored files as the posting list size increases. The same keyword can be used, in this case, to index more files if this keyword occurs inside these files. The indexing and the score calculation are file content based.

TABLE III
FILES' NUMBER EFFECTS ON THE INDEX CONSTRUCTION TIME
Number of files                      | 20 | 100 | 300 | 500 | 800  | 1000 | 3000
Average index construction time (ms) | 56 | 182 | 484 | 786 | 1443 | 2073 | 11769
Average OPE time (ms)                | 50 | 163 | 429 | 697 | 1256 | 1781 | 9472
Rank calculation time (ms)           | 5  | 18  | 54  | 88  | 185  | 290  | 2295

As can be seen also, the OPSE requires the majority of the time of the index construction. It is argued here that it is implemented in a different language, C, that is called as an executable by the Java program leading to an increase in the processing time. A Java implementation of the OPSE may enhance the speed of the indexing construction process. The average time to perform the minhash on a keyword is 10 ms. This value is considered in the index construction time measurement but it is independent from the number of performed files and depends on the size of the keyword itself.
Table IV shows how the index construction time increases when increasing the number of indexing keywords.

TABLE IV
KEYWORDS' NUMBER EFFECTS ON THE INDEX CONSTRUCTION TIME
Number of keywords      | 10           | 100           | 300           | 500                | 800                | 1000
Index construction time | 5 sec 755 ms | 18 sec 223 ms | 45 sec 291 ms | 1 min 6 sec 468 ms | 1 min 16 sec 56 ms | 1 min 21 sec 521 ms
OPE time                | 5 sec 196 ms | 16 sec 344 ms | 39 sec 485 ms | 56 sec 981 ms      | 1 min 3 sec 633 ms | 1 min 6 sec 395 ms
Rank calculation time   | 494 ms       | 1 sec 787 ms  | 5 sec 663 ms  | 9 sec 298 ms       | 12 sec 180 ms      | 14 sec 855 ms
Minhash time            | 23 ms        | 44 ms         | 85 ms         | 121 ms             | 171 ms             | 195 ms

2) The effect of the LSH parameters
In this section, we measure the effect of the AND-OR construction parameters on the index construction time for 100 keywords used to index 100 files, which is shown in Table V and VI.
The effect of k, λ parameters is not significant on the index construction time. Actually, the AND-OR construction parameters affect only the locality sensitive hashing time which is relatively fast. The Rank calculation and the OPSE time depend on the number of files and keywords and are independent of the number of the hashes for each keyword.

TABLE V
EFFECTS OF k ON THE INDEX CONSTRUCTION TIME WHEN λ = 37
k                       | 1             | 3             | 7             | 9
Index construction time | 18 sec 391 ms | 18 sec 471 ms | 18 sec 512 ms | 18 sec 675 ms
OPE time                | 16 sec 531 ms | 16 sec 596 ms | 16 sec 620 ms | 16 sec 791 ms
Rank calculation time   | 1 sec 789 ms  | 1 sec 781 ms  | 1 sec 779 ms  | 1 sec 765 ms
Minhash time            | 31 ms         | 44 ms         | 68 ms         | 77 ms

TABLE VI
EFFECT OF λ ON THE INDEX CONSTRUCTION TIME WHEN k = 3
λ                       | 15            | 26            | 37            | 60
Index construction time | 17 sec 443 ms | 18 sec 189 ms | 18 sec 471 ms | 18 sec 531 ms
OPE time                | 15 sec 590 ms | 16 sec 338 ms | 16 sec 596 ms | 16 sec 630 ms
Rank calculation time   | 1 sec 780 ms  | 1 sec 776 ms  | 1 sec 781 ms  | 1 sec 806 ms
Minhash time            | 34 ms         | 38 ms         | 44 ms         | 58 ms

C. Search time
The search time means the average time between the search request and the identification of the files' identifiers.
This time can be split into three distinct slots:
The first slot $T_1$ is the time for processing the minhash over the phone. The second slot $T_2$ is the processing and search time on the Cloud Manager side. Finally, the third slot $T_3$ is the phone round trip time which includes the communication time between the phone and the cloud and the displaying of the file IDs on the phone.
To perform this measurement, 100 text files from the RFC indexed by 100 keywords are used. We show in Table VII the value of each time slot $T_i$, i = 1, 2, 3 for 100 queries. The AND-OR construction parameters are k = 3 and λ = 37.

TABLE VII
SEARCH TIME FOR 100 QUERIES
          | Minhash time $T_1$ | Cloud Manager processing time $T_2$ | Communication time $T_3$
Time (ms) | 792                | 60                                  | 306

The average search time for one query is shown in Table VIII.

TABLE VIII
SEARCH TIME FOR 1 KEYWORD
          | Minhash time $T_1$ | Cloud Manager processing time $T_2$ | Communication time $T_3$
Time (ms) | 10                 | 2                                   | 59

Notice that in Table VII, we presented the time when the cloud is performing the search for 100 queries and returning the corresponding file IDs in one round trip which reduces the latency of the wireless transmission and the networking delays.
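The cloud-side work counted in $T_2$ is the ranking step of the retrieval process: posting lists are visited from the most similar keyword to the least similar, and files inside each list are returned by decreasing encrypted score. The sketch below illustrates this under stated assumptions. `score` follows the single-keyword form of (11); `toy_order_preserving` is a hypothetical, insecure stand-in for OPSE that only keeps the numerical order; `top_t` and the data layout (lists of `(file_id, encrypted_score)` pairs) are illustrative names, not the authors' code.

```python
import math


def score(term_frequency, file_length):
    """Single-keyword relevance score in the form of (11): (1/|F|) * (1 + ln f)."""
    return (1.0 + math.log(term_frequency)) / file_length


def toy_order_preserving(score_value):
    """Stand-in for OPSE (assumption): a non-decreasing map, NOT secure.

    The cloud only needs the order of the scores, so any order-preserving
    encoding yields the same ranking as the plaintext scores.
    """
    return int(score_value * 1_000_000) * 3 + 17


def top_t(posting_lists, t):
    """Return up to t file IDs.

    posting_lists is ordered from the most to the least similar keyword;
    each posting list holds (file_id, encrypted_score) pairs.
    """
    ranked = []
    for posting_list in posting_lists:
        ordered = sorted(posting_list, key=lambda entry: entry[1], reverse=True)
        for file_id, _ in ordered:
            if len(ranked) == t:
                return ranked
            if file_id not in ranked:
                ranked.append(file_id)
    return ranked
```

Because only comparisons between encrypted scores are needed, the sort never touches plaintext values, which is the property the scheme relies on.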

1) The effects of the number of stored data items
In this test, we show the effects of the number of stored data items on the search time. The number of indexed keywords is 100. Table IX presents the changes in the processing and search time $T_2$ on the cloud manager side and the phone round trip search time $T_3$. In this test, k = 3, λ = 37 and the number of stored keywords are 100.
As we can see, the search time increases linearly with the number of stored data because the size of the posting list becomes bigger when increasing the number of stored files. In this case, the cloud manager needs to search and rank more files for each keyword. The number of relevant file IDs also increases which affects also the communication time.

TABLE IX
EFFECT OF THE NUMBER OF STORED FILES
Number of files                 | 20 | 100 | 200 | 500 | 800 | 1000 | 2000 | 3000
Time on the cloud/keyword (ms)  | 1  | 2   | 3   | 13  | 22  | 28   | 58   | 146
Communication time/keyword (ms) | 51 | 59  | 76  | 120 | 135 | 168  | 291  | 431

The search performance does not depend on the number of keywords. In fact, the cloud manager does not have to traverse every posting list in the stored indices, but instead uses a hash map to fetch the corresponding posting list which makes the search independent of the number of stored posting lists or indexed keywords.

2) The effect of the LSH parameters
Fig. 6 and 7 respectively show how the choice of k and λ affects the search time for 100 keywords used to index 100 files.

Fig. 6. Search time (ms) over k for λ = 37 (x-axis: k, 1 to 9; y-axis: search time, 0 to 4000 ms).

Fig. 7. Search time (ms) over λ for k = 3 (x-axis: λ, 15 to 65; y-axis: search time, 0 to 2500 ms).

In this test, we assumed that the user is looking only for the files indexed by the keyword most similar to the query. In this case, the search time is increasing with k and λ. Otherwise, if the user is requesting more files, the keywords with less similarity to the query will be considered. In this case, the search time will increase when increasing λ, because more similar items are retrieved, and decrease when increasing k because fewer similar items will be found and returned to the user device.

D. Precision/recall
Precision and recall metrics are used to evaluate the retrieval performance of our algorithm [15], [16].
To do the evaluation, 1000 random keywords are selected from the database and used to index the files. Then, we inserted spelling errors in 25% of these keywords and queried with the new set of keywords.
If $N_w$ is the number of positive files in the cloud, $R_q$ is the number of retrieved files for a query $q$ and $R_w$ is the number of positive retrieved files i.e. the number of retrieved files for the original keyword $w$.
Then, the precision $prec(q)$ and recall $rec(q)$ of a query $q$ and the averages $avprec(Q)$ and $avrec(Q)$ for the query set $Q = \{q_1, q_2, \dots, q_n\}$ are described in (12), (13), (14) and (15) as follows:

$$prec(q) = \frac{R_w}{R_q} \qquad (12)$$

$$avprec(Q) = \frac{\sum_{i=1}^{n} prec(q_i)}{n} \qquad (13)$$

$$rec(q) = \frac{R_w}{N_w} \qquad (14)$$

$$avrec(Q) = \frac{\sum_{i=1}^{n} rec(q_i)}{n} \qquad (15)$$
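The definitions in (12) to (15) can be sketched in a few lines. This is a minimal illustration, assuming that each query result is given as a collection of retrieved file IDs and that the positive files (those indexed by the original keyword) are known. The function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something stated in the paper.

```python
def precision(retrieved, positives):
    """prec(q): share of the retrieved files that are positive files."""
    retrieved, positives = set(retrieved), set(positives)
    if not retrieved:
        return 0.0  # assumption: no retrieved file counts as zero precision
    return len(retrieved & positives) / len(retrieved)


def recall(retrieved, positives):
    """rec(q): share of the positive files that were retrieved."""
    retrieved, positives = set(retrieved), set(positives)
    if not positives:
        return 0.0  # assumption: no positive file counts as zero recall
    return len(retrieved & positives) / len(positives)


def average(values):
    """avprec(Q) or avrec(Q): mean of the per-query values."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

For example, a query that returns four files of which two are among three positive files has a precision of 2/4 and a recall of 2/3.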

In this section, we calculate the precision and recall depending on the number of files t required by the user. Once the query is issued, if many posting lists are retrieved, these posting lists are first ranked according to their similarity with the query. The exact match is ranked first, then the similar
posting lists are ranked respectively. The retrieved document identifiers are ranked according to their scores. Finally, if the number of requested files t is smaller than the number of retrieved files, just the top t files are returned to the client (and considered in the calculation as the number of retrieved files); otherwise all the retrieved files (by the cloud provider) are returned to the user. Average precision and recall for 1000 queries when changing t are shown in Table X.

TABLE X
PRECISION AND RECALL FOR k = 3 AND λ = 37 WHEN THE NUMBER OF REQUESTED FILES t CHANGES
t         | 1    | 5    | 10   | 20   | 50   | 100  | 500  | 1000
avprec(Q) | 0.87 | 0.83 | 0.80 | 0.77 | 0.74 | 0.72 | 0.69 | 0.69
avrec(Q)  | 0.12 | 0.31 | 0.41 | 0.52 | 0.68 | 0.78 | 0.87 | 0.87

As we can see, the recall is increasing when the number, t, of required files increases. The number of positive files in the cloud does not change when t changes, so the denominator in the recall equation remains the same. However, the number of positive retrieved files (which the user receives) increases when the number of required files t is increasing. For this reason, the average recall curve is increasing with t. When the number of required files t is equal to the number of positive files in the cloud, the average recall becomes constant as we can see in Table X.
For the precision, the number of retrieved files is increasing with t. When t is small, the method is most likely finding the exact match to the query and all the retrieved files are positive, thus the precision is high. When t increases, the method is either (a) retrieving other files (in addition to the files with exact match to the query) that contain similar keywords to the query, so the number of retrieved files is bigger but the number of positive retrieved files (with exact match to the query) remains the same and the precision is smaller, or (b) not retrieving any further files, so the precision is constant. This explains the behavior of the precision curve.
In the following subsection, we study the effect of the AND-OR construction parameters, the chaos parameters and the error types on the precision and recall.

1) Effect of the AND-OR parameters
To perform this test, we first fix the AND construction parameter k and change the OR-construction parameter λ, then we fix λ and we change k. In Table XI, we show the effect of the OR construction parameter λ on the precision when the number of requested files t changes. We choose k = 3 and we show the precision for λ = 15, 37 and 60. As it can be seen, the precision decreases slightly when λ increases.
The recall does not change because the algorithm is still able to retrieve the positive keyword posting list and then the positive files. The number of retrieved posting lists (and respectively files) increases when λ increases but this list always contains the correct posting list (and respectively positive files) for k = 3 and the corresponding positive files are returned first to the user.

TABLE XI
PRECISION WHEN k = 3 AND WHEN THE NUMBER OF REQUESTED FILES t AND THE OR CONSTRUCTION PARAMETER λ CHANGE
t      | 1    | 5    | 10   | 20   | 50   | 100  | 500  | 1000
λ = 15 | 0.87 | 0.85 | 0.83 | 0.81 | 0.79 | 0.78 | 0.77 | 0.77
λ = 37 | 0.87 | 0.83 | 0.80 | 0.77 | 0.74 | 0.72 | 0.69 | 0.69
λ = 60 | 0.87 | 0.83 | 0.79 | 0.75 | 0.71 | 0.67 | 0.64 | 0.64

The number of retrieved files affects just the precision and not the recall as long as the positive files are still retrieved.
In Table XII and XIII, we show the effect of the AND-construction parameter k respectively on the precision and recall. We choose λ = 37 and we show the precision and recall for k = 1, 2 and 3.

TABLE XII
PRECISION WHEN λ = 37 AND WHEN THE NUMBER OF REQUESTED FILES t AND THE AND-CONSTRUCTION PARAMETER k CHANGE
t     | 1    | 5    | 10   | 20   | 50   | 100  | 500  | 1000
k = 3 | 0.87 | 0.83 | 0.80 | 0.77 | 0.74 | 0.72 | 0.69 | 0.69
k = 2 | 0.87 | 0.79 | 0.73 | 0.66 | 0.56 | 0.48 | 0.41 | 0.40
k = 1 | 0.37 | 0.31 | 0.28 | 0.24 | 0.21 | 0.19 | 0.18 | 0.18

TABLE XIII
RECALL WHEN λ = 37 AND WHEN THE NUMBER OF REQUESTED FILES t AND THE AND-CONSTRUCTION PARAMETER k CHANGE
t     | 1    | 5    | 10   | 20   | 50   | 100  | 500  | 1000
k = 3 | 0.12 | 0.31 | 0.41 | 0.52 | 0.68 | 0.78 | 0.87 | 0.87
k = 2 | 0.12 | 0.31 | 0.41 | 0.52 | 0.68 | 0.78 | 0.87 | 0.87
k = 1 | 0.07 | 0.16 | 0.20 | 0.25 | 0.30 | 0.33 | 0.36 | 0.36

Changing k has more effects on the precision values i.e. the precision increases with k. When k is higher, the retrieval method is better able to find the exact match for the query i.e. to find just the correct files without more items in addition. Then, the precision is higher. The recall is low for k = 1 and it is almost the same for k ≥ 2.

2) Effect of chaos parameter
In this section, we show how the precision and recall change with the chaos parameter p (Fig. 1). The precision slightly increases when p decreases, as we can see in Table XIV. However, the obtained recall values are the same when p changes.
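To make the role of a chaotic control parameter concrete, the sketch below iterates a piecewise linear chaotic map (PWLCM) for the three values of p used in the tests. This is an assumption made for illustration only: this part of the text does not state which chaotic map the scheme uses, and `pwlcm` and `orbit` are hypothetical names. The PWLCM is chosen because its control parameter lies in (0, 0.5), which is compatible with p = 0.1, 0.3 and 0.4.

```python
def pwlcm(x, p):
    """One iteration of a piecewise linear chaotic map on [0, 1].

    Assumption: standard PWLCM with control parameter 0 < p < 0.5.
    """
    if x >= 0.5:
        x = 1.0 - x          # the map is symmetric around 0.5
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)


def orbit(x0, p, n):
    """Return the first n iterates starting from x0."""
    values = []
    x = x0
    for _ in range(n):
        x = pwlcm(x, p)
        values.append(x)
    return values


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.4):
        print(p, [round(v, 4) for v in orbit(0.123, p, 5)])
```

Different values of p give different orbits from the same starting point, which is why the hashes change with p while, according to the tables, the retrieval quality stays close.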

TABLE XIV
PRECISION FOR k = 3 AND λ = 37 AND WHEN THE CHAOTIC PARAMETER p AND THE NUMBER OF REQUESTED FILES t CHANGE
t       | 1    | 5    | 10   | 20   | 50   | 100  | 500  | 1000
p = 0.4 | 0.87 | 0.83 | 0.80 | 0.77 | 0.72 | 0.70 | 0.67 | 0.67
p = 0.3 | 0.87 | 0.83 | 0.80 | 0.77 | 0.74 | 0.72 | 0.69 | 0.69
p = 0.1 | 0.87 | 0.85 | 0.83 | 0.80 | 0.78 | 0.77 | 0.76 | 0.76

3) Effect of the error type
Tables XV and XVI show respectively how the precision and recall values change with the type of error that occurs in the query. To do the evaluation, 100 erroneous queries are used to search over the database containing 100 files.
The type of error does not have a big effect on the precision/recall. As we can see in Tables XV and XVI, we obtain very close precision and recall values with all types of error with a maximum difference of 0.04 for the precision and 0.01 for the recall.

TABLE XV
PRECISION VALUES, WHEN THE NUMBER OF REQUESTED FILES t CHANGES, WITH THE TYPE OF ERROR OCCURRED IN THE SEARCH QUERY
t             | 1    | 5    | 10   | 20   | 50   | 100
Deletions     | 0.98 | 0.96 | 0.94 | 0.92 | 0.91 | 0.91
Permutations  | 0.99 | 0.95 | 0.92 | 0.91 | 0.90 | 0.90
Insertions    | 0.99 | 0.95 | 0.92 | 0.90 | 0.89 | 0.89
Substitutions | 0.99 | 0.95 | 0.92 | 0.89 | 0.87 | 0.87

TABLE XVI
RECALL VALUES, WHEN THE NUMBER OF REQUESTED FILES t CHANGES, WITH THE TYPE OF ERROR OCCURRED IN THE SEARCH QUERY
t             | 1    | 5    | 10   | 20   | 50   | 100
Deletions     | 0.24 | 0.60 | 0.76 | 0.89 | 0.95 | 0.98
Permutations  | 0.24 | 0.61 | 0.77 | 0.9  | 0.96 | 0.99
Insertions    | 0.24 | 0.61 | 0.77 | 0.9  | 0.96 | 0.99
Substitutions | 0.24 | 0.61 | 0.77 | 0.9  | 0.96 | 0.99

4) RFC vs Enron database
Tables XVII and XVIII show the precision and recall values for the RFC and Enron databases. For each database, 100 keywords are selected to index 100 files, then 100 erroneous keywords are used as queries for each dataset to perform the test. As we can see, similar curves are obtained for both databases.

TABLE XVII
PRECISION VALUES, WHEN THE NUMBER OF REQUESTED FILES t CHANGES, FOR RFC AND ENRON DATABASES
t     | 1    | 5    | 10   | 20   | 50   | 100
RFC   | 0.99 | 0.95 | 0.94 | 0.92 | 0.90 | 0.90
Enron | 0.99 | 0.96 | 0.95 | 0.94 | 0.93 | 0.93

TABLE XVIII
RECALL VALUES, WHEN THE NUMBER OF REQUESTED FILES t CHANGES, FOR RFC AND ENRON DATABASES
t     | 1    | 5    | 10   | 20   | 50   | 100
RFC   | 0.24 | 0.61 | 0.77 | 0.9  | 0.96 | 0.99
Enron | 0.27 | 0.59 | 0.74 | 0.84 | 0.93 | 0.99

E. Average retrieval ratio
Let $w$ be the keyword used to index a document and $q$ be the query. If $w_d$ is the number of keywords within a distance $d$ from a query $q$ and $R_{w_d}(q)$ is the number of files indexed by these keywords i.e. the number of retrieved files within a distance $d$ from the query $q$, which is a subset of $N_{w_d}(q)$, the number of files within a distance $d$ in the database, then the retrieved ratio $rr_d(q_i)$ of a query $q_i$ and the average retrieved ratio $arr_d(Q)$ of the query set $Q = \{q_1, q_2, \dots, q_n\}$ are described in (16) and (17) as follows [15], [16]:

$$rr_d(q_i) = \frac{R_{w_d}(q_i)}{N_{w_d}(q_i)}, \quad i = 1, \dots, n \qquad (16)$$

$$arr_d(Q) = \frac{\sum_{i=1}^{n} rr_d(q_i)}{n} \qquad (17)$$

where $n$ is the number of queries.
The distance between a query and a file is the Jaccard distance between the encoding of the query and the encoding of the keyword that led to this file. As with the precision and recall tests, 1000 random keywords are selected from the database and used to index the files. When querying, 25% of these keywords have spelling errors.
In the following subsection, we study the effect of the AND-OR construction parameters, the chaos parameters and the error types on the retrieval ratio.

1) Effects of the AND-OR parameters
To perform this test, we first fix the AND construction parameter k and change the OR-construction parameter λ then we fix λ and change k.
In Table XIX, we show the effects of the OR construction parameter λ on the retrieval ratio. We choose k = 3 and we show the retrieval ratio for λ = 15, 37 and 60.
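The retrieval ratio of (16) and its average in (17) can be sketched as follows. This is an illustration under stated assumptions: keywords and queries are encoded as sets of character bigrams (the exact encoding is not given in this part of the text), the index is a plain dictionary from keyword to file IDs, and all function names are hypothetical.

```python
def bigrams(word):
    """Assumed encoding: the set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def retrieval_ratio(query, index, retrieved, d):
    """rr_d(q): retrieved files within distance d over all files within d.

    index maps each stored keyword to the set of file IDs it indexes;
    retrieved is the set of file IDs returned for the query.
    Returns None when no stored keyword lies within distance d.
    """
    within = set()
    for keyword, files in index.items():
        if jaccard_distance(bigrams(query), bigrams(keyword)) <= d:
            within |= set(files)
    if not within:
        return None
    return len(within & set(retrieved)) / len(within)


def average_retrieval_ratio(ratios):
    """arr_d(Q): mean of the defined per-query ratios."""
    defined = [r for r in ratios if r is not None]
    return sum(defined) / len(defined) if defined else 0.0
```

As d grows, more stored keywords fall within the distance, so the denominator grows while the retrieved set is fixed. This matches the general shape of the tables, where the ratio is highest at d = 0 and decreases for larger distances.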

TABLE XIX
RETRIEVAL RATIO, WHEN THE DISTANCE d CHANGES, FOR k = 3 AND WHEN λ CHANGES
d      | 0 | 0.2  | 0.3  | 0.4  | 0.5  | 0.7  | 1
λ = 60 | 1 | 0.98 | 0.88 | 0.68 | 0.45 | 0.29 | 0.27
λ = 37 | 1 | 0.98 | 0.85 | 0.61 | 0.39 | 0.25 | 0.23
λ = 15 | 1 | 0.97 | 0.83 | 0.57 | 0.35 | 0.21 | 0.20

As we can see, the retrieval ratio is higher when λ increases. By increasing this value, more items are found and then more items within a given distance d are retrieved.
In Table XX, we show the effect of the AND-construction parameter k on the retrieval ratio. We choose λ = 37 and we show the retrieval ratio for k = 2, 3 and 5.

TABLE XX
RETRIEVAL RATIO, WHEN THE DISTANCE d CHANGES, FOR λ = 37 AND WHEN k CHANGES
d     | 0 | 0.2  | 0.3  | 0.4  | 0.5  | 0.7  | 1
k = 2 | 1 | 0.99 | 0.93 | 0.76 | 0.55 | 0.36 | 0.33
k = 3 | 1 | 0.98 | 0.85 | 0.61 | 0.39 | 0.25 | 0.23
k = 5 | 1 | 0.90 | 0.66 | 0.30 | 0.09 | 0.01 | 0.01

As we can see, the retrieval value decreases when k increases. Actually, the algorithm is more selective, thus the precision is higher but the retrieval ratio is smaller.

2) Effect of the chaos parameter
The use of chaos in cryptography has attracted many researchers due to its randomness and high sensitivity to its control parameters. Even if the chaotic values change when modifying the control parameter p, this change doesn't significantly affect the overall behaviour of the system vis-à-vis the retrieval ratio as we can see in Table XXI. We obtain very close results for p = 0.1, 0.3 and 0.4 with a maximum difference of 0.05 between the obtained retrieved ratio values.

TABLE XXI
RETRIEVAL RATIO, WHEN THE DISTANCE d CHANGES, FOR DIFFERENT VALUES FOR THE CHAOTIC PARAMETER p
d       | 0 | 0.2  | 0.3  | 0.4  | 0.5  | 0.7  | 1
p = 0.4 | 1 | 0.99 | 0.88 | 0.66 | 0.43 | 0.26 | 0.24
p = 0.3 | 1 | 0.98 | 0.85 | 0.61 | 0.39 | 0.25 | 0.23
p = 0.1 | 1 | 0.98 | 0.85 | 0.61 | 0.38 | 0.22 | 0.20

3) Effect of the errors' types
In the previous tests, random errors are inserted in the query list of keywords. In this section, we test how the retrieval ratio changes with each type of error i.e. deletion of a letter of the keyword, insertion or substitution of a letter or permutation of two adjacent letters.
In this test, 100 erroneous keywords are used to query the database. As we can see in Table XXII, the retrieval ratio values are very close for all types of error with a maximum of 0.04 difference between them. Consequently, the test with a random error in the query is realistic.

TABLE XXII
RETRIEVAL RATIO, WHEN THE DISTANCE d CHANGES, FOR DIFFERENT TYPE OF ERRORS IN THE QUERY
d             | 0 | 0.2  | 0.3  | 0.4  | 0.5  | 0.7  | 1
Deletions     | 1 | 0.94 | 0.79 | 0.56 | 0.36 | 0.20 | 0.19
Insertions    | 1 | 0.96 | 0.81 | 0.55 | 0.34 | 0.19 | 0.19
Permutations  | 1 | 0.94 | 0.79 | 0.59 | 0.34 | 0.19 | 0.18
Substitutions | 1 | 0.94 | 0.82 | 0.60 | 0.34 | 0.20 | 0.19

4) Enron database
The same retrieval ratio test is performed on Enron database and then compared with the RFC database in Table XXIII. The retrieval ratio values when the distance d increases between the query and the stored keyword are shown.

TABLE XXIII
RETRIEVAL RATIO, WHEN THE DISTANCE d CHANGES, FOR RFC AND ENRON DATABASES
d     | 0    | 0.2  | 0.3  | 0.4  | 0.5  | 0.7  | 1
Enron | 0.97 | 0.92 | 0.78 | 0.54 | 0.31 | 0.23 | 0.23
RFC   | 1    | 0.98 | 0.85 | 0.61 | 0.39 | 0.25 | 0.23

As we can see in Table XXIII, similar results are obtained for both databases RFC and Enron.

F. Security analysis
In this section, we analyse the security of the proposed algorithm. As assumed in all existing searchable encryption methods, we consider a semi-trusted server i.e. “honest-but-curious” server [1], [8]-[12], [15]-[17]. The security guarantee in this case is to prevent the cloud server from learning the plaintext of either the data files or the searched keywords, and achieve the “as strong-as-possible” security strength compared to existing searchable encryption schemes. In our scheme, the cloud provider can only see the encrypted files and indexes. The file content is clearly well protected due to the security
strength of the file encryption scheme. Thus, we only need to
focus on keyword privacy. In the following, we analyse the
security of both fuzzy transformation and ranking supported by our algorithm.
- Concerning the fuzzy transformation: The security requirement is typically characterized that nothing should be leaked except the result of a search, which is referred to as an access pattern or similarity pattern in our case because we are developing a similarity searchable encryption scheme and not a standard one. Thus, similar to [15], to achieve this, we perform multiple LSHs (Amplified LSH) on each keyword to construct the query. Thus, the number of common components between distinct queries may leak relative similarity between them. And, an adversary might infer some information about the similarity of the indexes. In this case, the similarity pattern is leaked instead of the search pattern which is acceptable in the case of similarity matching. We conclude that our fuzzy searchable encryption scheme does not leak any information beyond the trace, which is the maximum amount of information the content owner is willing to leak. In addition, the usage of chaos based LSH increases the computational indistinguishability from a totally random permutation based transformation which results in queries computationally indistinguishable from random values and thus the fuzzy keyword search scheme is secure regarding the search privacy.
- Concerning the ranking scheme: To ensure the security guarantee, the cloud server should learn very little about the relevance criteria as they exhibit significant sensitive information against keyword privacy. Similar to [11], the proposed ranking scheme uses a posting list that embeds the encrypted relevance scores in addition to file ID containing this keyword. These scores are encrypted using order preserving encryption (OPSE) which gives only a sequence of order-preserved numeric values. Though adversary may learn partial information from the duplicates (e.g., ciphertext scores duplicates may indicate very high corresponding plaintext scores duplicates), OPSE makes it difficult for the adversary to predict the original plaintext score. Thus, the keyword

and retrieved ratio curves are obtained. Our proposed algorithm supports the search with only one keyword and an extension of the proposed algorithm to enable conjunctive and disjunctive multi-keywords search, will be considered in the future work.

REFERENCES
[1] B. Yang, X. Pang, Q. Du, and D. Xie, "Effective Error-Tolerant Keyword Search for Secure Cloud Computing," Journal of Computer Science and Technology, vol. 29, no. 1, pp. 81-89, Jan. 2014.
[2] D. Boneh, G. D. Crescenzo, "Public key encryption with keyword search," in C. Cachin and J. Camenisch, editors, Advances in Cryptology, Eurocrypt, vol. 3027 of LNCS, pp. 506–522, Springer, 2004.
[3] S. Kamara, K. Lauter, "Cryptographic cloud storage," in Financial Cryptography and Data Security, pp. 136-149, Springer Berlin Heidelberg, 2010.
[4] S. Kamara, C. Papamanthou, T. Roeder, "CS2: A searchable cryptographic cloud storage system," Microsoft Research, Tech. Report MSR-TR, 2011.
[5] Y. Earn, R. Alsaqour, M. Abdelhaq, T. Abdullah, "Searchable symmetric encryption: review and evaluation," Journal of Theoretical and Applied Information Technology, vol. 30, 2011.
[6] R. Koletka, A. Hutchison, "An architecture for secure searchable cloud storage," IEEE, Information Security South Africa (ISSA), pp. 15-17, Aug. 2011.
[7] E. Stefanov, C. Papamanthou, E. Shi, "Practical Dynamic Searchable Encryption with Small Leakage," IACR Cryptology ePrint Archive, 2013.
[8] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, W. Lou, "Fuzzy keyword search over encrypted data in cloud computing," INFOCOM, 2010 Proceedings IEEE, Dept. of ECE, Illinois Inst. of Technol., Chicago, IL, USA, Mar. 2010.
[9] J. Bringer, H. Chabanne, B. Kindarji, "Error-tolerant searchable encryption," Communication and Information Systems Security Symposium, International Conference on Communications (ICC), Dresden, Germany, pp. 14-18, Jun. 2009.
[10] J. Yu, J. Li, X. Wang, W. Gao, "Conjunctive Fuzzy Keyword Search Over Encrypted Data in Cloud Computing," TELKOMNIKA Indonesian Journal of Electrical Engineering, vol. 12, no. 3, pp. 2104-2109, Mar. 2014.
privacy is also well preserved in our scheme. [11] C. Wang, N. Cao, J. Li, K. Ren, W. Lou, "Secure ranked keyword
search over encrypted cloud data," ICDCS '10 Proceedings of the
VII. CONCLUSION 2010 IEEE 30th International Conference on Distributed
Computing Systems, IEEE Computer Society Washington, DC,
In this paper, we proposed the first chaos based searchable USA, pp. 253-262, 2010.
encryption approach which also allows both ranked and fuzzy [12] R. Li, Z. Xu, W. Kang, K. Choong Yow, C. Z. Xu, "Efficient multi-
keyword ranked query over encrypted data in cloud computing,"
keyword searches on the encrypted data stored in the cloud.
Elsevier, Future Generation Computer Systems, vol. 30, pp. 179–
Our approach guarantees the privacy and confidentiality of the 190, 2014.
user even vis-à-vis the cloud provider who is semi-trusted in [13] J. Bringer, H. Chabanne, B. Kindarji, "Identification with
our case. The proposed method is designed to achieve encrypted biometric data, "Security and Communication Networks,
vol. 4, no. 5, pp. 548–562, May 2011.
effective retrieval of remotely stored encrypted data for
[14] J. Bringer, H. Chabanne, "Embedding edit distance to enable
mobile cloud computing scenarios. This scheme is private keyword search," Secure and Trust Computing, Data
implemented and evaluated using two databases: RFCs and Management and Applications, Communications in Computer and
the Enron database. Comprehensive tests have been performed Information Science, vol. 186, no. 1, pp. 105-113, 2011.
[15] M. kuzu, M. S. Islm, M. Kantarcioglu, "Efficient similarity search
to prove the efficiency of our proposition. First, the chaotic
over encrypted data," ICDE's12 proceedings of the 2012 IEEE
locality sensitive hashing method with 0‰ failure is selected. 28th International conference on data engineering, pp. 1156-1167,
Then, effects of different parameters of the amplification IEEE computer society Washington, DC, USA, 2012.
method (AND-OR construction) and the chaos, on the [16] W. Lu, A. Swaminathan, A. L. Varna, and M. Wu, "Enabling
Search over Encrypted Multimedia Databases," In IS&T/SPIE
efficiency of the algorithm, are shown when different numbers
Electronic Imaging, International Society for Optics and
of files are requested. The algorithm is also tested when Photonics, pp. 725418-725418, 2009.
different kind of errors (deletions, insertions, permutations and [17] A. Awad, A. Matthews, and B. Lee, "Secure cloud storage and
substitutions) occur in the query and similar precision, recall search scheme for mobile devices," in the 17th


IEEE Mediterranean Electrotechnical Conference (MELECON), pp. 144-150, Apr. 2014.
[18] A. Awad, A. Saadane, "New Chaotic Permutation Methods for Image Encryption," IAENG International Journal of Computer Science, vol. 37, no. 4, pp. 402-410, 2010.
[19] A. Awad, and A. Saadane, "Efficient chaotic permutations for image encryption algorithms," in Proceedings of the World Congress on Engineering, vol. 1, 2010.
[20] A. Awad, and D. Awad, "Efficient image chaotic encryption algorithm with no propagation error," ETRI Journal, vol. 32, no. 5, pp. 774-783, 2010.
[21] A. Awad, and A. Miri, "A new image encryption algorithm based on a chaotic DNA substitution method," in Communications (ICC), 2012 IEEE International Conference on, pp. 1011-1015, 2012.
[22] M. Ismail, G. Chalhoub, and B. Bakhache, "Evaluation of a fast symmetric cryptographic algorithm based on the chaos theory for wireless sensor networks," in Trust, Security and Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th International Conference on, pp. 913-919, 2012.
[23] R. Rostom, B. Bakhache, H. Salami, and A. Awad, "Quantum cryptography and chaos for the transmission of security keys in 802.11 networks," in the 17th IEEE Mediterranean Electrotechnical Conference (MELECON), pp. 350-356, Apr. 2014.
[24] A. Andoni, P. Indyk, "Near optimal hashing algorithms for approximate nearest neighbor in high dimensions," Communications of the ACM - 50th anniversary issue: 1958 - 2008, ACM New York, NY, USA, vol. 51, no. 1, pp. 117-122, Jan. 2008.
[25] A. Z. Broder, "On the resemblance and containment of documents," in Proceedings of Compression and Complexity of Sequences, pp. 21-29, 1997.
[26] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, "Min wise independent permutations," Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630-659, 2000.
[27] J. Bringer, H. Chabanne, and B. Kindarji, "Identification with encrypted biometric data," Security and Communication Networks, vol. 4, no. 5, pp. 548-562, 2011.
[28] A. Boldyreva, N. Chenette, Y. Lee, A. O'Neill, "Order preserving symmetric encryption," in Proceedings of Eurocrypt, vol. 5479 of LNCS, Springer, 2009.
[29] A. Awad, et al., "Comparative study of 1-D chaotic generators for digital data encryption," IAENG International Journal of Computer Science, vol. 35, no. 4, pp. 483-488, 2008.
[30] RFC, "Request for comments database," http://www.ietf.org/rfc.html.
[31] "Enron email dataset," http://www.cs.cmu.edu/enron, 2011.

Dr. Abir Awad received her PhD degree from Polytech' Nantes (France) in 2009. Later, she joined the Operational Cryptology and Virology Laboratory, ESIEA and Le Mans University in Laval in 2010 and the computer science lab. (LIFO), University of Orleans in 2011 as a lecturer-researcher. In 2012, she worked at the computer science dep., Ryerson University, Toronto (Canada). She is currently working as a researcher within the Irish Centre of Cloud Computing and Commerce (IC4) in Ireland. Her research interests include cloud security, searchable encryption, provenance, security analytics, trusted computing and chaotic cryptography.

Dr. Brian Lee is the Director of the Software Research Institute and holds a PhD from Trinity College Dublin in the application of programmable networking for network management. He has over 20 years' experience in fixed and mobile network management and has extensive experience of systems and software design and development for large telecommunications products. He was previously director of research for LM Ericsson Ireland with responsibility for overseeing all research activities. His research interests include programmable networking for network management and distributed data analytics for infrastructure management.

Dr. Adrian Matthews is a Research Engineer at the SRI. He holds a BSc in Physics and Applied Mathematics (1994), an MSc in Opto-Electronics (1995) and a PhD in atomic physics (1999), all from the Queen's University of Belfast. After several years working in Ericsson, Athlone, he joined the AIT in 2006 and has worked on many Enterprise Ireland funded projects for the SRI.

Dr. Yuansong Qiao received his PhD in Computer Applied Technology from the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, China, in 2007. As part of his Ph.D. research program he joined the SRI at Athlone I.T. in 2005. He continued his research in the SRI as a postdoctoral researcher in 2007. He is currently a postdoctoral researcher and the Principal Investigator on the COMAND Technology Gateway program in the Software Research Institute at Athlone Institute of Technology. His research interests include network protocol design and multimedia communications for the Future Internet.
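
The security analysis preceding the conclusion notes that, because each query is built from multiple LSHs (amplified LSH), the number of components shared by two queries can reveal their relative similarity. The following minimal Python sketch illustrates that effect under stated assumptions. It uses plain seeded SHA-256 MinHash over character bigrams as a stand-in for the chaotic LSH of the paper, and the names `bigrams`, `minhash`, `build_query` and `common_components`, as well as the parameter values `k` and `L`, are hypothetical choices made for illustration rather than the authors' implementation.

```python
import hashlib


def bigrams(word):
    """Character bigrams of a keyword, e.g. 'cloud' -> {'cl', 'lo', 'ou', 'ud'}."""
    if len(word) < 2:
        raise ValueError("keyword must contain at least two characters")
    return {word[i:i + 2] for i in range(len(word) - 1)}


def minhash(features, seed):
    """One MinHash value: the smallest seeded SHA-256 digest over the feature set.

    Assumption: seeded SHA-256 replaces the chaotic permutation used in the paper.
    """
    return min(
        hashlib.sha256(f"{seed}:{f}".encode()).hexdigest() for f in features
    )


def build_query(word, k=2, L=8):
    """AND-OR amplification: L components, each the AND (tuple) of k MinHash values."""
    feats = bigrams(word)
    return [
        tuple(minhash(feats, seed=band * k + j) for j in range(k))
        for band in range(L)
    ]


def common_components(q1, q2):
    """Number of positions at which two queries carry the same component."""
    return sum(a == b for a, b in zip(q1, q2))


if __name__ == "__main__":
    q = build_query("network")
    print(common_components(q, build_query("network")))  # same keyword
    print(common_components(q, build_query("netwrok")))  # transposition error
    print(common_components(q, build_query("zip")))      # no bigram in common
```

With these assumed parameters, identical keywords agree on all `L` components and keywords with no bigram in common agree on none. A misspelled keyword falls somewhere in between, which is the similarity pattern that an observer of the queries could learn. A deployed scheme would presumably also protect the components with a secret key, a step this sketch omits.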
