Smart Cloud Search Services Verifiable Keyword-Based Semantic Search Over Encrypted Cloud Data

This document discusses the need for smart cloud search services that can perform semantic searches over encrypted cloud data while supporting verifiable search results. It notes limitations of existing searchable encryption schemes, including only supporting exact keyword matching and assuming an honest-but-curious cloud server model. The document then proposes a new smart semantic search scheme that returns keyword-based exact and semantic match results while supporting verifiability of search outputs even if the cloud server is selfish.


IEEE Transactions on Consumer Electronics, Vol. 60, No. 4, November 2014

Smart Cloud Search Services: Verifiable Keyword-based Semantic Search over Encrypted Cloud Data
Zhangjie Fu, Member, IEEE, Jiangang Shu, Xingming Sun, and Nigel Linge

Abstract: With the increasing popularity of the pay-as-you-consume cloud computing paradigm, a large number of cloud services are pushed to consumers. On the one hand, this brings great convenience to consumers who use intelligent terminals; on the other hand, consumers face serious difficulty in finding the most suitable services or products in the cloud. How to enable a smart cloud search scheme is therefore a critical problem in the consumer-centric cloud computing paradigm. To protect data privacy, sensitive data are always encrypted before being outsourced. Although existing searchable encryption schemes enable users to search over encrypted data, these schemes support only exact keyword search, which greatly affects data usability. Moreover, these schemes do not support verifiability of the search result. To save computation cost or download bandwidth, a cloud server may conduct only a fraction of the search operation or return only part of the result; such a server is viewed as selfish and semi-honest-but-curious. So, how to enhance the flexibility of search over encrypted cloud data while supporting verifiability of the search result is a big challenge. To tackle this challenge, a smart semantic search scheme is proposed in this paper, which returns not only the result of keyword-based exact match but also the result of keyword-based semantic match. At the same time, the proposed scheme supports verifiability of the search result. Rigorous security analysis and performance analysis show that the proposed scheme is secure under the proposed model and effectively achieves the goal of keyword-based semantic search.¹

Index Terms: Consumer-centric cloud computing, privacy preserving, semantic search, verifiable search.

I. INTRODUCTION

The pay-as-you-consume cloud computing paradigm has become more and more prevalent, due to its benefits for consumers, including a large number of convenient services, relief of the storage burden, flexible data access, and reduced cost of hardware and software. Many companies have set up and provided various cloud computing services. More and more sensitive data from consumers (e.g., photo albums, emails, personal health records and financial transactions) have been centralized into the cloud for its flexible management and economic savings. Meanwhile, many technical schemes related to cloud computing services have been proposed by researchers. Noh et al. [1] proposed a flexible communication bus model for multimedia services in a cloud environment. Shahnaza et al. [2] proposed a realistic IEEE 802.11e EDCA model for QoS-aware differentiated multimedia mobile cloud services. Cabarcos et al. [3] proposed a middleware architecture that allows sessions initiated from one device to be seamlessly transferred to a second one under a cloud environment.

However, it is very difficult for ordinary consumers to find the most suitable services or products, as there are so many services and products in the cloud. Meanwhile, data outsourcing means that the data owner and the cloud service provider are no longer in the same trusted domain, so the data owner cannot manage the data in real time. It is common practice to encrypt sensitive information before outsourcing. However, data encryption makes existing plaintext search techniques inapplicable to ciphertext, which poses a big challenge to effective data utilization. The trivial solution of downloading the whole encrypted data set first and then decrypting it locally is obviously impractical, due to the huge bandwidth and computation burden. Consumers might want to retrieve only the specific data files they are interested in rather than the whole data collection. A popular way to address this problem is searchable encryption, which can retrieve specific files through keyword-based search while protecting the data and preserving keyword privacy. Therefore, searchable encryption is without doubt a subject worthy of study.

In recent years, a growing number of researchers have been studying searchable encryption, and various efficient search schemes over encrypted cloud data have been proposed. However, these keyword-based searchable encryption schemes no longer fully satisfy the new challenges and users' increasing needs, specifically in the following two aspects.

¹ This work is supported by the NSFC (61232016, 61173141, 61173142, 61173136, 61103215, 61373132, 61373133), GYHY201206033, 201301030, 2013DFG12860, BC2013012, PAPD fund, and the Prospective Research Project on Future Networks of Jiangsu Future Networks Innovation Institute (BY2013095-4-10).
Z. Fu is with the School of Computer and Software & Jiangsu Engineering Centre of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing 210044, CHINA (e-mail: [email protected]).
J. Shu is with the School of Computer and Software & Jiangsu Engineering Centre of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing 210044, CHINA (e-mail: [email protected]).
X. Sun is with the School of Computer and Software & Jiangsu Engineering Centre of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing 210044, CHINA (e-mail: [email protected]).
N. Linge is with the School of Computing, Science and Engineering, University of Salford, Salford, M5 4WT, UK (e-mail: [email protected]).
Contributed Paper
Manuscript received 09/26/14
Current version published 01/09/15
Electronic version published 01/09/15. 0098 3063/14/$20.00 © 2014 IEEE

One is that most of these schemes support only exact keyword search. That means the returned result depends entirely on whether the query terms entered by the user match the preset keywords. For example, if someone submits a query containing the term “host” (“host” is not in the preset keyword set but “server” is), he will retrieve an empty result, even though “host” and “server” are almost the same in the computer field.

The other is that most existing searchable encryption schemes assume that the cloud server is honest-but-curious. However, Chai et al. [4] note that the cloud server may be selfish in order to save its computation or download bandwidth. That is, the cloud server might conduct only a fraction of the search operation, or honestly return only a part of the result.

Besides, Fu et al. [5] recently proposed a multi-keyword search scheme in an encrypted cloud environment which supports synonym queries. The main contribution of that scheme is that it solved the problem of synonym search. Using this method, search results can be obtained when authorized cloud customers input synonyms of the predefined keywords, rather than the exact or fuzzy-matching keywords, which may happen because of synonym substitution and/or the customer's lack of exact knowledge about the data. This is a substantial improvement in the field of searchable encryption. However, that scheme has not addressed the problems of semantic search and verification of search results, which are solved in this paper.

To improve flexibility and support verification of the search result, a smart semantic search scheme is proposed in this paper. This scheme not only supports keyword-based semantic search over encrypted data, but also provides verifiable searchability while preserving data privacy. The contributions of this paper can be summarized as follows:

(1) This paper proposes a smart keyword-based semantic search scheme over encrypted data. By building a semantic tree in real time according to the original query terms, the scheme can find related words (including synonyms, various morphological forms, etc.) that are semantically similar to the original query terms and then carry out query expansion. The expanded query can find more related results, thereby improving the flexibility of the system.

(2) By combining the keyword-based semantic search scheme with verifiable symmetric searchable encryption, an efficient search scheme supporting verification of the completeness and correctness of the search result is proposed. The scheme is secure and privacy preserving according to the rigorous security analysis.

The rest of this paper is organized as follows. Section II summarizes related work. Section III presents the system model. The proposed verifiable search scheme is shown in Section IV. Sections V and VI present the security and performance analysis respectively. Finally, Section VII concludes the paper.

II. RELATED WORK

A. Consumer-centric Cloud Services

Noh et al. [1] proposed a flexible non-uniform communication bus model for multimedia services in a cloud computing environment. Shahnaza et al. [2] proposed a realistic IEEE 802.11e EDCA mathematical model for QoS-aware differentiated multimedia mobile cloud services. Cabarcos et al. [3] proposed a novel middleware architecture that allows sessions initiated from one device to be seamlessly transferred to a second one under a cloud computing environment. Díaz-Sánchez et al. [6] presented Media Cloud, a cloud computing middleware for set-top boxes for classifying, searching, and delivering media inside the home network and across the cloud. Grzonkowski et al. [7] proposed a user-centric approach to authentication for home networks. Sanchez et al. [8] proposed an IdM architecture based on privacy and reputation extensions to enable global scalability and usability for the consumer cloud computing paradigm. However, all these services are likely to be available to consumers only on the premise that a smart and efficient cloud search service is achieved.

B. Searchable Encryption in Cloud

To apply searchable encryption to cloud computing, some researchers have been studying how to search over encrypted cloud data efficiently. Li et al. [9] first proposed a fuzzy keyword search scheme over encrypted cloud data. Wang et al. [10] proposed a secure ranked search scheme, but this scheme supports only single keyword search. Then Cao et al. [11] proposed a privacy-preserving ranked scheme supporting multiple keywords, which lacks flexibility. The fuzzy keyword search scheme proposed by Li et al. [9] only tackles the problems of minor typos and format inconsistencies, and does not meet the needs of users who want to retrieve as many relevant data files as possible. Chai et al. [4] proposed a verifiable search scheme, which can prove the correctness and completeness of the result efficiently. However, there are some security problems that are not addressed properly in that paper. Based on VSSE and fuzzy keyword search, Wang et al. [12] proposed a scheme supporting both verification and fuzzy search, but the scheme ignores result ranking.

Fu et al. [5] proposed a multi-keyword search scheme in an encrypted cloud environment which solved the problem of synonym queries. This is a good step forward in the field of searchable encryption. However, that scheme has not addressed the problems of semantic search and verification of search results. This paper proposes a semantic search scheme over encrypted data that builds a semantic tree in real time according to the original query terms. The scheme is smarter than the previous scheme and can improve user experience.

III. PROBLEM FORMULATION

A. System Model

The system model considered in the paper involves three different entities: the data owner, the data user and the cloud server, as illustrated in Fig. 1. The data owner first outsources a large document collection D in the encrypted form C, together with an encrypted search index generated from D, to the cloud server. The cloud server provides authorized users with the search service over the encrypted data C. It is assumed that mutual authentication between the data owner and the data user has been completed. The authorized user just generates the trapdoors for the keywords and sends them to the cloud server. Upon receiving a search request from an authorized user, the cloud server starts
performing the search over the index and then sends back all the corresponding data matching the specific trapdoors. The capacity of the user to decrypt the received data files is a separate problem and is out of the scope of this paper.

Fig. 1. Framework of search over encrypted cloud data

B. Threat Model

In this work, the cloud server is considered “semi-honest-but-curious”, which differs greatly from most previous searchable encryption schemes. The cloud server acts in a selfish way, in that it may not completely follow the designed protocol specification. Note that the verifiable keyword-based semantic search scheme in this paper follows the security definition deployed in traditional searchable symmetric encryption [13]. That is, nothing can be leaked from the encrypted data files and search index beyond the search pattern and the access pattern.

C. Notations

For the sake of clarity, the main notations used in the paper are introduced as follows.
$D$: the plaintext document collection, denoted as $D = \{D_1, \ldots, D_m\}$;
$C$: the encrypted document collection stored in the cloud, denoted as $C = \{C_1, \ldots, C_m\}$;
$W$: the keyword set consisting of the $n$ keywords of $D$, denoted as $W = \{W_1, \ldots, W_n\}$;
$\pi$: a keyed hash function such as SHA-1, denoted as $\{0,1\}^* \times \{0,1\}^K \to \{0,1\}^L$;
$g_k$: a pseudo-random function, denoted as $\{0,1\}^* \times K \to \{0,1\}^L$;
$\varepsilon_z$: a block cipher such as DES;
$\Delta = \{a_i\}$: the predefined symbol set, where each $a_i$ is a $\theta$-bit binary vector and $|\Delta| = 2^{\theta}$;
$T_{w_i} = \{a_{i,1}, a_{i,2}, \ldots, a_{i,L/\theta}\}$: the trapdoor generated by the user as the search request for keyword $w_i$;
$Ord(a_i)$: returns the alphabetic order of $a_i \in \Delta$;
$G_{x,y}$: the $y$-th node from left to right at depth $x$ in a tree $G$.

D. Preliminary

Rank function
In information retrieval, a ranking function is usually used to evaluate the relevance scores of files matching a request. Among the many ranking functions, the “TF×IDF” rule is the most widely used, where TF (term frequency) denotes the number of occurrences of the term in the document, and IDF (inverse document frequency) is usually obtained by dividing the total number of documents by the number of files containing the term. That is, TF represents the importance of the term within the document, and IDF indicates the importance or degree of distinction of the term within the whole document collection. Among the hundreds of variations of the TF×IDF scheme, the basic scheme is adopted without loss of generality. The following statistics are used in the ranking function:
$f_{d,t}$: the TF of the term $t$ in document $D_d$;
$f_t$: the number of documents containing the term $t$;
$N$: the total number of documents in the collection.
The score of the term in the document can be calculated as follows:

$$S_{t,D_d} = \frac{\ln(1+f_{d,t}) \times \ln\left(1+\frac{N}{f_t}\right)}{\sum_{t \in D_d} \ln(1+f_{d,t}) \times \ln\left(1+\frac{N}{f_t}\right)} \tag{1}$$

In this scheme, after the keyword semantic extension is executed, two categories of terms are obtained: the original query terms $Q_o$ (the terms the user typed in) and the extended query terms $Q_e$ (the terms semantically similar to the original query terms). The ranking function is defined as follows:

$$Score(Q, D_d) = a \sum_{t \in Q_o} S_{t,D_d} + (1-a) \sum_{t \in Q_e} S_{t,D_d} \tag{2}$$

where $a \in (0.5, 1)$ can be adjusted in the experiment.

m-best tree
The m-best tree is the unit of the semantic tree model and is composed of keywords, as shown in Fig. 2. sim(q,p) denotes the similarity between word q and word p. Given any word q, a tree model can be used to express its relationships with any other words, as illustrated in the left part of Fig. 2. In Fig. 2, the word q is the root node and the other words are leaf nodes. The path between the root and a leaf shows the similarity between them. It is worth noting that all the leaf nodes are sorted from left to right according to their similarity with the root node. That is, the more similar a leaf node is to the root node, the further left it is placed. In other words, the leaves satisfy the following formula:

$$sim(q, p_1) \ge sim(q, p_2) \ge \cdots \ge sim(q, p_m) \tag{3}$$

Fig. 2. m-best tree

sim(q,p) can be calculated with WordNet. The leftmost m leaf nodes are chosen and the other leaf nodes are discarded. The result is called the m-best tree, which is the unit in the semantic tree
model, as shown in the right part of Fig. 2. Note that the variable m can be adjusted according to the specific situation.

Term Similarity Tree
Given a query term vector $Q = (q_1, q_2, \ldots, q_K)$ containing K terms, a term similarity tree TST(Q,v,m) based on Q can be built in real time, as shown in Fig. 3.

Fig. 3. TST(Q,v,m)

The variable v is the number of layers in the tree, and the variable m means that each unit of the tree is an m-best tree. In other words, each internal node has m leaf nodes. With the term similarity tree TST(Q,v,m), the similarity between the root node and any internal node or leaf node can be easily calculated.

WordNet
WordNet [15] is a lexical database for the English language, created by Princeton University. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. A variety of semantic similarity and relatedness measures based on WordNet can be easily implemented.

IV. CONSTRUCTION OF VERIFIABLE KEYWORD-BASED SEMANTIC SEARCH IN CLOUD

The key idea behind the verifiable keyword-based semantic search scheme is two-fold: 1) building the semantic tree model mentioned above and choosing the semantically similar keywords that satisfy certain criteria; 2) designing a verifiable keyword-based semantic search scheme that supports verifying the correctness and completeness of the result.

A. Technique for Keyword Semantic Extension Query

When the authorized user wants to retrieve data files of interest, he may type in some query terms, denoted as $Q = (q_1, q_2, \ldots, q_K)$, which contains K query terms. To make search more practical and flexible, a keyword extension technique is proposed that extends the original query terms with appropriate additional terms semantically related to them. Here, the semantic similarity tree TST(Q,v,m) is built to extend the original query terms Q.

To extend the query terms, TST(Q,v,m) is built first, and then the extended terms meeting the requirements are chosen. If a term w satisfies the following criteria, it can be considered an extended term.

$$\begin{cases} sim(Q, w) = \sum_{i=1}^{K} sim(q_i, w) \ge K \times \varphi(v) \\ overlay(TST(Q, v, m), w) \ge \left\lfloor \frac{K}{2} \right\rfloor + 1 \end{cases} \tag{4}$$

sim(Q,w) is the similarity between the original query terms Q and the term w. sim(q_i,w) is the similarity between the query term q_i and the term w. $\varphi(v)$ is an increasing function of the level v, defined as follows:

$$\varphi(v) = \sin\left(\frac{\pi v}{60}\right) + 0.5, \quad v \in (0, 10),\ v \in \mathbb{N} \tag{5}$$

In the formula above, when $v \in (0,10)$, $v \in \mathbb{N}$, the value of $\varphi(v)$ ranges from 0.5 to nearly 1 and increases with the level v.

overlay(TST(Q,v,m), w) denotes the number of subtrees of TST in which the term w exists. $\lfloor K/2 \rfloor$ always rounds K/2 down to the nearest whole number.

In (4), the first condition evaluates the similarity between the term w and the individual query terms $q_1, q_2, \ldots, q_K$, and the second condition represents the similarity between the term w and the original query Q as a whole. The whole query Q usually represents the user's semantic intent, which can be seen as a point in the semantic space, and the extended terms should be close to that point. That is, an extended term should be similar to the whole query Q rather than to only one query term q_i in Q, because the whole query Q expresses an explicit meaning, whereas a single query term q_i cannot.

Here is an example. If a user submits the query Q = (computing, system), then the term similarity tree TST(Q,2,3) is built in real time, as shown in Fig. 4.

The term “plan” is used to test whether it is an extended term satisfying the requirements. In Fig. 4 there are two paths from “computing” to “plan”: “computing-mapping-plan” and “computing-representation-plan”. Of these, “computing-mapping-plan” is the shortest path. The similarity between “computing” and “plan” can be calculated as below:

$$sim(computing, plan) = sim(computing, mapping) \times sim(mapping, plan) = 0.89 \times 0.86 = 0.7654$$

The shortest path between “system” and “plan” is “system-scheme-plan”, so the similarity between “system” and “plan” is calculated as follows:

$$sim(system, plan) = sim(system, scheme) \times sim(scheme, plan) = 1 \times 0.94 = 0.94$$


Fig. 4. TST(Q,2,3) example

So, for the term “plan”, the two conditions of (4) evaluate as follows:

$$sim((computing, system), plan) = 1.7054 \ge 2 \times \left[\sin\left(\frac{3\pi}{60}\right) + 0.5\right]$$
$$overlay(TST(Q, 2, 3), w) = 2 \ge \left\lfloor \frac{2}{2} \right\rfloor + 1 \tag{6}$$

The term “plan” satisfies both conditions above, so it is considered an extended term.

Similarly, the other terms can be tested. After all terms are tested, the extended query Q = (computing, system, plan, scheme, representation, control) is obtained.

B. The Verifiable Keyword-based Semantic Search Scheme

This section presents the proposed scheme in detail. The scheme includes five algorithms (Setup, GenIndex, GenQuery, Search, and Verify).

• Setup
In this initialization phase, the data owner initiates the scheme by generating a random key $k \xleftarrow{R} \{0,1\}^{K}$ and a secret key $z \xleftarrow{R} \{0,1\}^{L}$.

• GenIndex
To improve search efficiency, a symbol-based trie is built to store elements over a finite symbol set. The symbol-based trie is a multi-way tree in which all trapdoors sharing a common prefix have a common parent node. The root node is empty, and the symbols of a trapdoor can be read off during a search along the path from the root node to the leaf node that ends the trapdoor. Assume that $\Delta = \{a_i\}$ is a predefined symbol set, where the number of distinct symbols is $|\Delta| = 2^{\theta}$ and each symbol $a_i \in \Delta$ is a $\theta$-bit binary vector. Below, L is the output length of the function $\pi(k, \cdot)$.

Preprocess:
(1) The data owner scans the plaintext document collection D and extracts the distinct keywords of D, denoted as W;
(2) The data owner computes the score $S_{W_i,D_j}$ for each $W_i \in W$ and $D_j \in D$. For all data files containing the keyword $W_i$, the identifier set is denoted as
$$FID_{W_i} = ID(D_1) \,\|\, \varepsilon_z(S_{W_i,D_1}) \,\|\, \ldots \,\|\, ID(D_j) \,\|\, \varepsilon_z(S_{W_i,D_j}).$$

Build the symbol-based index trie:
(1) The data owner computes $T_{W_i} = \pi(k, w_i)$ for each $W_i \in W$ with the random key k, and then divides each trapdoor into symbols as $T_{W_i} = \{a_{i,1}, a_{i,2}, \ldots, a_{i,L/\theta}\}$.
(2) The data owner builds a symbol-based index trie G covering $T_{W_i}$ for every $W_i \in W$, where each node contains two attributes $(r_0, r_1)$. $r_0$ stores the node's symbol from $\Delta$; $r_1$ is a globally unique value path||memory||$g_k$(path||memory) in G. The path is the sequence of symbols from the root to the current node, denoted as $a_{i,1}, a_{i,2}, \ldots, a_{i,j}$, where $j \le L/\theta$. The memory is a $2^{\theta}$-bit binary string that represents the set of child nodes of the current node: if the current node has a child whose $r_0$ is the i-th symbol in $\Delta$, then the i-th bit is set to “1”; all other bits are set to “0”. In parallel with building the search index G, the plaintext documents are separately encrypted with a traditional symmetric cipher.
(3) The data owner attaches IDSet, which is $\{FID_{W_i} \,\|\, g_k(FID_{W_i})\}_{1 \le i \le n}$, to the index G and outsources it together with the encrypted files to the cloud server.

• GenQuery
(1) When the user inputs the query terms $Q = (q_1, q_2, \ldots, q_k)$, the user first builds the term similarity tree TST(Q,v,m) and executes the keyword semantic extension, obtaining the extended query $Q = (q_1, q_2, \ldots, q_k, q_{k+1}, \ldots, q_m)$;
(2) For each $q_i \in Q$, the user computes the trapdoor $T_{q_i} = \pi(k, q_i)$, divides it into symbols as $T_{q_i} = \{a_{i,1}, a_{i,2}, \ldots, a_{i,L/\theta}\}$, and finally sends $\{T_{q_i}\}_{q_i \in Q}$ to the cloud server. Meanwhile, the user should store $\{T_{q_i}\}_{q_i \in Q}$ temporarily in order to verify the search result later.

• Search
Upon receiving the search request, the cloud server performs the search operation over the index G. The search essentially looks for a path in G, from the root node to a leaf node, that matches the search request. The existence of such a path indicates that the queried word occurs in at least one of the targeted data files. During every step of path exploration, the cloud server produces the proof, which is later returned to the user together with the FIDs so that the validity of the search outcome can be checked. Note
that the proof is the r1 of each node found in the path during extension here, GenQuery produces Tw2  {a5 , a6 , a8 } and sends it
search, which is a globally unique value.
to the cloud server. Upon receiving the search request, the cloud
 Verify & Rank server performs the search over index trie G. The server thus
When the user receives the outcome from the cloud server, honestly returns “Yes” with “x” and the proof
he can verify the correctness and completeness of search result. ( || 10001001|| a5 || 00000110 || a5 a6 || 0010000 | a5 a6 a3 || 00000000 || 3) .
The key idea behind it is that the outcome returned by the In another scenario, the cloud server deceives the user that the term
cloud server contains the proof, which is a globally unique w2 does not exist in G and returns “No” and the proof
value and is produced by a pseudo-random function gk with ( ||10001001|| 0) . Verify could test the proof and inform the
the random key k. Without the random key k, which is only
shared in authorized users, the cloud server cannot forge a user the symbol “a5” is a child of the node being verified. So, the
valid proof. The outcome returned by the cloud server can be user will find the cloud server is dishonest.
divided into two situations: successful and unsuccessful.
(1)If the outcome is successful, the outcome will contain
IDSet and proof. Firstly, the user can verify the completeness of
the IDSet, which consists of {FIDWi || g k ( FIDWi )}1i  n . The user
can extract the FIDWi and compute g k ( FIDWi ) , where FIDWi
is the concatenation of identifiers received by the user. Then the
user can test whether g k ( FIDWi ) is equal to the received
g k ( FIDWi ) . If they are equal, the user can consider the search
result is complete. Otherwise, the search result is incomplete.
After the first step, the user will utilize the proof to verify the
correctness of the search outcome. Similar to the first step, the
user computes the gk(path||memory) and tests whether it is equal
to the received gk(path||memory) , where the path||memory is
the former part of proof. If they are not equal, the user can see Fig. 5. A symbol-based index trie
the cloud server is not worth being trusted.
V. SECURITY ANALYSIS
(2)If the outcome is unsuccessful, the user could directly
verify the correctness of the search outcome. The proof is Data privacy:
returned in the format of $G_{0,y_0}[r_1] \,\|\, \dots \,\|\, G_{j,y_j}[r_1] \,\|\, j$. b[j] is defined as a j-bit vector in which the last bit is set to "0" and all other bits are set to "1". This part of the process verifies each unit $\{G_{j,y_j}[r_1]\}$ of the proof in three steps:
(a) The user computes gk(path||memory) and tests whether it is equal to the received gk(path||memory), where path||memory is the former part of the proof.
(b) If the first step passes, the user tests whether the received path is equal to the corresponding $\{a_{i,1}, a_{i,2}, \dots, a_{i,j}\}$ stored on the user side.
(c) If the second step also passes, the user tests whether $memory[ord(T_{q_i}[j+1])]$ is equal to b[j+1], where $T_{q_i}[j+1]$ is the symbol that follows the current node in the symbol sequence of the trapdoor. If they are not equal, the cloud server is lazy, that is, it executed only a fraction of the search.
After verifying that the outcome is correct and complete, the user can decrypt IDSet with the decryption key z and sort the returned data files.

C. A Concrete Example
To make the search scheme clearer, an example is illustrated in Fig. 5. In this trie, W = {w1, w2, w3, w4} and the symbol set is $\{a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8\}$. The index trie G is constructed by GenIndex, where each node is a tuple (r0, r1); e.g., for the root node, r0 = 0 and r1 = 10001001, where "10001001" indicates that the nodes "a1", "a5" and "a8" are in its child set. Note that the gk(path||memory) part of r1 is omitted here. To search the term w2, ignoring

(1) Document Privacy: The documents are encrypted separately, and their confidentiality is guaranteed by the cipher. With a sufficiently secure cipher, the encrypted documents leak no information except their length, size and document IDs.
(2) Query Privacy: The trapdoor $T_{q_i} = \{a_{i,1}, a_{i,2}, \dots, a_{i,L/\theta}\}$, computed from the query keyword $q_i$ under the secret key k, can be seen as a collection of $L/\theta$ symbols. Its confidentiality is guaranteed by a one-way hash function keyed with the secret key k, which is shared among data owners and authorized users. With this hash function, only authorized users can generate valid trapdoors; without knowing the secret key k, the cloud server cannot. Because the number of keywords is limited (2500 distinct words according to the Oxford dictionary), the cloud server could easily mount a brute-force attack against an unkeyed hash. A key-based hash function is therefore used to resist such an attack.
(3) Index Privacy: In the index trie, each node is a tuple (r0, r1), where r0 and r1 are hash values. Every keyword is hashed to an L-bit binary string, so each keyword corresponds to a path of length $L/\theta$ in the index G, regardless of the length of the word stored in it. The length of a path in Chai's scheme [4] depends on the length of the corresponding word, which could reveal the presence of particular words, such as "Floccinaucinihilipilification". The proposed method avoids this problem. The relevance score is also encrypted with a block cipher. The scheme therefore not only protects the "shape" of the index, but also prevents the distribution of keyword scores from leaking to the cloud server.
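The mapping from a keyword to a fixed-length symbol path, as described under Query Privacy and Index Privacy, can be sketched as follows. This is only an illustration under stated assumptions: HMAC-SHA1 stands in for the paper's keyed one-way hash function (the paper names SHA-1 with L = 160 and θ = 4 in its experiments, but does not say how the key is mixed in), and the function name `trapdoor_symbols` is hypothetical.

```python
import hashlib
import hmac
from typing import List


def trapdoor_symbols(key: bytes, keyword: str, theta: int = 4) -> List[int]:
    """Hash `keyword` under `key` and split the digest into theta-bit symbols.

    Assumption: HMAC-SHA1 plays the role of the keyed one-way hash function.
    """
    digest = hmac.new(key, keyword.encode("utf-8"), hashlib.sha1).digest()
    total_bits = len(digest) * 8  # L = 160 for SHA-1
    if total_bits % theta != 0:
        raise ValueError("theta must divide the hash output length")
    n_symbols = total_bits // theta  # L / theta
    value = int.from_bytes(digest, "big")
    mask = (1 << theta) - 1
    # Emit symbols from the most significant group to the least significant.
    return [(value >> (theta * (n_symbols - 1 - i))) & mask
            for i in range(n_symbols)]


if __name__ == "__main__":
    k = b"shared-secret-key"  # hypothetical key shared by owner and users
    # Short and long keywords both map to paths of the same length.
    print(len(trapdoor_symbols(k, "host")))
    print(len(trapdoor_symbols(k, "Floccinaucinihilipilification")))
```

Because every path has the same length L/θ regardless of the keyword, the path length itself reveals nothing about the word, which is the point made under Index Privacy.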

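The three verification checks (a) to (c) applied to one proof unit can be sketched as below. This is a simplified illustration, not the paper's Verify algorithm: HMAC-SHA1 stands in for gk, the byte encoding of path||memory (with a separator byte) is an assumption, and the names `g`, `verify_unit` and the `expected_bit` parameter (the bit the user expects from the vector b) are hypothetical.

```python
import hashlib
import hmac
from typing import Sequence


def g(key: bytes, path: Sequence[int], memory: Sequence[int]) -> bytes:
    """Keyed tag over path||memory (assumption: HMAC-SHA1 stands in for g_k)."""
    # Symbols are below 16 and memory bits are 0/1, so the separator byte
    # 0x7C cannot be confused with either part.
    msg = bytes(path) + b"|" + bytes(memory)
    return hmac.new(key, msg, hashlib.sha1).digest()


def verify_unit(key: bytes, path: Sequence[int], memory: Sequence[int],
                tag: bytes, trapdoor: Sequence[int], expected_bit: int) -> bool:
    """Apply checks (a), (b) and (c) to a single unit of the proof."""
    # (a) The received tag must equal the recomputed g_k(path||memory).
    if not hmac.compare_digest(g(key, path, memory), tag):
        return False
    # (b) The received path must equal the trapdoor prefix held by the user.
    j = len(path)
    if list(path) != list(trapdoor[:j]):
        return False
    # (c) The child bit for the next trapdoor symbol must be the expected one.
    if j < len(trapdoor):
        return memory[trapdoor[j]] == expected_bit
    return True


if __name__ == "__main__":
    k = b"shared-secret-key"   # hypothetical key
    trapdoor = [1, 5, 8]       # toy trapdoor with three symbols
    memory = [0] * 16
    memory[5] = 1              # the node at path [1] has a child for symbol 5
    tag = g(k, [1], memory)
    print(verify_unit(k, [1], memory, tag, trapdoor, expected_bit=1))
```

A server that claims the search ended at this node (so the user expects the bit to be 0) fails check (c), since the authenticated memory shows a child for the next symbol; a server that alters memory fails check (a) because it cannot recompute the tag without the key.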

Verifiable Searchability:
To prove the scheme's verifiable searchability, it suffices to prove that the cloud server cannot forge a valid proof in order to tamper with the search result [4]. There are three possible ways to attempt this:
(1) try to generate a valid r1 with $path \,\|\, memory' \neq path \,\|\, memory$;
(2) randomly generate an r1 to replace the original one;
(3) use the r1 of another node.
Without the random key k, the first two methods can cheat the proposed algorithm only with negligible probability. For method 3, the r1 of each node in G is a globally unique value, so the r1 of any other node is rejected by algorithm Verify. In conclusion, the cloud server cannot forge a valid proof without the key k.

VI. PERFORMANCE ANALYSIS
A. Performance Comparison
The proposed scheme is compared with the previous SSE scheme SSE-1 [13], Wang's scheme [10] and Chai's scheme [4]. To make the comparison clearer, n is defined as the number of distinct keywords, M as the maximum size of the fuzzy keyword set, and C as the size of the m-best tree. The comparison among these schemes is shown in Table I.

TABLE I
COMPARISON OF THE SCHEMES

                           SSE-1    Wang's   Chai's   Proposed scheme
Storage in cloud           O(n)     O(Mn)    O(n)     O(n)
Search in cloud            -        -        -        O(nC)
Search                     O(1)     O(1)     O(L)     O(1)
Verifiable Searchability   No       Yes      Yes      Yes
Verify cost                -        O(1)     O(L)     O(1)
Fuzzy search               No       Yes      No       No
Semantic search            No       No       No       Yes

Compared to the traditional SSE-1, the proposed scheme provides verifiable searchability. Compared to Wang's scheme, the proposed scheme does not support fuzzy keyword search but supports keyword-based semantic search, which is more flexible in practical applications. Compared to Chai's scheme, the proposed scheme reduces the search cost and verify cost from O(L) to O(1), where L is the length of the keyword. Trapdoor generation in Chai's scheme requires L operations, whereas the proposed scheme requires only one. Verification in Chai's scheme requires L decryption operations, whereas the proposed scheme needs none. However, to achieve semantic search, the owner needs to pre-calculate the similarity between keywords, produce the m-best tree for each keyword and store them on the user side. The extra storage cost O(nC) is within reasonable limits that the user can afford.
To evaluate efficiency and practicality, a thorough experimental evaluation of the proposed scheme is conducted on a real data set: the Request for Comments database (RFC).

B. Performance Evaluation

Testing of m-best Tree
On the data owner side, constructing the m-best tree for each keyword is very slow, but it can be conducted off-line and is a one-time cost. Some examples are shown in Table II.

TABLE II
SOME EXAMPLES OF TERM SIMILARITY

Query Word    m-best Tree
software      (package,1),(procedure,0.95),(function,0.95),(program,0.95),(OS,0.95)
host          (server,1),(computer,0.95),(organization,0.92),(client,0.90),(site,0.90)
computing     (computation),(calculate,1),(cipher,1),(engineering,0.95),(operation,0.95)
control       (operate,1),(hold,1),(checking,1),(command,1),(assure,1)
transition    (conversion,1),(shift,1),(transformation,0.94),(change,0.93),(modification,0.93)
transport     (carry,1),(channel,1),(send,1),(transmit,1),(transfer,1)
requirement   (necessary,1),(essential,1),(demand,1),(responsibility,0.94),(require,0.90)
job           (task,1),(lines,1),(problems,1),(business,1),(application,0.96)
computer      (client,0.95),(node,0.95),(host,0.95),(predictor,0.95),(server,0.95)
mail          (post,1),(send,1),(transport,0.93),(communication,0.93),(message,0.91)

Testing of Building Symbol-based Trie Tree
In the experiment, θ=4 is chosen and SHA-1 is used as the hash function, with an output length of L=160 bits. The height of the symbol-based trie is therefore L/θ=40, which means that every path in the trie has length 40. Fig. 6 shows the trie construction time, which increases linearly with the number of distinct keywords. The construction is fast, can be conducted off-line, and is a one-time cost.

Fig. 6. Time cost to build trie

Testing of GenQuery
With the help of the m-best tree, a user who inputs a single query term will find its synonyms, various morphological forms and similar words. A user who inputs several query terms will find some words close to the whole query under the construction of


term similarity tree. Therefore, the proposed semantic search scheme supports both single-keyword and multi-keyword search. In the test, 20 users were invited to conduct a large number of queries to test the performance of keyword-based semantic extension. During the test, the variables m and v of the term similarity tree were continually adjusted according to the feedback of the users. Fig. 7 shows the time cost to generate the query for a single keyword.

Fig. 7. Time cost to generate query

Testing of Search
Given a search index comprising 5806 distinct keywords, the time cost of the search operation is measured and shown in Fig. 8. The estimated search throughput is 4000 words/second. In addition, searching for an irrelevant word is much faster, because the search stops as soon as there is a mismatch between the symbols of the trapdoor and the index. This "incomplete traversing" saves a lot of operating time.

Fig. 8. Time cost to search

Testing of Verify & Rank
The search returns the proof and the encrypted result set, which has to be decrypted by the user. The cost of the rank operation is determined by the size of the retrieved result set rather than by the index size or the document collection. Here, the measurement focuses on the cost of verify, which depends on L (the output length of the hash function) and θ (θ-bit grouping), rather than on the length of the query word as in Chai's scheme [4]. The cost of verifying a valid proof is therefore constant. For completeness, Fig. 9 shows the time cost to verify valid proofs, with counts varying from 500 to 5500 in steps of 500. The estimated verify throughput is 7400 trapdoors/second, which shows that the proposed scheme is efficient and practical even for end users with constrained resources.

Fig. 9. Time cost to verify valid proofs (x-axis: number of valid proofs, ×10^3; y-axis: total verify cost, ×10^-2 s)

VII. CONCLUSION
This paper proposes an efficient verifiable keyword-based semantic search scheme. Compared to most existing searchable encryption schemes, the proposed scheme is more practical and flexible, better suiting users' different search intentions. Moreover, the proposed scheme protects data privacy and supports verifiable searchability in the presence of a semi-honest server in the cloud computing environment. Theoretical analysis and an experimental study using a real data set show that the proposed scheme is efficient.
Future work is to further study semantic search over outsourced encrypted cloud data. It may include the problems of conjunction and sequence or location of keywords, sentence-level semantic querying, and even semantic search on mobile devices.

REFERENCES
[1] W. Noh and T. Kim, "Flexible communication-bus architecture for distributed multimedia service in cloud computing platform," IEEE Trans. Consumer Electron., vol. 59, no. 3, pp. 530-537, 2013.
[2] T. Shahnaza and Y. Kim, "Realistic IEEE 802.11e EDCA model for QoS-aware mobile cloud service provisioning," IEEE Trans. Consumer Electron., vol. 58, no. 1, pp. 60-68, 2012.
[3] P. A. Cabarcos, F. A. Mendoza, R. S. Guerrero, A. M. Lopez, and D. Diaz-Sanchez, "SuSSo: seamless and ubiquitous single sign-on for cloud service continuity across devices," IEEE Trans. Consumer Electron., vol. 58, no. 4, pp. 1425-1433, 2012.
[4] Q. Chai and G. Gong, "Verifiable Symmetric Searchable Encryption for Semi-Honest-but-Curious Cloud Servers," Proceedings of IEEE International Conference on Communications (ICC'12), pp. 917-922, 2012.
[5] Z. Fu, X. Sun, N. Linge, and L. Zhou, "Achieving Effective Cloud Search Services: Multi-keyword Ranked Search over Encrypted Cloud Data Supporting Synonym Query," IEEE Trans. Consumer Electron., vol. 60, no. 1, pp. 164-172, 2014.
[6] D. Diaz-Sanchez, F. Almenarez, A. Marin, D. Proserpio, and P. A. Cabarcos, "Media cloud: an open cloud computing middleware for content management," IEEE Trans. Consumer Electron., vol. 57, no. 2, pp. 970-978, 2011.
[7] S. Grzonkowski and P. M. Corcoran, "Sharing cloud services: user authentication for social enhancement of home networking," IEEE Trans. Consumer Electron., vol. 57, no. 3, pp. 1424-1432, 2011.
[8] R. Sanchez, F. Almenares, P. Arias, D. Diaz-Sanchez, and A. Marin, "Enhancing privacy and dynamic federation in IdM for consumer cloud computing," IEEE Trans. Consumer Electron., vol. 58, no. 1, pp. 95-103, 2012.


[9] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. J. Lou, "Fuzzy keyword search over encrypted data in cloud computing," Proceedings of IEEE INFOCOM 2010, San Diego, CA, USA, pp. 1-5, 2010.
[10] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, "Secure ranked keyword search over encrypted cloud data," Proceedings of IEEE 30th International Conference on Distributed Computing Systems (ICDCS), pp. 253-262, 2010.
[11] N. Cao, C. Wang, M. Li, K. Ren, and W. J. Lou, "Privacy-preserving multi-keyword ranked search over encrypted cloud data," Proceedings of IEEE INFOCOM 2011, pp. 829-837, 2011.
[12] J. Wang, H. Ma, and Q. Tang, "A new efficient verifiable fuzzy keyword search scheme," Journal of Wireless Mobile Networks, Ubiquitous Computing and Dependable Applications, vol. 3, no. 4, pp. 61-71, 2012.
[13] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: improved definitions and efficient constructions," Proceedings of ACM Conference on Computer and Communications Security (CCS'06), Alexandria, VA, pp. 79-88, 2006.
[14] J. Zhao, Q. Jin, and B. Xu, "Semantic computation for text retrieval," Chinese Journal of Computers, vol. 28, no. 12, pp. 2068-2078, 2005.
[15] G. A. Miller, R. Beckwith, C. D. Fellbaum, D. Gross, and K. Miller, "WordNet: An online lexical database," Int. J. Lexicograph., vol. 3, no. 4, pp. 235-244, 1990.

BIOGRAPHIES
Zhangjie Fu (M'2013) received his BS in education technology from Xinyang Normal University, China, in 2006; received his MS in education technology from the College of Physics and Microelectronics Science, Hunan University, China, in 2008; obtained his PhD in computer science from the College of Computer, Hunan University, China, in 2012. Currently, he works as an assistant professor in College of Computer and Software, Nanjing University of Information Science and Technology, China. His research interests include information systems, protocols, mobile systems and cloud computing.

Jiangang Shu received his BE in Network Technology and Engineering from Nanjing University of Information Science & Technology (NUIST), Nanjing, China, in 2012. He is currently pursuing his MS in computer science and technology at the School of Computer & Software, Nanjing University of Information Science and Technology, China. His research interests include cloud security, steganography, network and information security.

Xingming Sun is a professor in the School of Computer and Software, Nanjing University of Information Science and Technology, China from 2011. He received the B.S. degree in Mathematical Science from Hunan Normal University and M.S. degree in Mathematical Science from Dalian University of Technology in 1984 and 1988, respectively. Then, he received the Ph.D degree in Computer Engineering from Fudan University in 2001. His research interests include mobile systems, applications of networking technology, information systems, cryptography and ubiquitous computing.

Nigel Linge received his BSc degree in Electronics from the University of Salford, UK in 1983, and his PhD in Computer Networks from the University of Salford, UK, in 1987. He was promoted to Professor of Telecommunications at the University of Salford, UK in 1997. His research interests include location based and context aware information systems, protocols, mobile systems and applications of networking technology in areas such as energy and building monitoring.
