
2009 World Congress on Computer Science and Information Engineering

Bloom Filter Based Index for Query over Encrypted Character Strings in Database

Lianzhong Liu, Jingfen Gai


Key Laboratory of Beijing Network Technology
School of Computer Science and Engineering
Beihang University, Beijing, 100191 China
[email protected], [email protected]

Abstract

Efficient indexes are important for queries over encrypted character strings in a database, and only a few approaches to this problem have been worked out. In this paper, a Bloom filter based index that supports fuzzy queries over encrypted character data is proposed, built on the principle of the two-phase query. First, a triple is used to express a character string; then a Bloom filter compression algorithm is applied to this triple to build an encoded index, which is saved in the database as a numeric value. Optimal parameters are selected by balancing the tradeoff between security and efficiency. In this way, the scheme minimizes the number of non-matching records returned, using a simple bitwise AND operation on the numeric index. Finally, the improved query performance is demonstrated by experimental results.

1. Introduction

Advances in networking and internet technologies have elevated the significance of data security in information systems. Databases, as the core of information systems, are threatened by various attacks. Even if many protection mechanisms are used, such as network perimeter defense, DBMS access control, auditing or user authentication, sensitive data in a database are still vulnerable to attack. Database encryption is a straightforward solution for protecting data privacy and integrity. It meets data confidentiality requirements and has become an indispensable aspect of enterprise database security [1]. Many studies adopt the two-phase query method to improve query efficiency, in which constructing an index for the encrypted data is the key to performance.

We perform fuzzy queries over character data effectively by using a Bloom filter as the encrypted index in our encryption scheme. First, we construct a triple to express a character string as a set, build a Bloom filter on it, and convert the Bloom filter into a numeric value that serves as the index of the encrypted data. We analyze the parameters that affect query performance, such as the index length m, the number of hash functions k, and the length of the character string. A suitable length m for the index attribute and a suitable number of hash functions k are selected after balancing the tradeoff between security and efficiency. The experimental results show that the query performance of our scheme improves on both the pairs coding method and the traditional method.

2. Related work

H. Hacigumus et al. proposed a bucket index to support range queries over numeric data, and combined it with a privacy homomorphism technique to support arithmetic computation [4]. Bijit Hore et al. further optimized the bucket index method by studying how to partition the buckets so as to trade off security against query performance [5]. These index-based methods are supported by the DBMS and focus on query performance at the cost of storage space.

Zhengfei Wang et al. proposed a pairs coding function to support fuzzy queries over encrypted character data [6][7]. The pairs coding method encodes every two adjacent characters in sequence and converts the original string directly into another characteristic string with a hash function. This method cannot handle single-character pattern-matching queries, and may perform badly on long character strings. Paper [8] proposed a characteristic matrix to express a string; the matrix is then compressed into a binary string, which is used in the query. Every character string needs a matrix of size 259 × 256, which takes much memory and leads to much computation; in addition, the index is more than a hundred bits long, which is not suitable for storage in a database.

978-0-7695-3507-4/08 $25.00 © 2008 IEEE
DOI 10.1109/CSIE.2009.979
A scheme to disclose entries that match a boolean query is proposed in [9]. By using a Bloom filter as an encoded index, the scheme reduces the number of comparisons and the size of the matching data, at the cost of exactness. The space efficiency of the Bloom filter makes it very appealing in network applications, such as file search in P2P networks, packet classification, trajectory sampling, en-route filtering of false data in the network, fast hash table lookup and many other applications [11].

3. The Encryption Scheme

Our proposed scheme extends the two-phase query framework [3]. As shown in Figure 1, a middle layer is added between the application and the DBMS. Every application request is processed according to its SQL type, using metadata in the Security Dictionary. The Store Module turns an insert statement into a corresponding one that carries the encrypted data together with the index for the encrypted data. The Query Translator splits an original query over unencrypted relations into (1) a coarse query over the encrypted relations and (2) an exact query for post-processing the results of the coarse query. The Querying Result Filter decrypts the encrypted result and then executes the exact query to obtain the query result. The purpose of this design is to implement encrypted storage and efficient queries over encrypted data in a way that is transparent to both the DBMS and the application. The essence of the two-phase query is to filter out as many records that do not meet the query conditions as possible by checking the index field, so the index determines the filtering efficiency.

Figure 1. Architecture of storage and query of encrypted data (diagram labels: Application, Store Module, Query Translator, Querying Result Filter, Security Dictionary, Key, Mapping, Index Construction, Querying-condition, encrypt, decrypt, Plain Data, Encrypted Data, Index)

The construction of the index should follow two principles: first, the index should filter out false records efficiently; second, it should be safe enough not to leak the true value. Our encryption scheme works at the column level: for every sensitive column we keep an encrypted column and an index column, and the encrypted column is used for equality queries. Moreover, a DBMS has many data types, and there are many types of queries. For example, character data needs fuzzy queries, numeric data needs range queries, and date data needs queries with "between…and". No single index can support all of these computations, so we build different types of indexes according to the data type. Large quantities of string data are used in many applications, and fuzzy queries are used frequently. How to execute a fuzzy query efficiently over encrypted string data is the focus of this paper.

4. Construct Index for Character String

In this section, we first introduce a triple that extracts the characteristics of a string and then build a Bloom filter on it. We convert the m-bit Bloom filter into a numeric value for storage in the database, since matching character strings is usually not as fast as matching numeric data [10]. Lastly, we analyze the parameters that affect query performance.

4.1. Triple Construction

The objective of constructing a triple is to express a character string as a set that preserves the characteristics of the string and has as few elements as possible. The triple defined below expresses the characters in the string and the relations between the characters.

Definition 1 (Triple for a character string): For a string s = 'c1 c2 … cn', let w denote the set of words obtained by splitting s at blanks, commas, full stops and the wildcard characters {'%', '_'}; let u denote the set of characters in w; and let r describe the relationship between adjoining characters of the words in w. The triple comprises these three sets: t = <w, u, r>.

For example, for the string s = 'olympic games' we get w = {olympic, games}, u = {o, l, y, m, p, i, c, g, a, e, s} and r = {ol, ly, ym, mp, pi, ic, ga, am, me, es}.

The elements of u and r are obtained from the words in w. Unlike the pairs coding method, we do not take elements from the character string directly. Long strings that contain independent words are common, and if we split such strings directly, some elements would contain blanks or other punctuation and would be meaningless. We improve on this by splitting the character string into separate words first and then splitting each word into elements. In this way, we minimize the number of elements in the subsets u and r.
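Definition 1 can be illustrated with a minimal Python sketch. The function name `build_triple` and the parameter `size` (the number of adjoining characters per element of r, two as in the example) are names chosen here for illustration and do not come from the paper.

```python
import re

# Delimiters named in Definition 1: blank, comma, full stop,
# and the wildcard characters '%' and '_'.
_DELIMITERS = re.compile(r"[ ,.%_]+")


def build_triple(s, size=2):
    """Return the sets (w, u, r) for the string s."""
    # w: the words left after splitting at the delimiters (empty pieces dropped)
    w = {word for word in _DELIMITERS.split(s) if word}
    # u: every character that occurs in some word
    u = {ch for word in w for ch in word}
    # r: every run of `size` adjoining characters inside a single word
    r = {word[i:i + size] for word in w for i in range(len(word) - size + 1)}
    return w, u, r


w, u, r = build_triple("olympic games")
print(sorted(w))  # ['games', 'olympic']
print(sorted(u))
print(sorted(r))
```

Because r is built word by word, no element spans the blank between "olympic" and "games", which is the difference from splitting the raw string that the text describes.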
Subset r is the one used most often, so it is important to choose a proper r for a character string. We split each word into successive pieces: every run of len adjacent characters is taken as an element of the set r, so the size of r is

$N_r = \sum_{w_i \in w} \left( \mathrm{length}(w_i) - len + 1 \right)$.

For the string 'olympic games' with len = 2, N_r = (7 − 2 + 1) + (5 − 2 + 1) = 10, which matches the ten elements of r listed for it. The larger len is, the better the relationship between characters is expressed, but len also restricts the match pattern: we cannot execute a query with a match pattern shorter than len. When len is 1, u = r and a match of any length is possible, but the relationship between the characters in s cannot be expressed. If len equals 2, single-character match queries can still be supported by u. So len = 2 is appropriate.

Figure 2. Process of constructing index (flow: String s = "a1 a2 … an" → Triple Construction → compress u & r with the Bloom Filter Algorithm → m-bit array filled with '0' or '1' → turn into numeric $\sum_{i=1}^{m} h_i \cdot 2^{i}$)
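The flow in Figure 2, together with the bitwise filter that Section 4.2 builds on it, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the paper does not name its hash functions, so salted SHA-256 is used as a stand-in, bit positions are numbered from 0, and the helper names (`elements`, `positions`, `bloom_index`, `may_match`) are hypothetical. For the same reason the sketch does not reproduce the index value 480497654 quoted in Section 4.2.

```python
import hashlib
import re


def elements(s, size=2):
    """u and r of the triple: single characters plus runs of `size` characters per word."""
    words = [w for w in re.split(r"[ ,.%_]+", s) if w]
    u = {ch for w in words for ch in w}
    r = {w[i:i + size] for w in words for i in range(len(w) - size + 1)}
    return u | r


def positions(element, m, k):
    """k bit positions in {0, ..., m-1} for one element.

    ASSUMPTION: SHA-256 salted with the function number stands in for the
    k independent hash functions, which the paper leaves unspecified.
    """
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{element}".encode()).digest(), "big") % m
        for i in range(k)
    ]


def bloom_index(s, m=31, k=4):
    """Numeric index: the m-bit Bloom filter of u and r read as one integer."""
    value = 0
    for element in elements(s):
        for pos in positions(element, m, k):
            value |= 1 << pos  # set bit `pos` (0-based numbering is an assumption)
    return value


def may_match(stored_index, pattern, m=31, k=4):
    """Coarse filter: every bit of the pattern index must be set in the stored index."""
    value = bloom_index(pattern, m, k)
    return (stored_index & value) == value


row = bloom_index("express deposits are unusual requests")  # made-up comment text
print(may_match(row, "%unusual%"))  # True: a Bloom filter has no false negatives
```

A `True` result only means the record survives the coarse query; the exact query on the decrypted value still has to confirm the match, because unrelated strings can set the same bits.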
4.2. Bloom filter based Index

A Bloom filter is a simple randomized data structure that answers membership queries with no false negatives and a small false positive probability [11]. It is space-efficient, using an m-bit array to represent a set S = {s1, s2, …, sn}. Initially, all bits in the array are set to 0. A Bloom filter uses k independent random hash functions h1, …, hk with range {0, …, m−1}. For each element s ∈ S, the bits hi(s) are set to 1 for 1 ≤ i ≤ k. To check whether an element x is in S, we check whether all bits hi(x) are set to 1. If not, then clearly x is not a member of S; if all hi(x) are set to 1, we assume that x is in S, with some false positive probability [12].

We have prepared a triple for a character string, so we can build a Bloom filter on u and r simultaneously, as shown in Figure 2. Using the Bloom filter compression algorithm, every element in the set is encoded into multiple positions of the binary string, which are marked '1'. The character string is thus mapped to an m-bit array of '0's and '1's, which is then stored in the database as a numeric index. In this way, a fuzzy query over character data is equivalent to a membership query over the Bloom filter, and the latter can be completed with a simple '&' operation on numeric data, without full-table decryption.

Assume the condition of a query is "WHERE Ai LIKE 'matchStr'", where matchStr is the match string containing wildcard characters such as '%', '_', '[]' and '^'. We first turn matchStr into a triple, in which w is split at blanks, commas, full stops and wildcard characters, and then compute its Bloom filter index, value. The condition thus becomes "WHERE $A_i^S$ & value = value". Take the ORDERS relation as an example: we build an index for the sensitive column o_comment with len = 2, m = 31 and k = 4. When the query is "o_comment like '%unusual%'", we get the index value 480497654 with the same parameters. The converted query condition is then "bitand(e_comment, 480497654) = 480497654".

4.3. Analysis of query efficiency

The key to our query method is to minimize the number of false records returned. The performance is determined by the false positive probability of the Bloom filter, which in turn depends on several factors. The probability f that an element not in the set is taken as a member is

$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$ [12]

where n is the number of elements, m is the size of the index in bits and k is the number of hash functions used. A small n and a large m both give a small f; this is why the set representing a character string should contain few elements when the triple is constructed. In addition, f is minimized when $k = (\ln 2) \cdot \frac{m}{n}$ [12], in which case the false positive probability is $f = \left(\frac{1}{2}\right)^{k} = (0.6185)^{m/n}$. In practice, k must be an integer, and a smaller k may be preferred since it reduces the amount of computation.

When the query condition is "LIKE matchStr", the probability that a non-matching record is returned is $f^{SIZE(matchStr)}$, where SIZE(matchStr) is the number of elements in matchStr. Hence the larger matchStr is, the smaller the probability of returning a false record.

When k = 1, the scheme becomes the pairs coding function, so pairs coding can be regarded as a special case of the Bloom filter. By the conclusion above, the pairs coding function has $f = 1 - \left(1 - \frac{1}{m}\right)^{n} \approx 1 - e^{-n/m}$ with k = 1. The closer $(\ln 2) \cdot \frac{m}{n}$ is to 1, the better the false positive probability is. So if n is large, a larger m is needed, or else the false positive probability will be high.

5. The experiment and the analysis

The purpose of our experiments is to verify the effect of the Bloom filter parameters and to test the query performance. The database is created automatically according to the TPC-H benchmark at scale = 1 using the tool Benchmark Factory for Databases; we use the ORDERS and PART tables as experimental data. The encryption algorithm is DES with a key length of 128; the programming language is Java; the environment is Windows XP on a P4 2.66 CPU with 1G RAM, and the database is Oracle 10g.

In the first group of experiments, we test how the Bloom filter parameters, including the index length m, the number of hash functions k and the length of the character string, affect the false query probability. We use the following definition of false positive proportion.

Definition 2 (False positive proportion): Assume the number of tuples returned in the first phase of the query is n1 and the number of tuples satisfying the original query is n2, with n1 ≥ n2. Then the false positive proportion is $f_Q = \frac{n_1 - n_2}{n_1}$.

Note that f_Q is different from f: f is the probability that an element not in the set is taken as a member, whereas f_Q is the proportion of false records returned, although f_Q reflects f. We obtain f_Q by executing the fuzzy query over the encrypted relation and recording n1 and n2.

We first verify the relation between f_Q and m on the o_comment column of the ORDERS table. As Figure 3 shows, f_Q gets smaller as m gets larger.

Figure 3. The effect of m on false positive proportion (plot: k = 1; x-axis: m = 17, 24, 31, 61 for like '%string%'; y-axis: false query proportion; series: %request%, %unusual%, %special%, %furiously%)

Figure 4 shows queries on the encrypted ORDERS table, where the encrypted column is o_comment. The average length of this column is estimated to be less than 50, and the false query probability is smallest when k = 1. Figure 5 shows queries over the PART table, where the p_container column is encrypted and has an average length of less than 10; there, f_Q approaches zero when k = 2. This validates the conclusion that the closer k gets to $(\ln 2) \cdot \frac{m}{n}$, the smaller the false positive probability. As the two graphs also show, character strings in PART are shorter than those in ORDERS, and the false positive proportion is smaller for the same query condition, including the values of m and k and the match string.

Figure 4. The effect of k on false positive proportion (1) (plot: m = 31, order.o_comment; x-axis: k = 1, 2, 3, 4; y-axis: false query proportion; series: %request%, %unusual%, %special%, %furiously%)

In the second group of experiments, we compare our method with other methods. Figure 6 shows queries on ORDERS.o_comment; the four points on the x-axis, from left to right, represent 'request', 'unusual', 'special' and 'furiously'. We compare the pairs coding method with the Bloom filter at k = 1, which shows that splitting a string into words is effective. We also compare our method with the extended pairs coding method and the traditional method. This experiment is done on the column PART.p_container, and the six points represent 'BAG', 'BOX', 'PKG', 'CAN', 'PACK' and 'DRUM' respectively. The result shows the improvement in query performance, as shown in Figure 7.
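The dependence of f on m, n and k from Section 4.3, and the proportion f_Q of Definition 2, can be tabulated with a short Python sketch. The element counts n used below are hypothetical values picked for illustration, not measurements from the experiments, and the function names are chosen here rather than taken from the paper.

```python
import math


def false_positive(m, n, k):
    """Exact form of f from Section 4.3: (1 - (1 - 1/m)**(k*n))**k."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def best_integer_k(m, n, k_max=4):
    """Integer k in 1..k_max with the smallest f (k_max=4 mirrors the range in Figures 4 and 5)."""
    return min(range(1, k_max + 1), key=lambda k: false_positive(m, n, k))


def false_positive_proportion(n1, n2):
    """f_Q from Definition 2: share of first-phase tuples that fail the exact query."""
    return (n1 - n2) / n1


m = 31
for n in (10, 20, 40):  # hypothetical element counts, not measured values
    ideal = math.log(2) * m / n
    k = best_integer_k(m, n)
    print(f"n={n}: (ln 2)*m/n={ideal:.2f}, best k={k}, f={false_positive(m, n, k):.3f}")

print(false_positive_proportion(100, 25))  # 0.75
```

With m fixed at 31, the best integer k drops from 2 to 1 as n grows from 10 to 40 in this sketch, which illustrates why shorter strings tolerate more hash functions than longer ones.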

Figure 5. The effect of k on false positive proportion (2) (plot: m = 31, part.p_container; x-axis: match strings 1 to 6 for like %string%; y-axis: false query proportion; series: k = 1, 2, 3, 4)

Figure 6. Effect of splitting the string into w (plot: k = 1, m = 31; x-axis: match strings 1 to 4 for order.comment like %String%; y-axis: false query proportion; series: bloom filter, pairs coding)

Figure 7. Comparison of different index methods (plot: part.container like %string%; x-axis: match strings 1 to 6; y-axis: false query proportion; series: bloom filter k = 2, extended pairs coding k = 3, extended pairs coding k = 2, pairs coding k = 1, full-table decrypt)

6. Conclusion

We have proposed a Bloom filter based index to support fuzzy queries over encrypted character strings. The index allows queries over the whole encrypted database to be performed with a small computation cost. We also analyzed the parameters of the Bloom filter and determined them according to the length of the sensitive data so as to minimize the false positive probability. Experimental results show that performance improves compared with conventional queries and the pairs coding method.

References

[1] Gang Chen, Ke Chen and J. X. Dong, "A Database Encryption Scheme for Enhanced Security and Easy Sharing", Proceedings of the 10th International Conference on Computer Supported Cooperative Work in Design, Nanjing, China, 2006, pp. 1-6.
[2] H. Hacigumus, B. Iyer and S. Mehrotra, "Providing Database as a Service", Proceedings of the International Conference on Data Engineering (ICDE), San Jose, USA, 2002, pp. 29-38.
[3] H. Hacigumus, B. Iyer, Chen Li and S. Mehrotra, "Executing SQL over encrypted data in the database service provider model", In ACM SIGMOD Conference, New York, USA, 2002, pp. 216-227.
[4] H. Hacigumus, B. Iyer and Sharad Mehrotra, "Efficient execution of aggregation queries over encrypted relational databases", In the Proceedings of Database Systems for Advanced Applications (DASFAA), Jeju Island, Korea, 2004, pp. 125-136.
[5] Bijit Hore, Sharad Mehrotra and Gene Tsudik, "A Privacy-Preserving Index for Range Queries", Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004, pp. 720-731.
[6] Zheng-Fei Wang, Jing Dai, Wei Wang and Bai-Le Shi, "Fast Query over Encrypted Character Data in Database", Communications in Information and Systems, 2004, Vol. 4, No. 4, pp. 289-300.
[7] Zheng-Fei Wang, Wei Wang and Bai-Le Shi, "Storage and Query over Encrypted Character and Numerical Data in Database", The Fifth International Conference on Computer and Information Technology, Shanghai, China, 2005, pp. 591-595.
[8] Hong Zhu, Jing Cheng and Renchao Jin, "Execution Query over Encrypted Character Strings in Databases", Frontier of Computer Science and Technology, Wuhan, China, 2007, pp. 90-97.
[9] Yasuhiro Ohtaki, "Partial Disclosure of Searchable Encrypted Data with Support for Boolean Queries", Availability, Reliability and Security (ARES08), Barcelona, Spain, 2008, pp. 1083-1090.
[10] Yong Zhang, Wei-xin Li and Xia-mu Niu, "A Method of Bucket Index over Encrypted Character Data in Database", Intelligent Information Hiding and Multimedia Signal Processing, Kaohsiung, 2007, pp. 186-189.
[11] Jehoshua Bruck, Jie Gao and Anxiao Jiang, "Weighted Bloom Filter", Information Theory, Seattle, WA, 2006, pp. 2304-2308.
[12] Michael Mitzenmacher, "Compressed Bloom Filter", IEEE/ACM Transactions on Networking, Vol. 10, No. 5, 2002, pp. 604-612.
