Balancing Security Overhead and Performance Metrics Using A Novel Multi-Objective Genetic Approach
(1)
where $f_D(i)$ and $f_{D'}(i)$ represent the frequency of the $i$-th
item in the datasets $D$ and $D'$, respectively, and $n$ is the number of
distinct items in the original dataset $D$.
Miss Cost (MC) quantifies the percentage of the
non-restrictive patterns that are hidden as a side effect of
the sanitization process. It is computed as follows:
$$\mathrm{MC} = \frac{|R'_P(D)| - |R'_P(D')|}{|R'_P(D)|} \qquad (2)$$
where $R'_P(D)$ is the set of all non-sensitive rules in the
original database $D$ and $R'_P(D')$ is the set of all
non-sensitive rules in the sanitized database $D'$. There is a
trade-off between the miss cost and the hiding failure:
the more sensitive association rules that must be hidden,
the more legitimate association rules are expected to be missed.
Similar to the measure of miss cost, Side-Effect Factor
(SEF) is used to quantify the amount of non-sensitive
association rules that are removed as an effect of the
sanitization process. It is defined as follows:
$$\mathrm{SEF} = \frac{|P| - (|P'| + |R_P(D)|)}{|P| - |R_P(D)|} \qquad (3)$$
Artificial Patterns (AP) quantifies the percentage of the
discovered patterns that are artifacts. It is computed as
follows:
$$\mathrm{AP} = \frac{|P'| - |P \cap P'|}{|P'|} \qquad (4)$$
where $P$ is the set of association rules discovered in the
original dataset $D$ and $P'$ is the set of association rules
discovered in $D'$.
Hiding Failure (HF) quantifies the percentage of the
sensitive patterns that remain exposed in the sanitized
dataset. It is defined as the fraction of the restrictive
association rules that appear in the sanitized database
divided by the ones that appeared in the original dataset,
formally:
$$\mathrm{HF} = \frac{|R_P(D')|}{|R_P(D)|} \qquad (5)$$
where $R_P(D')$ corresponds to the sensitive rules
discovered in the sanitized dataset $D'$ and $R_P(D)$ to the
sensitive rules appearing in the original dataset $D$.
Ideally, the hiding failure should be 0%. The performance
metrics for privacy preserving association rule mining
algorithms are given in [12].
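The four rule-based metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: rules are represented by arbitrary hashable labels, the function name `hiding_metrics` is hypothetical, and the miss cost counts only those non-sensitive rules of D that survive in D' (an assumption; ghost rules are accounted for by AP).

```python
def hiding_metrics(P, P_prime, sensitive):
    """Compute HF, MC, AP and SEF from sets of rule labels.

    P         -- rules mined from the original dataset D
    P_prime   -- rules mined from the sanitized dataset D'
    sensitive -- rules declared sensitive
    Assumes at least one sensitive and one non-sensitive rule in P,
    and a non-empty P_prime (otherwise a denominator is zero).
    """
    P, P_prime, sensitive = set(P), set(P_prime), set(sensitive)
    r_p = P & sensitive          # sensitive rules present in D
    non_sensitive = P - r_p      # non-sensitive rules present in D
    hf = len(P_prime & sensitive) / len(r_p)
    # Assumption: only original non-sensitive rules that survive are counted.
    mc = (len(non_sensitive) - len(non_sensitive & P_prime)) / len(non_sensitive)
    ap = (len(P_prime) - len(P & P_prime)) / len(P_prime)
    sef = (len(P) - (len(P_prime) + len(r_p))) / (len(P) - len(r_p))
    return {"HF": hf, "MC": mc, "AP": ap, "SEF": sef}


# Hypothetical rule labels: "a" is sensitive, "g" is a ghost rule.
metrics = hiding_metrics(P={"a", "b", "c", "d", "e"},
                         P_prime={"b", "c", "g"},
                         sensitive={"a"})
```

In this toy case the sensitive rule is hidden (HF = 0), two of the four non-sensitive rules are lost (MC = 0.5), and one of the three rules mined from D' is an artifact (AP = 1/3).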
3. PROBLEM FORMULATION
A sample transaction database D taken from [13] is
shown in Table 1. TID is the unique transaction number.
A binary-valued item shows whether the item is present or
absent in that transaction. Suppose MST and MCT are
set to 50% and 70%, respectively. Table 2 shows the
sensitive rules satisfying MST, generated from the sample
database D.
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 410
The association rules satisfying MST and MCT,
generated by the Apriori algorithm [14], are:
, , , . Suppose the
rules and are specified as sensitive and
should be hidden in the sanitized database.
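As a worked illustration of how support and confidence select rules from Table 1, the sketch below enumerates candidate rules by brute force. It is not the Apriori algorithm itself (there is no level-wise pruning), and the names `support` and `mine_rules` are hypothetical. Each transaction is written as the set of item ids it contains.

```python
from itertools import combinations

# Table 1, one set of item ids per transaction (TID 0..4).
D = [{0, 1, 3}, {1}, {0, 2, 3}, {0, 1}, {0, 1, 3}]


def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)


def mine_rules(db, mst, mct):
    """Return (antecedent, consequent, support, confidence) tuples for all
    rules whose support >= mst and confidence >= mct (brute force)."""
    items = sorted(set().union(*db))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, db)
            if sup < mst:
                continue
            # Split the frequent itemset into antecedent and consequent.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    lhs = frozenset(lhs)
                    conf = sup / support(lhs, db)
                    if conf >= mct:
                        rhs = itemset - lhs
                        rules.append((tuple(sorted(lhs)), tuple(sorted(rhs)), sup, conf))
    return rules


rules = mine_rules(D, mst=0.5, mct=0.7)
```

With MST = 50% and MCT = 70%, this enumeration yields four rules on Table 1: 0→1, 1→0, 0→3 and 3→0.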
The problem of privacy preservation in association rule
mining (so-called association rule hiding) addressed
in this paper can be formulated as follows:
Given a transaction database D, a minimum support
threshold (MST), a minimum confidence threshold (MCT),
a set of significant association rules R mined from D,
and a set of sensitive rules to be hidden,
generate a new database D'
such that the non-sensitive rules in R can still be
mined from D' under the same MST and MCT,
where no non-sensitive rules are falsely hidden
(lost rules) and no spurious rules (ghost rules) are
mistakenly mined after the rule hiding process.
Table 1: Sample database D
TID | Items | Items (binary form)
0 | 0 1 3 | 1101
1 | 1 | 0100
2 | 0 2 3 | 1011
3 | 0 1 | 1100
4 | 0 1 3 | 1101
Table 2: Sensitive rules
R1
R2
4. PROPOSED SOLUTION
4.1 Security and Association Rule Mining Trade-off
The association rule hiding problem can be considered
a variation of the well-known database inference
control problem in statistical and multilevel databases.
The primary goal in database inference control is to guard
access to sensitive information that can be obtained
through non-sensitive data and inference rules. In
association rule hiding, it is not the data itself but
rather the sensitive association rules that breach privacy.
For the simplicity of presentation and without loss of
generality, we make the following assumptions in this
implementation:
We want to extract all association rules that satisfy the
minimum support threshold (MST) and the minimum
confidence threshold (MCT). Support is a measure of
the frequency of a rule. Confidence is a measure of
the strength of the relation between sets of items.
Association rule mining algorithms scan the database of
transactions and calculate the support and confidence of
the candidate rules to determine whether they are significant.
A rule is significant if its support and confidence are
higher than the user-specified minimum support and
minimum confidence thresholds. In this way,
algorithms do not retrieve all possible association rules
derivable from a dataset, but only the small
subset that satisfies the minimum support and minimum
confidence requirements set by the users.
The Apriori association rule mining algorithm works as
follows. It finds all the sets of items that appear frequently
enough to be considered relevant, and then derives from
them the association rules that are strong enough to be
considered interesting. The major goal here is to
prevent some of these rules, which we refer to as
"sensitive rules", from being revealed. We want to hide
association rules in the best way using a multi-objective
genetic algorithm. We are also interested in investigating
the performance metrics of association rule hiding: hiding
failure (HF), dissimilarity (DISS), artificial patterns (AP),
side-effect factor (SEF), and miss cost (MC).
Figure 1 presents the basic architecture of a database
system with the association rule mechanism.
Figure 1 Architecture of a database application with the
association rule procedure
4.2 Security and Association Rule Mining Trade-off using Optimization
In this paper we study the privacy breaches incurred by
certain types of association rules. We suppose that a
certain subset of the association rules extracted from a
specific dataset is considered sensitive (critical). Our
major goal is then to modify the original data source in
such a way that it is impossible for an adversary to mine
the sensitive rules from the modified dataset; our minor
goal is that all the remaining non-sensitive information
and knowledge remains as close as possible to that of the
original dataset.
The method developed in this paper takes a binary
transactional dataset as input and modifies it using a
genetic algorithm for privacy-preserving association rule
mining, searching for the best sanitization of the original
dataset by multi-objective optimization, such that all
sensitive rules are hidden and the original dataset is
modified as little as possible. The most common style of
transaction modification is distortion of the original database
(i.e., replacing 1s by 0s and vice versa). We adopt
this style of modification in our method. Modifying
the dataset causes several side effects.
The modification process can affect the original set of
rules that can be mined from the original database, either
by hiding rules that are not sensitive (lost rules) or by
introducing rules in the mining of the modified database
that were not supported by the original database (ghost
rules). We try to minimize these undesirable results
by minimal and suitable modification of the original dataset.
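To make the effect of distortion concrete, the following sketch toggles a single cell of the Table 1 database and recomputes itemset support. The helper names `flip` and `support` are illustrative only, and the choice of which cell to toggle is an arbitrary example rather than the output of the genetic algorithm.

```python
def flip(matrix, cells):
    """Return a copy of a binary matrix with the given (row, col) cells toggled."""
    out = [row[:] for row in matrix]
    for r, c in cells:
        out[r][c] = 1 - out[r][c]
    return out


def support(matrix, cols):
    """Fraction of rows in which every column in `cols` holds a 1."""
    return sum(all(row[c] for c in cols) for row in matrix) / len(matrix)


# Table 1 in binary form; column index = item id.
D = [
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]

# Remove item 3 from transaction 0 (a single 1 -> 0 distortion).
D_prime = flip(D, [(0, 3)])

before = support(D, [0, 3])        # itemset {0, 3} in D
after = support(D_prime, [0, 3])   # itemset {0, 3} in D'
```

Here the support of {0, 3} drops from 0.6 to 0.4, which is below an MST of 50%, so every rule built on that itemset disappears from D', whether or not it was declared sensitive. The support of {0, 1} is unchanged at 0.6.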
The steps of our work are shown in Figure 2.
Figure 2 Multi-objective privacy preserving (MOPP)
The following steps illustrate the methodology of the
proposed solution:
Step 1: Consider a transactional database with a set of
items and transactions.
Step 2: Write two external files, one for the original dataset
and one for the sensitive rules.
Step 3: Convert every chromosome to a double value,
store it in the population, then convert that value to a
binary value.
Step 4: Create the input file for the Apriori algorithm.
Step 5: Use the Apriori algorithm to find the frequent
itemsets based on the minimum support threshold.
Step 6: From the frequent itemsets, generate the set of
association rules based on the minimum support
and confidence thresholds.
Step 7: Select the sensitive rules from the set of
association rules.
Step 8: Read the association rules from the output file and
put them in a structure for comparison with the sensitive rules.
Step 9: Compare the association rules with the sensitive
rules to calculate Fitness Vector (1).
Step 10: Compare the chromosome with the original dataset
to calculate Fitness Vector (2).
Step 11: Use the genetic algorithm to modify the
items based on the fitness function.
Step 12: Repeat steps 5, 6 and 7 for the modified
dataset.
Step 13: Verify that (i) all the sensitive rules are hidden,
(ii) no non-sensitive rules are hidden, and (iii) no false
rules are generated.
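The genetic operators behind Steps 3 and 11 can be sketched as follows. This is only one plausible encoding, chosen for illustration: the chromosome is assumed to be the binary dataset flattened row by row, with one-point crossover and bit-flip mutation. The paper does not specify its operators at this level of detail, so the function names and the encoding are assumptions.

```python
import random


def encode(matrix):
    """Flatten a binary transaction matrix into one chromosome (list of bits)."""
    return [bit for row in matrix for bit in row]


def decode(chromosome, n_items):
    """Rebuild the transaction matrix from a chromosome."""
    return [chromosome[i:i + n_items] for i in range(0, len(chromosome), n_items)]


def crossover(a, b, rng):
    """One-point crossover: swap the tails of two equal-length parents."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate, rng):
    """Bit-flip mutation: toggle each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


# Table 1 in binary form, used as a toy individual.
D = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 1]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

parent_a = encode(D)
parent_b = mutate(parent_a, 0.1, rng)
child_1, child_2 = crossover(parent_a, parent_b, rng)
candidate = decode(child_1, n_items=4)  # a candidate sanitized dataset D'
```

Each decoded candidate would then be mined again (Steps 5 to 7) and scored with the two-element fitness vector before selection.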
To express these activities mathematically, the
multi-objective optimization problem can be formulated
as follows:
Find the vector $y = [y_1, y_2, \ldots, y_n]^T$ which satisfies the
$m$ inequality constraints and the $p$ equality constraints:
$$g_i(x) > 0, \quad i = 1, 2, \ldots, m \qquad (6)$$
$$h_i(x) = 0, \quad i = 1, 2, \ldots, p \qquad (7)$$
and optimizes (here we assume minimization) the vector
function:
$$f(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T \qquad (8)$$
where $x = [x_1, x_2, \ldots, x_n]^T$ is the vector of decision
variables, and the constraints given by equations (6) and
(7) define the feasible region $F$.
Traditional Technique
Convert the multi-objective optimization problem into
a single-objective problem, i.e., find one optimal solution by
combining the objectives through weighting:
$$F = w_1 f_1(x) + w_2 f_2(x) + \cdots + w_k f_k(x)$$
where $w_1 + w_2 + \cdots + w_k = 1$.
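The weighted-sum scalarization can be illustrated with a small sketch. The two candidate solutions and their objective values (hiding failure, dissimilarity) are made-up numbers for illustration, and the helper names are hypothetical.

```python
def weighted_sum(objectives, weights):
    """Collapse an objective vector into one scalar F = sum(w_i * f_i)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * f for w, f in zip(weights, objectives))


# Hypothetical candidates: (hiding failure, dissimilarity), both minimized.
candidates = {
    "A": (0.0, 0.30),   # hides everything, distorts more
    "B": (0.5, 0.05),   # hides half, distorts less
}


def best(weights):
    """Name of the candidate with the smallest weighted sum."""
    return min(candidates, key=lambda name: weighted_sum(candidates[name], weights))


choice_equal = best((0.5, 0.5))    # equal weights
choice_skewed = best((0.1, 0.9))   # dissimilarity weighted heavily
```

With equal weights candidate A wins, while shifting the weights to (0.1, 0.9) selects B instead. The single returned solution therefore depends entirely on weights that must be fixed in advance.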
Proposed Technique
Keep the problem as a multi-objective optimization
problem, i.e., find the Pareto optimal solutions.
We say that a vector of decision variables $y \in F$ is
Pareto optimal if there is no other $x \in F$ such that
$f_i(x) \le f_i(y)$ for all $i = 1, \ldots, k$ and $f_j(x) < f_j(y)$
for at least one $j$.
A vector $u = (u_1, u_2, \ldots, u_k)$ is said to dominate
$v = (v_1, v_2, \ldots, v_k)$ (denoted by $u \preceq v$) if and only if $u$ is
partially less than $v$, i.e.,
$$\forall i \in \{1, 2, \ldots, k\}: u_i \le v_i \;\wedge\; \exists i \in \{1, 2, \ldots, k\}: u_i < v_i.$$
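The dominance test and the resulting non-dominated set follow directly from the definition above. The sketch below assumes minimization of every objective; the objective vectors are hypothetical (hiding failure, dissimilarity) pairs.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and strictly better in one."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors (hiding failure, dissimilarity).
points = [(0.0, 0.30), (0.5, 0.05), (0.5, 0.30), (0.2, 0.10)]
front = pareto_front(points)
```

Only (0.5, 0.30) is dropped, because (0.0, 0.30) is at least as good in both objectives and strictly better in the first. The remaining three points are mutually non-dominated and form the set of solutions from which one can be picked according to policy.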
Our fitness vector consists of two elements:
$$f_1 = \text{Hiding Failure} = \frac{|R_{Sen}(D')|}{|R_{Sen}(D)|} \qquad (9)$$
$$f_2 = \text{Dissimilarity} = \frac{1}{\sum_{i=1}^{n} f_D(i)} \times \sum_{i=1}^{n} [f_D(i) - f_{D'}(i)] \qquad (10)$$
where $f_D(i)$ and $f_{D'}(i)$ represent the frequency of the $i$-th
item in the datasets $D$ and $D'$, respectively, and $n$ is the
number of distinct items in the original dataset $D$.
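A minimal sketch of the two-element fitness vector is given below, using the Table 1 database and a one-cell distortion as toy input. The rule labels are plain strings chosen for illustration, and the function names are hypothetical. One assumption deviates from the printed formula: the dissimilarity uses absolute differences so that 1→0 and 0→1 flips on the same item do not cancel each other.

```python
def item_frequencies(matrix):
    """Column sums of a binary transaction matrix: f(i) for every item i."""
    return [sum(column) for column in zip(*matrix)]


def dissimilarity(original, sanitized):
    """Equation (10), with absolute differences (an assumption, see lead-in)."""
    f_d = item_frequencies(original)
    f_dp = item_frequencies(sanitized)
    return sum(abs(a - b) for a, b in zip(f_d, f_dp)) / sum(f_d)


def hiding_failure(sensitive, rules_d, rules_d_prime):
    """Equation (9): share of sensitive rules of D still minable from D'."""
    sen_d = set(sensitive) & set(rules_d)
    return len(sen_d & set(rules_d_prime)) / len(sen_d)


def fitness_vector(original, sanitized, sensitive, rules_d, rules_d_prime):
    """Return (f1, f2) = (hiding failure, dissimilarity); both are minimized."""
    return (hiding_failure(sensitive, rules_d, rules_d_prime),
            dissimilarity(original, sanitized))


D = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 1]]
D_prime = [[1, 1, 0, 0], [0, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 1]]

f1, f2 = fitness_vector(
    D, D_prime,
    sensitive={"3->0"},                          # hypothetical sensitive rule
    rules_d={"0->1", "1->0", "0->3", "3->0"},    # rules mined from D
    rules_d_prime={"0->1", "1->0"},              # rules mined from D'
)
```

In this toy case one of the twelve 1-entries was changed, so f2 = 1/12, and the sensitive rule no longer appears in D', so f1 = 0.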
Further, we can choose from a menu of optimal solutions
according to our problem.
The main contributions are threefold: first, a new
algorithm for hiding sensitive association rules using a
multi-objective genetic algorithm together with a
modification of the old mathematical model; second,
achieving a balance between security and performance in
the database; and third, an evaluation of the hiding
performance of our work.
5. DISCUSSION AND EXPERIMENTAL RESULTS
5.1 Experimental Setup
The Congress Voting data set [15] includes the votes
of each of the U.S. House of Representatives
congressmen on the 16 key votes identified by the CQA.
The CQA lists nine different types of votes: voted for,
paired for, and announced for (these three simplified to
yea); voted against, paired against, and announced
against (these three simplified to nay); and voted present,
voted present to avoid conflict of interest, and did not vote
or otherwise make a position known (these three
simplified to an unknown disposition). Number of
instances: 435 (267 democrats, 168 republicans). Number
of attributes: 16 + class name = 17 (all Boolean-valued).
A sample of the transaction database D taken from [15] is
shown in Table 3. TID is the unique transaction
number. Suppose MST and MCT are set to 25% and
58%, respectively.
Table 3: Sample data set
TID | Class Name | handicapped-infants | water-project-cost-sharing | adoption-of-the-budget-resolution | physician-fee-freeze | el-salvador-aid | religious-groups-in-schools
1 | republican | n | y | n | y | y | y
2 | republican | n | y | n | y | y | y
3 | democrat | ? | y | y | ? | y | y
4 | democrat | n | y | y | n | ? | y
5 | democrat | y | y | y | n | y | y
6 | democrat | n | y | y | n | y | y
7 | democrat | n | y | n | y | y | y
8 | republican | n | y | n | y | y | y
9 | republican | n | y | n | y | y | y
10 | democrat | y | y | y | n | n | n
5.2 Association Rules Mining Methodology using Optimization
Table 4 shows the frequent rules satisfying MST, generated
from the sample database D. The number of association
rules satisfying MST and MCT generated by the Apriori
algorithm is 20. Suppose the rule
(el-salvador-aid=y 212 => religious-groups-in-schools=y 197)
is specified as sensitive and should be hidden in the
sanitized database. The transactions that contain the
sensitive items form the population, and the fitness
function is applied to the chromosomes of this population.
After applying the crossover and mutation operations, the
sensitive items of the original database are modified based
on the fitness function in order to preserve the privacy of
the database. After modification, the Apriori algorithm is
applied again to verify that all the sensitive rules are
hidden under the same support and confidence thresholds.
We then evaluate the performance and security metrics
(hiding failure, dissimilarity, artificial patterns,
side-effect factor, and miss cost).
Table 4: Best rules extracted from the original
dataset with MCT = 0.58 and MST = 0.25
No. | Rule
1 | adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 => Class Name=democrat
2 | adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 => Class Name=democrat
3 | physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 => Class Name=democrat 210
4 | physician-fee-freeze=n education-spending=n 202 => Class Name=democrat 201
5 | physician-fee-freeze=n 247 => Class Name=democrat 245
6 | Class Name=democrat el-salvador-aid=n 200 => aid-to-nicaraguan-contras=y 197
7 | el-salvador-aid=n 208 => aid-to-nicaraguan-contras=y 204
8 | el-salvador-aid=y 212 => religious-groups-in-schools=y 197
Table 5: Association rule evaluation performance results
Parameter | Result
HF | 0%
MC | 36%
AP | 27%
DISS | 26%
SEF | 4
As shown in Table 5 and Figure 3(a), the number of
sensitive rules in the sanitized dataset equals zero; most
privacy-preserving algorithms are designed
with the goal of obtaining zero hiding failure. Thus, we
hide all the patterns considered sensitive in the
original dataset. The number of non-sensitive patterns
discovered in the original database D differs from that
discovered in the sanitized database, because hiding the
sensitive patterns also removes some non-sensitive ones;
thus the MC is 36%, as shown in Figure 3(b). The
percentage of the discovered patterns that are artifacts
(AP) is 27%, as shown in Figure 3(c). The
dissimilarity (DISS) between the original and the
sanitized datasets is 26%, as shown in Figure 3(d). The
number of non-sensitive association rules
removed as an effect of the sanitization process is four, as
shown in Figure 3(e).
6. CONCLUSION
The drawback of the traditional techniques in [2, 3, 4, 5,
6] is that the weight values are unknown and must therefore
be assumed in advance. In addition, there is no guarantee
of hiding the sensitive rules with high performance. These
methods also add adjustable parameters that require
profound domain knowledge, which is usually not
available. Moreover, the solutions generated in
[2, 3, 4, 5, 6] are usually very sensitive to small changes
in these weights or penalty functions.
The proposed approach addresses the problem of
balancing security and performance metrics in a generic
way, since we optimize between hiding failure as the
security overhead and AP, DISS, SEF, and MC as the
database performance metrics.
The approach generates multiple solutions rather than
a single biased solution.
This approach could be used in a tailored fashion,
especially in military applications or in civilian
applications with dynamic policies concentrating on
security, performance, or both.
References
[1] D. Whitley, A genetic algorithm tutorial, Colorado
State University, 1994.
[2] M. Chirag, et al, An Efficient Solution for Privacy
Preserving Association Rule Mining, (IJCNS)
International Journal of Computer and Network
Security, Vol. 2, No. 5, May 2010.
[3] M. Dehkordi, A Novel Method for Privacy
Preserving in Association Rule Mining Based on
Genetic Algorithms, Journal of Software (JSW),
Vol. 4, No. 6, 2009.
[4] S. Wang, B. Parikh, A. Jafari, Hiding informative
association rule sets, Elsevier, Expert Systems
with Applications, pp. 316-323, 2006.
[5] S. Wang, D. Patel, et al, Hiding collaborative
recommendation association rules, Springer
Science+Business Media, LLC, 2007.
[6] S. Wang, R. Maskey, et al, Efficient sanitization of
informative association rules, ACM, Expert
Systems with Applications: An International Journal,
Volume 35, Issue 1-2, July 2008.
[7] G. Moustakides, V. Verykios, A maxmin approach
for hiding frequent itemsets, Data and Knowledge
Engineering, pp. 75-89, 2008.
[8] G. Moustakides, V. S. Verykios, A maxmin
approach for hiding frequent itemsets, In
Workshops Proceedings of the 6th IEEE
International Conference on Data Mining (ICDM),
pp. 502-506, 2006.
[9] X. Sun, P. Yu, Hiding sensitive frequent itemsets by
a border-based approach, Computing science and
engineering, pp. 74-94, 2007.
[10] A. Divanis, V. Verykios, An Integer Programming
Approach for Frequent Itemset Hiding, In Proc
ACM Conf Information and Knowledge
Management (CIKM 06), Nov. 2006.
[11] A. Divanis, V. Verykios, Exact Knowledge Hiding
through Database Extension, IEEE Transactions on
Knowledge and Data Engineering, vol. 21(5), pp.
699-713, May 2009.
[12] C. Aggarwal, P. Yu, Privacy-Preserving Data
Mining: Models and Algorithms, Springer,
Heidelberg, pp. 267-286, 2008.
[13] K. Duraiswamy, D. Manjula, Advanced Approach
in Sensitive Rule Hiding, Modern Applied Science,
Vol.3, no. 2, 2009.
[14] C. Clifton, M. Kantarcioglu, J. Vaidya, Defining
Privacy for Data Mining, In Proceedings US Nat'l
Science Foundation Workshop on Next Generation
Data Mining, pp. 126-133, 2002.
[15] J. Schlimmer, Concept acquisition through
representational adjustment, Doctoral dissertation,
Department of Information and Computer Science,
University of California, Irvine, CA. 1987