
International Conference on Innovative Mechanisms for Industry Applications

(ICIMIA 2017)

A Hybrid Approach of Privacy Preserving Data Mining using Suppression and Perturbation Techniques
Arshveer Kaur
Computer Science Department
PEC University of Technology
Chandigarh, India

Abstract—In this era of growing technology, organizations that collect data are required to preserve the privacy of individuals. Techniques such as anonymization and randomization are used to achieve this goal, but anonymization leads to a certain level of information loss while preserving privacy. The drift of data to centralized servers has added to the demand for privacy preservation in order to maintain the trust of individuals. The data of applications such as online shopping, patient records, and insurance is now stored online in a centralized repository, and the privacy of the individuals must be maintained. In all of these applications the data contains many fields, such as email address, zip code, age, and nationality. Quasi-identifiers such as a person's zip code, age, and gender do not seem very important to protect, but when linked with other attributes these fields can expose the identity or sensitive information of an individual. The quasi-identifiers therefore need special scrutiny when privacy is the goal. The proposed hybrid approach, which combines suppression and perturbation for Privacy Preserving Data Mining, addresses these requirements. The method preserves privacy by suppressing and perturbing the quasi-identifiers in the data of online shopping customers stored in a centralized data repository, without causing any loss of information in the process. It aims to overcome the limitation of information loss while preserving privacy. The experiment is carried out by setting up a local server on the system, and the simulation results are compared with anonymization to show that privacy of the quasi-identifiers is achieved without information loss.

Keywords—Privacy Preserving Data Mining, quasi-identifiers, anonymization, perturbation, suppression

978-1-5090-5960-7/17/$31.00 ©2017 IEEE 306

I. INTRODUCTION
In this modern era of thriving technology, the data collected by private as well as public organizations is growing day by day. This in turn creates the obligation to transfer the data online to a centralized server while sustaining the trust of the individuals. The collected data is used for various analysis and decision-making purposes through data mining. But today's generation is more conscious about their privacy being preserved whenever their data is used. Privacy here means that the identity of a person is not divulged when any sort of data is unveiled or used for research or business purposes. Privacy Preserving Data Mining is thus a real challenge these days.

For most applications that require data mining for analysis, such as hospitals, insurance, and online customers, the data is stored in a columnar way. The attributes can be divided into the following categories [10]:
i. Identifying attributes: Attributes such as name and email id that explicitly identify the person.
ii. Quasi-identifying attributes: Attributes such as age, gender, and zip code that can easily reveal a person's identity when linked with some other database or attributes.
iii. Sensitive attributes: Data that should not be disclosed or published against a person's identity. For example, while analyzing the sale of a particular product in online shopping, the customer's identity should not be revealed against any product.
iv. Non-sensitive attributes: Fields that do not lead to any problem if disclosed publicly.

The techniques used to achieve Privacy Preserving Data Mining are [11]:
i. Anonymization: Removes personally identifiable information from the data sets by generalizing the attributes.
ii. Perturbation: Distorts the data with the help of noise, which can be either additive or multiplicative.
iii. Randomization: The records are shuffled vertically so that the semantic meaning of the record in the attribute is not distorted; only the vertical position of the record changes, hiding the correct identity.
iv. Cryptography: The sensitive information and identity are preserved by encrypting the records.
v. Condensation: Privacy is achieved by forming clusters in such a way that the size of each cluster is different from that of any other cluster.

These methods have proved successful in achieving privacy to a certain level but still suffer from many limitations. It is of the highest priority to maintain the trust of individuals that their information and identity are not revealed. Privacy Preserving Data Mining techniques can prove useful and efficient in gaining that trust by preserving the identity and privacy of the individuals. The application selected here for research purposes is the data of online shopping customers. The attributes are categorized as:

TABLE I. DESCRIPTION OF ATTRIBUTES
Attribute       Category
Name            Identifying
Age             Quasi identifier
Gender          Quasi identifier
Zip code        Quasi identifier
Product name    Sensitive

The quasi-identifiers are the most vulnerable to the linkage attack. If the original values of these attributes are published correctly, then even when the name or other personal information is hidden, they can give an idea of a customer's identity with the help of some other publicly available database, by linking the values of these attributes in the two databases [9]. For example, suppose two applications are stored on the web, online shopping customers and publicly available e-Aadhaar data or a voting list, and both contain the three fields age, gender, and zip code. When the values of these fields in the customer dataset are located and linked with the values in the Aadhaar dataset, the customer's identity can be inferred with little effort.

Fig 1: Linkage attack in quasi identifiers

Although these attributes seem very simple and do not contain any critical information on their own, the linkage attack attracts attackers and fraudsters who steal and misuse the information of individuals. Because of this linkage attack, hiding and securing the original data in this type of attribute is a critical issue. To preserve the identity of customers it is necessary to hide the original values in the quasi-identifier attributes.

The rest of the paper is organized as follows: related work in section 2, anonymization in section 3, the proposed method in section 4, technology used in section 5, simulation in section 6, result analysis in section 7, and conclusion in section 8.

II. RELATED WORK
Xiaolin Zhang and Hongjing Bi [1] proposed a method of random perturbation for Privacy Preserving Data Mining. The authors secured the data by replacing the attribute values with code values (1, 2, 3, …, n) and arranging these values in a square matrix, which was then randomly perturbed. The data is extracted from these matrices using if-then rules after pruning with post-pruning algorithms. The method helps in improving accuracy and achieving better data mining results. The best part of the method is that the accuracy increases as the amount of data increases.

Khaled M. Khan [2] discussed the trust issues in the cloud environment and the reasons for them. Nobody trusts a system that leaves them with the least control and no transparency about how the data is stored. It is human nature to feel safe with an in-house system. Any organization can gain its customers' trust by providing access control to the individuals and by guaranteeing compensation for any loss or data leakage.

R. Mahesh and T. Meyyappan [3] proposed a method to achieve privacy preservation through generalization of the quasi-identifiers into ranges and deletion of duplicate records. This duplicate record elimination helps in reducing information loss and gives better performance in terms of privacy gain when compared to k-anonymity or l-diversity. The method also protects against two types of attack, record linkage and attribute linkage. The only problem with the method is that it works well only with numeric data.

P. Usha et al. [4] came up with a method based on categorizing the attributes into four groups and then using non-homogeneous anonymization, i.e. generalization or suppression, only on the quasi-identifiers. The motive behind this method was to reduce the information loss caused by homogeneous anonymization. The method can be beneficial for Privacy Preserving Data Publishing as it achieves a high degree of data utility and data integrity.

Xuyun Zhang et al. [5] presented a two-phase top-down specialization approach for anonymizing the data. The authors used specialization instead of generalization to address the privacy concerns and meet the scalability requirement for large datasets. A health dataset is used for data analysis and other experimental purposes. The results and simulation showed that scalability could be achieved with the discussed method.

Xuyun Zhang et al. [6] proposed a hybrid approach for improving the scalability of the anonymization technique over the cloud. The authors combined the specialization and generalization approaches and compared the efficiency and scalability with k-anonymity. They used TDS and BUG, i.e. Top-Down Specialization and Bottom-Up Generalization, to overcome the scalability limitation of k-anonymity. The generalization approach causes information loss while improving scalability.

Dilpreet Kaur, Divya Bansal and Sanjeev Sofat [7] presented a comparative study of the anonymization techniques. After implementing the techniques on different data sets, the authors drew several inferences, for example that information loss increases as the number of attributes increases, showing that the information loss is directly proportional to the number of attributes to be anonymized. T-Closeness has less information loss than L-Diversity and K-Anonymity, but these techniques still lead to extensive information loss.

Mebae Ushida and Kouichi Itoh [8] made a proposal for privacy-preserving data aggregation to fulfil the major requirements of the cloud, which are a guarantee against leakage of stored information and the provision of aggregation results according to authority. The data is masked by the user before it is stored on the cloud, and masked aggregation results are then obtained. Users get the aggregation results by unmasking the masked results with their own secret private keys, assigned according to the authority of the user.

III. ANONYMIZATION
Under anonymization, a data set is k-anonymized when each record is indistinguishable from at least k-1 other records with respect to the quasi-identifiers. Generalization of the data is usually used to anonymize the original data. The process of anonymization is carried out with respect to the quasi-identifiers [12]. All the quasi-identifiers are suppressed with the same character or represented in the form of the same intervals for the purpose of generalization, to hide the original information of the fields. The generalization process suffers from high information loss when trying to obtain the original data, and the original values of the records can never be retrieved again. The figures below show the anonymized data achieved by generalizing the records.

Fig 1: Original data

Fig 2: Anonymized data

IV. PROPOSED METHOD
The quasi-identifiers are the crucial attributes in the task of preserving privacy. Among these, the gender field, which has Boolean-type values, adds more difficulty to the challenge. Generalizing this field while anonymizing preserves privacy, but at the cost of information loss. Using the anonymization technique for privacy preservation generalizes the values in the attribute to a level where there is complete information loss and the original data cannot be regained. Achieving privacy by generalizing the numeric fields likewise becomes a hindrance in the reverse process. The proposed method is an initiative to preserve privacy in the online centralized data repository with minimum information loss. The method in this paper is a hybrid approach for Privacy Preserving Data Mining based on suppression and perturbation, implemented over the centralized server. The approach focuses on preventing identity from being revealed through the quasi-identifier attributes. The stepwise execution of the approach is explained below:
i. Remove the identifying field.
ii. Choose two numeric values.
iii. Suppress the gender field:
     Male → First value
     Female → Second value
iv. Calculate an appropriate additive noise value.
v. Perturb the age attribute:
     Age = Age + noise
vi. Find the value of the noise to be used for the zip code.
vii. Apply perturbation to the zip code:
     Zip code = Zip code * noise
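
The seven steps of the proposed method can be sketched in Python as a reversible transformation. This is a minimal illustration and not the paper's implementation, which is described as SQL queries run with Java and JavaScript. The field names, the `MaskingKey` container, the numeric codes chosen for gender, and the use of fixed integer noise values are all assumptions, since the source does not say how the two gender values or the noise values are chosen. Keeping the noise as a secret integer is what makes the reverse step exact in this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskingKey:
    """Hypothetical secret parameters; the paper does not specify how they are chosen."""
    male_code: int     # step ii: first numeric value
    female_code: int   # step ii: second numeric value
    age_noise: int     # step iv: additive noise for age
    zip_noise: int     # step vi: multiplicative noise for zip code (must be non-zero)


def mask(record: dict, key: MaskingKey) -> dict:
    """Apply steps i, iii, v and vii to one customer record."""
    masked = {k: v for k, v in record.items() if k != "name"}   # i. drop the identifier
    masked["gender"] = key.male_code if record["gender"] == "Male" else key.female_code
    masked["age"] = record["age"] + key.age_noise               # v. Age = Age + noise
    masked["zip_code"] = record["zip_code"] * key.zip_noise     # vii. Zip = Zip * noise
    return masked


def unmask(masked: dict, key: MaskingKey) -> dict:
    """Reverse the quasi-identifier masking; the removed name is not restored."""
    restored = dict(masked)
    restored["gender"] = "Male" if masked["gender"] == key.male_code else "Female"
    restored["age"] = masked["age"] - key.age_noise
    restored["zip_code"] = masked["zip_code"] // key.zip_noise  # exact for integer noise
    return restored


# Illustrative values only.
key = MaskingKey(male_code=17, female_code=42, age_noise=9, zip_noise=7)
customer = {"name": "A. Customer", "age": 34, "gender": "Female",
            "zip_code": 160012, "product": "Headphones"}

masked = mask(customer, key)
restored = unmask(masked, key)
```

In this sketch the zip code is held as an integer, which assumes codes without a leading zero. Anyone who holds the key can undo the masking, so the privacy of the published data rests on keeping the key values secret.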


Start
Collect the online customer dataset
Categorize the attributes into four categories
Remove the identifying attributes
Find suitable values with which to suppress the gender field
Perturb the age by adding the appropriate noise
Perturb the zip code using suitable multiplicative noise
Check the privacy preserved and the information loss
The new data set formed can be used for any decision-making or analysis purpose
End

Fig 3: Diagrammatic view of algorithm

The mathematical formula used to calculate the information loss of the perturbed data in the reverse process is defined as:

[Equation not recoverable from the source; only the fragments "( ) ℎ = + + = ∑ + ∑" survive.]

V. TECHNOLOGY USED
The experiment is carried out by running a local server on a Windows system. A virtual internet-like environment is set up, and the data is displayed in the web browser with the help of the local Apache Tomcat server, the way it is displayed on the server side. Apache Tomcat is open-source software used for the implementation of Java and JavaScript based projects, Java Server Pages, and Servlets. Java and JavaScript are used for the development and implementation of the proposed approach. MS Access is used to store the original as well as the perturbed data on the machine.

VI. SIMULATION
The implementation uses a local Apache Tomcat 7.0 server installed on Windows to create a virtual centralized-server environment. The local server requires 4 GB RAM, Windows 7, 8 or 8.1, and Java 8 to be installed on the machine. The UCanAccess driver must also be installed on the machine to provide connectivity with the database. The suppression and perturbation are achieved using SQL queries, with Java and JavaScript used to run the scripts on the server. The following figures show the difference between the original data and the perturbed data.

Fig 4: Perturbed data with new approach

Calculation of information loss:

$G_{loss} = [\,\text{original value(gender)} - \text{new value(gender)}\,]$

$Z_{loss} = [\,\text{original value(zip code)} - \text{new value(zip code)}\,]$

$A_{loss} = [\,\text{original value(age)} - \text{new value(age)}\,]$

Fig 5: Values after reverse of approach

Thus the information loss in the proposed method is:

$\text{Information loss} = \dfrac{(G_{loss} + Z_{loss} + A_{loss}) \times 100}{(\ldots + \ldots + \ldots)}$
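
As a rough illustration of this calculation, the following Python sketch compares original values with the values obtained from a reverse process. Because the formula is not fully legible in the source, the sketch substitutes a simpler mismatch-count measure: each per-attribute loss counts the records whose recovered value differs from the original, and the total is expressed as a percentage of all compared values. The function name, the field names, and the sample records are hypothetical, and the resulting numbers are not meant to reproduce the paper's tables.

```python
def information_loss_percent(original, recovered,
                             fields=("gender", "zip_code", "age")):
    """Percentage of quasi-identifier values that differ after the reverse process.

    Substitute mismatch-count measure (an assumption), not the paper's exact formula.
    """
    if len(original) != len(recovered):
        raise ValueError("record lists must have the same length")
    # One loss term per attribute, mirroring Gloss, Zloss and Aloss.
    losses = {
        f: sum(1 for o, r in zip(original, recovered) if o[f] != r[f])
        for f in fields
    }
    total_values = len(original) * len(fields)
    return 100.0 * sum(losses.values()) / total_values if total_values else 0.0


original = [
    {"gender": "Male", "zip_code": 160012, "age": 34},
    {"gender": "Female", "zip_code": 160014, "age": 27},
]

# An exact reverse process returns every value unchanged.
exact_reverse = [dict(r) for r in original]

# Generalized data cannot be reversed: only an interval or a masked value remains.
generalized = [
    {"gender": "*", "zip_code": "1600**", "age": "30-39"},
    {"gender": "*", "zip_code": "1600**", "age": "20-29"},
]

loss_exact = information_loss_percent(original, exact_reverse)
loss_generalized = information_loss_percent(original, generalized)
```

Under this measure, a reverse process that restores every value gives a loss of zero, while generalized records, whose original values cannot be recovered, count every compared value as lost.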


TABLE II. INFORMATION LOSS
Customers    Anonymization    Perturbation    Hybrid
100          83.53            1.15            0.0
500          82.72            1.91            0.0
1000         83.46            2.79            0.0
2500         85.02            1.40            0.0
4000         85.15            2.43            0.0
5000         85.21            1.47            0.0

Fig 6: Comparison of Information loss

Execution Time:
Execution time for all three techniques is measured in milliseconds (ms).

TABLE III. EXECUTION TIME
Customers    Anonymization    Perturbation    Hybrid
100          293              286             223
500          358              312             248
1000         445              426             414
2500         804              766             753
4000         1063             952             939
5000         1699             1588            1206

Fig 7: Comparison of Execution Time

Privacy Preserved:
The level of privacy preserved is measured as the number of characters preserved. The more characters preserved, the better the approach.

TABLE IV. NUMBER OF CHARACTERS PRESERVED IN EACH APPROACH
Customers    Anonymization    Perturbation    Hybrid
100          7369             7401            18116
500          36897            50746           86160
1000         73708            73904           178616
2500         184049           192150          518226
4000         256290           456255          740452
5000         329733           598458          926784

Fig 8: Graph for Privacy preserved

VII. RESULT ANALYSIS
The above figures show that the Boolean gender field, which is a challenge to hide, is successfully hidden by using two numeric values that vary from one dataset to another. The values of zip code and age are modified with the help of multiplicative and additive noise respectively, hiding the actual information of customers. The values of the zip code field are distorted to values that do not exist as zip codes. This shows that privacy is achieved to a great extent. While preserving privacy, there is an equal need for the entire data to be reversible to its original form for the sake of the users' trust. The bonus of the proposed method is that when these fields need to be reversed to their original values, they can be restored to the original, accurate values. There is no information loss in the reverse process of the approach. Fig. 6 shows that there is zero information loss in the process of retrieving the original data again. Fig. 7 shows that the execution time for the hybrid approach is less than that of the anonymization approach. Thus the main purpose of Privacy Preserving Data Mining without information loss is achieved by this hybrid technique in less execution time.

VIII. CONCLUSION
The paper proposes a hybrid Privacy Preserving Data Mining technique using suppression and perturbation over a centralized server environment. The simulation and result analysis show that the method has been successful to a great extent in hiding the identity of customers and preserving their privacy. The original values of the data can also be retrieved by performing the reverse process, so there is no information loss. The critical issue of securing the Boolean gender field without information loss is resolved by the described algorithm. The major challenge of information loss in the process of privacy preservation has been successfully addressed. The execution time for achieving privacy is also lower for the new hybrid approach. The proposed method resolves the critical conflict between privacy preservation and information loss.

REFERENCES
[1] Zhang, X., & Bi, H. (2010, October). Research on privacy preserving classification data mining based on random perturbation. In Information Networking and Automation (ICINA), 2010 International Conference on (Vol. 1, pp. V1-173). IEEE.
[2] Khaled M. Khan and Qutaibah Malluhi (2010). "How can cloud providers earn their customers' trust when a third party is processing data and technologies used to address these challenges". IEEE Computer Society, IEEE, September/October 2010.
[3] Mahesh, R., & Meyyappan, T. (2013, February). Anonymization technique through record elimination to preserve privacy of published data. In Pattern Recognition, Informatics and Mobile Engineering (PRIME), 2013 International Conference on (pp. 328-332). IEEE.
[4] Usha, P., Shriram, R., & Sathishkumar, S. (2014, February). Sensitive attribute based non-homogeneous anonymization for privacy preserving data mining. In Information Communication and Embedded Systems (ICICES), 2014 International Conference on (pp. 1-5). IEEE.
[5] Zhang, X., Yang, L. T., Liu, C., & Chen, J. (2014). A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. Parallel and Distributed Systems, IEEE Transactions on, 25(2), 363-373.
[6] Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., & Chen, J. (2014). A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud. Journal of Computer and System Sciences, 80(5), 1008-1020.
[7] Arora, D. K., Bansal, D., & Sofat, S. Comparative Analysis of Anonymization Techniques. In International Journal of Electronic and Electrical Engineering. ISSN 0974-2174, Volume 7, Number 8 (2014), pp. 773-778.
[8] Ushida, M., Itoh, K., Katayama, Y., Kozakura, F., & Tsuda, H. (2013, September). A Proposal of Privacy-Preserving Data Aggregation on the Cloud Computing. In Network-Based Information Systems (NBiS), 2013 16th International Conference on (pp. 141-148). IEEE.
[9] Malik, M. B., Ghazi, M. A., & Ali, R. (2012, November). Privacy preserving data mining techniques: current scenario and future prospects. In Computer and Communication Technology (ICCCT), 2012 Third International Conference on (pp. 26-32). IEEE.
[10] Zhu, J. (2009, August). A new scheme to privacy-preserving collaborative data mining. In Information Assurance and Security, 2009. IAS'09. Fifth International Conference on (Vol. 1, pp. 468-471). IEEE.
[11] M. Suriyapriya, A. Joicy. Attribute Based Encryption with Privacy Preserving In Clouds. International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169, Volume: 2, Issue: 2.
[12] Zhan, J., & Matwin, S. (2006, December). A crypto-based approach to privacy-preserving collaborative data mining. In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on (pp. 546-550). IEEE.
