A Hybrid Approach of Privacy Preserving Data
(ICIMIA 2017)
Abstract—In this era of growing technology, organizations that collect data are required to preserve the privacy of the individuals concerned. Techniques such as anonymization and randomization are used to achieve this goal. Unfortunately, anonymization leads to a certain level of information loss while preserving privacy. The drift of data to centralized servers has further increased the demand for privacy preservation in order to maintain the trust of individuals. The data of applications such as online shopping, patient records and insurance is nowadays stored online in a centralized repository, and the privacy of the individuals must be maintained. In all these applications the data contains many fields such as email address, zip code, age, nationality etc. Quasi-identifiers such as zip code, age and gender of a person do not seem very important to protect, but when linked with other attributes these fields can expose the identity or sensitive information of an individual. Thus the quasi-identifiers need special scrutiny for the purpose of achieving privacy. The proposed hybrid approach, combining suppression and perturbation for Privacy Preserving Data Mining, takes care of these requirements. The method preserves privacy by suppressing and perturbing the quasi-identifiers in the data of online shopping customers stored on a centralized data repository, without causing any loss of information in the process. The method aims to overcome the limitation of information loss while preserving privacy. The experiment is carried out by setting up a local server on the system, and the simulation results are compared with anonymization to show that the goal of achieving privacy of quasi-identifiers without information loss is successfully achieved.

Keywords— Privacy Preserving Data Mining, quasi-identifiers, anonymization, perturbation, suppression

I. INTRODUCTION
In this modern era of thriving technology, the data being collected by private as well as public organizations is escalating day by day. This in turn leads to the obligation of transferring this data online to a centralized server while sustaining the trust of the individuals. The collected data is used for various analysis or decision-making purposes through data mining. But today's generation is more conscious about their privacy being preserved whenever their data is used in any way. Privacy here means that the identity of the person is not divulged while unveiling any sort of data or using the data for any research or business purposes. Thus Privacy Preserving Data Mining is a real challenge these days.

For most applications that require data mining for analysis purposes, such as hospital, insurance and online customers, the data is stored in a columnar way. The attributes can be divided into the following categories [10]:
i. Identifying attributes: Attributes such as name and email id that can explicitly identify the person.
ii. Quasi-identifier attributes: Attributes such as age, gender and zip code that, when linked with some other database or attributes, can easily reveal a person's identity.
iii. Sensitive attributes: Data which should not be disclosed or published against a person's identity. For example, while analyzing the sale of a particular product in online shopping, the customer's identity should not be revealed against any product.
iv. Non-sensitive attributes: Fields which, if disclosed publicly, do not lead to any problem.

To achieve Privacy Preserving Data Mining, the techniques being used are [11]:
i. Anonymization: It removes the personally identifiable information from the data sets by generalizing the attributes.
ii. Perturbation: Perturbation refers to distorting the data with the help of noise. This noise can be categorized into two types: additive and multiplicative noise.
iii. Randomization: Here the records are shuffled vertically in such a way that the semantic meaning of the record in the attribute is not distorted; only the vertical position of the record is changed, hiding the correct identity.
iv. Cryptography: The sensitive information and identity are preserved with the help of encryption of the records.
v. Condensation: In the condensation process, privacy is achieved by forming the clusters in such a way that the
size of each cluster is different from that of any other cluster.
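The role of quasi-identifiers and of anonymization by generalization, both listed above, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records, the names, the helper functions `link` and `generalize`, and the generalization rules (10-year age bands, 3-digit zip prefix) are hypothetical and are not taken from the paper.

```python
# Hypothetical released shopping data: names removed, quasi-identifiers left intact.
shopping = [
    {"age": 34, "gender": "F", "zip": "411001", "product": "phone"},
    {"age": 27, "gender": "M", "zip": "560034", "product": "laptop"},
]

# Hypothetical publicly available list that carries names.
public = [
    {"name": "Asha", "age": 34, "gender": "F", "zip": "411001"},
    {"name": "Ravi", "age": 27, "gender": "M", "zip": "560034"},
    {"name": "Meera", "age": 36, "gender": "F", "zip": "411045"},
]

QUASI_IDENTIFIERS = ("age", "gender", "zip")


def link(released, public_rows):
    """Return (name, product) pairs where exactly one public row
    shares the quasi-identifier values of a released row."""
    index = {}
    for row in public_rows:
        key = tuple(row[k] for k in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(row["name"])
    hits = []
    for row in released:
        names = index.get(tuple(row[k] for k in QUASI_IDENTIFIERS), [])
        if len(names) == 1:  # unambiguous match means re-identification
            hits.append((names[0], row["product"]))
    return hits


def generalize(row):
    """Anonymize by generalization (assumed rules): 10-year age band, 3-digit zip prefix."""
    out = dict(row)
    low = (row["age"] // 10) * 10
    out["age"] = f"{low}-{low + 9}"
    out["zip"] = row["zip"][:3] + "***"
    return out


exact_hits = link(shopping, public)
generalized_hits = link([generalize(r) for r in shopping],
                        [generalize(p) for p in public])
print(exact_hits)
print(generalized_hits)
```

With exact values both customers are linked to a name. After generalization the first customer becomes indistinguishable from another person in the public list, but the exact age and zip code are no longer available to the analyst, which is the information loss that generalization introduces.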
These methods have proved successful in achieving privacy to a certain level but still suffer from many limitations. It is of the highest priority to maintain the trust of the individuals that their information and identity will not be revealed. Privacy Preserving Data Mining techniques can prove useful and efficient in gaining this trust by preserving the identity and privacy of the individuals. The application selected here for research purposes is the data of online shopping customers. The attributes are categorized as:

TABLE I. DESCRIPTION OF ATTRIBUTES
Attribute       Category
Name            Identifying
Age             Quasi identifier
Gender          Quasi identifier
Zip code        Quasi identifier
Product name    Sensitive

The quasi-identifiers are most vulnerable to the linkage attack. If the original values of these attributes are published unaltered, then even when the name or other personal information is hidden, they can give an idea of a customer's identity with the help of some other publicly available database, by linking the values of these attributes in the two databases [9]. For example, suppose there are two applications stored on the web, online shopping customers and publicly available e-Aadhaar data or a voting list, and both contain the three fields age, gender and zip code. When the values of these fields from the customer dataset are located and linked with the values in the Aadhaar dataset, the customer's identity can be inferred with little effort.

Fig 1: Linkage attack in quasi identifiers

Although these attributes seem very simple and do not contain any critical information on their own, the linkage attack attracts attackers and fraudsters to steal and misuse the information of individuals. Because of this linkage attack, hiding and securing the original data in this type of attribute is a critical issue. To preserve the identity of customers it is necessary to hide the original values in the quasi-identifier attributes.

The rest of the paper is organized as follows: related work in section 2, anonymization in section 3, proposed method in section 4, technology used in section 5, simulation in section 6, result and analysis in section 7 and conclusion in section 8.

II. RELATED WORK
Xiaolin Zhang and Hongjing Bi [1] proposed a method of random perturbation for Privacy Preserving Data Mining. The authors secured the data by replacing the attribute values with code values (1, 2, 3, …, n) and arranging these values in a square matrix, which was then randomly perturbed. The data is extracted from these matrices using if-then rules after pruning with post-pruning algorithms. The method helps in improving accuracy and achieving better data mining results. The best part of the method is that the accuracy increases as the amount of data increases.

Khaled M. Khan [2] discussed the trust issues in the cloud environment and the reasons for them. Nobody trusts a system that leaves little control in their hands and offers no transparency about how data is stored. It is human nature to feel safe with an in-house system. Any organization can gain its customers' trust by providing access control to the individuals and a guarantee of compensation for any loss or data leakage.

R. Mahesh and T. Meyyappan [3] proposed a method to achieve privacy preservation through generalization of quasi-identifiers by setting them into ranges and deleting duplicate records. This approach of duplicate record elimination helps in reducing information loss and gives better performance in terms of privacy gain when compared to k-anonymity or l-diversity. The method also gives protection from two types of attack, record linkage and attribute linkage. The only problem with the method is that it works well only with numeric data.

P. Usha et al. [4] came up with a method based on the categorization of attributes into four groups and then using non-homogeneous anonymization, i.e. generalization or suppression, only on the quasi-identifiers. The motive behind this method was to reduce the information loss caused by homogeneous anonymization. The proposed method can be beneficial for Privacy Preserving Data Publishing as it achieves a high degree of data utility and data integrity.

Xuyun Zhang et al. [5] presented a two-phase top-down specialization approach for anonymizing the data. The authors used a specialization approach instead of generalization to solve the privacy concerns and achieve the scalability requirement for large datasets. A health dataset is used for data analysis and other experimental purposes. The results and simulation showed that scalability could be achieved with the discussed method.

Xuyun Zhang et al. [6] proposed a hybrid approach for improving the scalability of the anonymization technique over the cloud. The authors combined the specialization and
[Flowchart (partial): Start; Collect the online customer dataset; Categorize the attributes into four categories.]

VI. SIMULATION
The implementation is done using the local server Apache Tomcat 7.0 installed on Windows to create a virtual centralized-server environment. The local server requires 4 GB RAM, Windows 7, 8 or 8.1 and Java 8 to be installed on the machine. A driver, UCanAccess, is required to
The mathematical formula used to calculate the information loss of the perturbed data in the reverse process is defined as:

$A_{loss} = [\,\text{original value(age)} - \text{new value(age)}\,]$
$\ldots = [\,\ldots - \text{new value(Zip code)}\,]$
[Remainder of the equation not recoverable: … = ∑ … + ∑ …]
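As a hedged illustration of the per-attribute loss defined above (original value minus the value obtained after the reverse process), the following Python sketch perturbs two quasi-identifiers reversibly and then measures the loss. The additive secret offsets, the `Record` type and the function names are assumptions made for this example only; the paper's actual suppression and perturbation rules are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    age: int
    zip_code: int


# Hypothetical secret offsets known only to the data owner (not from the paper).
AGE_OFFSET = 17
ZIP_OFFSET = 4321


def perturb(record: Record) -> Record:
    """Hide the original quasi-identifier values with an additive offset."""
    return Record(record.age + AGE_OFFSET, record.zip_code + ZIP_OFFSET)


def reverse(record: Record) -> Record:
    """Undo the perturbation to retrieve the original values."""
    return Record(record.age - AGE_OFFSET, record.zip_code - ZIP_OFFSET)


def attribute_loss(original_value: int, new_value: int) -> int:
    """Loss for one attribute value: original value minus new value."""
    return original_value - new_value


original = [Record(34, 411001), Record(27, 560034)]
published = [perturb(r) for r in original]
recovered = [reverse(r) for r in published]

# Sum the per-record losses for each attribute after the reverse process.
a_loss = sum(attribute_loss(o.age, n.age) for o, n in zip(original, recovered))
z_loss = sum(attribute_loss(o.zip_code, n.zip_code) for o, n in zip(original, recovered))
print(a_loss, z_loss)
```

Because the perturbation in this sketch is exactly invertible, both sums are zero after the reverse process, while the published values differ from the originals.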
V. TECHNOLOGY USED
The experiment is carried out by running a local server on a Windows system. A virtual internet-like environment is set up and the data is displayed in the web browser with the help of the local server Apache Tomcat, the way it would be displayed on the server side. Apache Tomcat is open-source software that implements the Java Servlet and JavaServer Pages specifications and is used to run Java-based web projects. Java and JavaScript are used for the development and implementation of the proposed approach. MS-Access is used to store the original as well as the perturbed data on the machine.

Fig 5: Values after reverse of approach

Thus the information loss in proposed method:

… = ( … + … + … ) × 100 / ( … + … + … )
Execution Time:
Execution time for all three techniques is measured in milliseconds (ms).
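One way to obtain such millisecond timings is sketched below in Python. The helper `time_ms` and the placeholder workload `dummy_technique` are hypothetical and stand in for whichever privacy technique is being measured; the measurement procedure actually used in the experiment is not described in this section.

```python
import time


def time_ms(func, *args, repeats=5):
    """Return the best wall-clock time of func(*args) over `repeats` runs, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = (time.perf_counter() - start) * 1000.0  # seconds to ms
        best = min(best, elapsed)
    return best


def dummy_technique(values):
    """Placeholder workload standing in for a privacy technique (hypothetical)."""
    return [v + 1 for v in values]


elapsed_ms = time_ms(dummy_technique, list(range(10_000)))
print(f"{elapsed_ms:.3f} ms")
```

Taking the minimum over several runs reduces the influence of unrelated system activity on the reported figure.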
process of the approach. Fig. 3 above shows that there is zero information loss in the process of retrieving the original data again. Fig. 4 shows that the execution time for the hybrid approach is significantly less than that of the anonymization approach. Thus the main purpose of Privacy Preserving Data Mining without information loss is achieved by this hybrid technique in significantly less execution time.

VIII. CONCLUSION
The paper proposes a hybrid Privacy Preserving Data Mining technique using suppression and perturbation in a centralized server environment. The simulation and result analysis show that the method has been successful to a great extent in hiding the identity of customers and preserving their privacy. The original values of the data can also be retrieved by performing the reverse process, so there is no information loss. The critical issue of securing the Boolean gender field without information loss is resolved by the described algorithm. The major challenge of information loss in the process of privacy preservation has been successfully addressed. Also, the execution time for achieving privacy is lower for the new hybrid approach. The proposed method resolves the critical conflict between privacy preservation and information loss.

REFERENCES
[1] Zhang, X., & Bi, H. (2010, October). Research on privacy preserving classification data mining based on random perturbation. In Information Networking and Automation (ICINA), 2010 International Conference on (Vol. 1, pp. V1-173). IEEE.
[2] Khan, K. M., & Malluhi, Q. (2010). "How can cloud providers earn their customers' trust when a third party is processing data and technologies used to address these challenges". IEEE Computer Society, September/October 2010.
[3] Mahesh, R., & Meyyappan, T. (2013, February). Anonymization technique through record elimination to preserve privacy of published data. In Pattern Recognition, Informatics and Mobile Engineering (PRIME), 2013 International Conference on (pp. 328-332). IEEE.
[4] Usha, P., Shriram, R., & Sathishkumar, S. (2014, February). Sensitive attribute based non-homogeneous anonymization for privacy preserving data mining. In Information Communication and Embedded Systems (ICICES), 2014 International Conference on (pp. 1-5). IEEE.
[5] Zhang, X., Yang, L. T., Liu, C., & Chen, J. (2014). A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. Parallel and Distributed Systems, IEEE Transactions on, 25(2), 363-373.
[6] Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., & Chen, J. (2014). A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud. Journal of Computer and System Sciences, 80(5), 1008-1020.
[7] Arora, D. K., Bansal, D., & Sofat, S. (2014). Comparative Analysis of Anonymization Techniques. International Journal of Electronic and Electrical Engineering, ISSN 0974-2174, Volume 7, Number 8, pp. 773-778.
[8] Ushida, M., Itoh, K., Katayama, Y., Kozakura, F., & Tsuda, H. (2013, September). A Proposal of Privacy-Preserving Data Aggregation on the Cloud Computing. In Network-Based Information Systems (NBiS), 2013 16th International Conference on (pp. 141-148). IEEE.
[9] Malik, M. B., Ghazi, M. A., & Ali, R. (2012, November). Privacy preserving data mining techniques: current scenario and future prospects. In Computer and Communication Technology (ICCCT), 2012 Third International Conference on (pp. 26-32). IEEE.
[10] Zhu, J. (2009, August). A new scheme to privacy-preserving collaborative data mining. In Information Assurance and Security, 2009. IAS'09. Fifth International Conference on (Vol. 1, pp. 468-471). IEEE.
[11] Suriyapriya, M., & Joicy, A. Attribute Based Encryption with Privacy Preserving in Clouds. International Journal on Recent and Innovation Trends in Computing and Communication, ISSN 2321-8169, Volume 2, Issue 2.
[12] Zhan, J., & Matwin, S. (2006, December). A crypto-based approach to privacy-preserving collaborative data mining. In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on (pp. 546-550). IEEE.