


International Conference on Emerging Trends in Electrical, Electronics and Communication
Technologies-ICECIT, 2012

Privacy Preserving Data Stream Classification Using Data Perturbation Techniques
¹Hitesh Chhinkaniwala*, ¹Kiran Patel, ²Sanjay Garg
¹U V Patel College of Engineering, Kherva, Mehsana, 382711, India
²Institute of Technology, Nirma University, Ahmedabad, 382481, India
*Corresponding author. Tel.: +0-942-690-5105. E-mail address: [email protected]

Abstract

A data stream can be conceived of as a continuous and changing sequence of data that continuously arrives at a system to be stored or processed. Examples of data streams include computer network traffic, phone conversations, web searches, and sensor data. These data sets need to be analyzed to identify trends and patterns, which help in isolating anomalies and predicting future behavior. However, data owners or publishers may not be willing to reveal the true values of their data exactly, most notably for privacy reasons. Hence, some amount of privacy preservation needs to be applied to the data before it can be made publicly available. To preserve data privacy during data mining, the issue of privacy-preserving data mining has been widely studied and many techniques have been proposed. However, existing techniques for privacy-preserving data mining are designed for traditional static data sets and are not suitable for data streams, so privacy preservation for data stream mining is a pressing need. This paper focuses on describing methods that extend data perturbation to data streams in order to achieve privacy preservation. The classification characteristics of the original and perturbed data streams produced by the proposed algorithms have been evaluated in terms of information loss, response time, and privacy gain.

Keywords: Data Perturbation, Sliding Window, Orthogonal Matrix, Decision Tree, Hoeffding Tree;

1. Introduction

In the field of information processing, data mining refers to the process of extracting useful knowledge from large volumes of data. Data mining is widely applied in many areas, such as healthcare (including medical diagnostics, insurance claims analysis, and drug development), business, finance, education, sports and gambling, the stock market, retail, and telecommunications. Widely used data mining techniques in these application areas include clustering, classification, regression analysis, and association rule / pattern mining.
The data stream paradigm has recently emerged in response to the issues and challenges associated with continuous data [2]. Mining data streams is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping, continuous streams (flows) of information. Algorithms written for data streams can naturally cope with data sizes many times greater than memory and can be extended to real-time applications not previously tackled by machine learning or data mining. The assumption of data stream processing is that training examples can be briefly inspected during a single scan of the input data stream, that is, they arrive in a high-speed stream and must then be discarded to make room for subsequent examples. The algorithm processing the stream has no control over the order of the examples seen and must update its model incrementally as each example is inspected. An additional desirable property, the so-called anytime property, requires that the model is ready to be applied at any point between training examples. Traditional data mining approaches have been used in applications where persistent data are available and the generated learning models are static in nature. Statistical information about the data distribution can be known in advance because the entire data set is available before it is passed to the machine learning algorithm.
The task performed by the mining process is centralized and produces a static learning model. Nowadays, however, applications are emerging in the field of information processing that do not fit this data model [3]. Instead, information naturally occurs in the form of a sequence (stream) of data values. A data stream is a real-time, continuous, and ordered sequence of items. It is not possible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety. Likewise, queries over streams run continuously over a period of time and incrementally return new results as new data arrive.

2. Privacy Concern for Data Stream

Mining data streams is concerned with extracting knowledge structures represented in models and patterns in
non-stopping streams of information. The general process of data stream mining is depicted in Figure 1.

Figure 1. General Process of data stream mining

Motivated by the privacy concerns raised by data mining tools, a research area called privacy-preserving data mining has emerged. Verykios et al. [4] classified privacy-preserving data mining techniques along five dimensions: data distribution, data modification, data mining algorithms, data or rule hiding, and privacy preservation. In the dimension of data distribution, some approaches have been proposed for centralized data and some for distributed data. Du and Zhan [5] utilized the secure union, secure sum, and secure scalar product to prevent the original data of each site from being revealed during the mining process. At the end of the mining process, every site obtains the final result of mining the whole data. The disadvantage is that the approach requires multiple scans of the database and hence is not suitable for data streams, which flow in fast and require immediate responses. In the dimension of data modification, the confidential values of a database to be released to the public are modified to preserve data privacy. Adopted approaches include perturbation, blocking, aggregation or merging, swapping, and sampling. Agrawal and Srikant [6] used the random data perturbation technique to protect customer data and then constructed the decision tree. For data streams, because data are produced at different times, not only does the data distribution change over time, but the mining accuracy also decreases on perturbed data. Privacy preservation techniques can be classified into three categories, namely heuristic-based techniques, cryptography-based techniques, and reconstruction-based techniques. From the review of previous research, it can be seen that
existing techniques for privacy-preserving data mining are designed for static databases with an emphasis on data
security. These existing techniques are not suitable for data streams.
Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of model-specific data utility preserved, which is often measured by the loss of accuracy for data classification and data clustering. An ultimate goal for all data perturbation algorithms is to optimize the data transformation process by maximizing both the data privacy and the data utility achieved. Data privacy is commonly measured by the difficulty of estimating the original data from the perturbed data: given a data perturbation technique, the higher the difficulty of estimating the original values from the perturbed data, the higher the level of data privacy the technique supports. Data utility typically refers to the amount of mining-task- or model-specific critical information preserved about the data set after perturbation. Different data mining tasks, such as classification versus association rule mining, or different models for the same task, such as a decision tree versus a k-Nearest-Neighbor (kNN) classifier for classification, typically utilize different sets of properties of the data set.
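
As an illustration only (the privacy-gain metric reported later in this paper is not defined here), data privacy can be proxied by how far the perturbed values deviate from the originals, and utility loss by the drop in classification accuracy. The short Java sketch below uses these two simple proxies; the class and method names are our own.

    // Illustrative proxies only: privacy as the mean relative deviation of the
    // perturbed values from the originals, utility loss as the accuracy drop.
    public final class PerturbationMetrics {

        // Larger value => the original values are harder to estimate from the
        // perturbed ones, i.e., a higher level of data privacy.
        public static double meanRelativeDeviation(double[] original, double[] perturbed) {
            double sum = 0.0;
            for (int i = 0; i < original.length; i++) {
                sum += Math.abs(perturbed[i] - original[i]) / (Math.abs(original[i]) + 1e-9);
            }
            return sum / original.length;
        }

        // Model-specific utility loss: accuracy of a model built on the original
        // data minus accuracy of the same kind of model built on perturbed data.
        public static double utilityLoss(double accuracyOnOriginal, double accuracyOnPerturbed) {
            return accuracyOnOriginal - accuracyOnPerturbed;
        }
    }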

3. Privacy Preserving Data Stream Classification

3.1. Data stream classification cycle


A classification algorithm must meet several requirements in order to work under these assumptions and be suitable for learning from data streams. Figure 2 illustrates the typical use of a data stream classification algorithm and how the requirements fit in. The general model of data stream classification follows three steps in a repeating cycle [11]:
1. The algorithm is passed the next available example from the stream (requirement 1).
2. The algorithm processes the example, updating its data structures. It does so without exceeding the memory bounds set on it (requirement 2), and as quickly as possible (requirement 3).
3. The algorithm is ready to accept the next example. On request it is able to supply a model that can be used to predict the class of unseen examples (requirement 4).

Figure 2. Data stream classification cycle
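
The requirements of this cycle can be summarized by a minimal learner interface. The Java sketch below is illustrative only; the interface and method names are our own and do not correspond to any particular framework.

    // Illustrative interface for the classification cycle above; the names are
    // our own and do not belong to MOA or WEKA.
    public interface StreamClassifier<E> {

        // Requirement 1: accept the next available example from the stream.
        // Requirements 2 and 3: update the internal data structures within a
        // fixed memory bound and in a small, constant time per example.
        void trainOnExample(E example);

        // Requirement 4 (anytime property): a model usable for prediction must
        // be available between any two training examples.
        int predictClass(E example);
    }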

Yabo Xu et al. [7] considered the classification problem in which the training data are several private data streams. Joining all streams violates the privacy constraint of such applications and suffers from the blow-up of the join. With their technique, one can build exactly the same Naive Bayesian classifier as on the joined stream without exchanging private information. The processing cost is linear in the size of the input streams and independent of the join size; having a much lower processing time per input tuple, the method is able to handle much higher data arrival rates and deal with the general many-to-many join relationships of data streams, although limitations remain with respect to processing time and data arrival rate. Ching-Ming Chao et al. [8] proposed the Privacy-Preserving Classification of Data Streams (PCDS) method, claiming more accurate results and overcoming the drawbacks cited for [7]. They proposed a two-stage process: data stream preprocessing for data perturbation, and data stream mining. The Data Streams Preprocessing (DSP) algorithm offers higher security with less information loss. In the data stream mining stage, the Weighted Average Sliding Window (WASW) algorithm is used to mine the perturbed data streams. Experimental accuracy measurements showed that the error rate of the Very Fast Decision Tree learner (VFDT) increases steadily with the continuous arrival of the data stream, whereas the error rate of the WASW algorithm stays under the predetermined threshold value; the WASW algorithm therefore has higher accuracy. In conclusion, the PCDS method not only preserves data privacy but also mines data streams accurately.
Multiplicative perturbation has two basic forms of multiplicative noise, which have been studied by the statistics community [1]. One multiplies each data element by a random number that has a truncated Gaussian distribution with mean one and small variance. The other takes a logarithmic transformation of the data first, adds multivariate Gaussian noise, and then takes the exponential function exp(.) of the noise-added data. Neither of these perturbations preserves pairwise distances among data records.
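
For illustration, the two classical forms of multiplicative noise can be sketched in Java as follows; the variance parameter and the clamping used to truncate the Gaussian are our own simplifications, and positive data values are assumed for the logarithmic form.

    import java.util.Random;

    // Sketch of the two classical multiplicative-noise schemes described in [1].
    public final class MultiplicativeNoise {
        private static final Random RNG = new Random();

        // Form 1: multiply each value by a Gaussian factor with mean 1 and a
        // small standard deviation, truncated here by simple clamping.
        public static double gaussianFactor(double x, double sigma) {
            double factor = 1.0 + sigma * RNG.nextGaussian();
            factor = Math.max(factor, 0.01);   // crude truncation keeps the factor positive
            return x * factor;
        }

        // Form 2: take the logarithm, add Gaussian noise (one independent
        // component per value here), then exponentiate. Assumes x > 0.
        public static double logAddExp(double x, double sigma) {
            return Math.exp(Math.log(x) + sigma * RNG.nextGaussian());
        }
    }
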
To facilitate large-scale data mining applications, Liu et al. [13] proposed an approach in which the data is multiplied by a randomly generated matrix; in effect, the data is projected into a lower-dimensional random space. This technique preserves distances in expectation. Oliveira and Zaiane [14] and Chen and Liu [15] discussed the use of
random rotation for privacy preserving clustering and classification. These authors observed that the distance
preserving nature of random rotation enables a third party to produce exactly the same data mining results on the
perturbed data as if on the original data. However, they did not analyze the privacy limitations of random rotation.
Liu et al. [16] addressed the privacy issues of distance preserving perturbation (including rotation) by studying how
well an attacker can recover the original data from the transformed data and prior information. They proposed two
attack techniques: the first is based on basic properties of linear algebra and the second on principal component
analysis. Their analysis explicitly illuminated scenarios where privacy can be breached.

4. Problem Description

The initial idea is to extend traditional data mining techniques to work with perturbed stream data so that sensitive information is masked. The key issue is to obtain accurate stream mining results using the perturbed data. The solutions are often tightly coupled with the data stream mining algorithms under consideration.

Figure 3. Framework for privacy preserving in data stream classification

The goal is to transform a given data set D into a perturbed version D' that satisfies a given privacy requirement while losing minimal information for the intended data analysis task. In this paper, data perturbation algorithms have been proposed for perturbing data sets, and classification characteristics on the Hoeffding tree algorithm have been analyzed in terms of accuracy, response time, and privacy gain.

5. Data Perturbation Algorithms

5.1. Data Perturbation Using Sliding Window Concept


The data stream pre-processing stage uses a perturbation algorithm to perturb confidential data. Users can flexibly adjust the data attributes to be perturbed according to their privacy needs. Therefore, the threats and risks from releasing data can be reduced effectively.

Algorithm: Sliding Window based Data Perturbation


Input: Data set S (.ARFF or .CSV file)
Output: Perturbed Data set S’ (.ARFF or .CSV file)
Algorithm Steps:
1. Read the original data set S from file.
2. Display the set of attributes, with data types, and the total number of attributes in the data set S.
3. Display the total number of tuples in data set S.
4. Select a sensitive attribute (numerical only) from the set of available attributes.
5. Suppose the selected attribute is F*; then
a) Assign a window to attribute F* (the window stores received tuples in order of arrival).
b) If the window size is w (selected at run time), the window contains only w tuples of the selected attribute F*.
c) Find the mean of the w tuples in the window.
d) Replace the first tuple value of the window by the mean computed in step 5c.
e) The values of the remaining tuples in the window remain unchanged.
f) Pop off the perturbed value and append the next tuple to the window. The sliding window size remains the same.
6. Repeat steps 5a to 5f until all values of attribute F* are perturbed.
7. Store the perturbed data set S' in a new file.
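
A minimal Java sketch of the algorithm above is given below. It perturbs a single numeric attribute held in an array; reading and writing the .ARFF/.CSV file (steps 1-4 and 7) is omitted, and the handling of the last few tuples, which the algorithm leaves unspecified, is our own choice of simply letting the window shrink.

    // Sketch of steps 5a-5f: each value is replaced by the mean of the window
    // of size w that starts at that value.
    public final class SlidingWindowPerturbation {

        public static double[] perturb(double[] attribute, int w) {
            int n = attribute.length;
            double[] perturbed = new double[n];
            for (int i = 0; i < n; i++) {
                int end = Math.min(i + w, n);    // window covers positions i .. end-1
                double sum = 0.0;
                for (int j = i; j < end; j++) {
                    sum += attribute[j];         // remaining window values stay unchanged (5e)
                }
                perturbed[i] = sum / (end - i);  // mean replaces the first tuple, which is then
            }                                    // popped off as output (5c, 5d, 5f)
            return perturbed;
        }
    }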

5.2. Multiplicative Data Perturbation Using Rotation Perturbation

Multiplicative data perturbations include three types of perturbation techniques: Rotation Perturbation, Projection
Perturbation, and Geometric Perturbation. They all preserve (or approximately preserve) distance or inner product,
which are important to many classification and clustering models. As a result, the classification and clustering
models based on the perturbed data through multiplicative data perturbation show very similar performance to those
based on the original data. The main challenge for multiplicative data perturbations is thus how to maximize the desired data privacy, whereas many other data perturbation techniques focus on seeking a better trade-off between the level of data utility and accuracy preserved and the level of data privacy guaranteed.
Rotation perturbation is an orthogonal transformation-based data perturbation. Suppose the data owner has a private data set X, with each column of X being an attribute and each row a record. The data owner generates an m × m random orthogonal matrix R and computes G(X) = R*X. The perturbed data set G(X) is then released for future use.
Where
X - Original data set
R - Random orthogonal matrix (R^T R = R R^T = I)
G(X) - Perturbed data set
Because R is orthogonal, ||Ru - Rv|| = ||u - v|| and (Ru)·(Rv) = u·v for any vectors u and v, so the transformation preserves the norms, inner products, and pairwise Euclidean distances of the vectors it rotates.

Algorithm: Multiplicative Data Perturbation (Rotation perturbation)


Input: Data set X (.ARFF or .CSV file)
Output: Perturbed Data set X’ (.ARFF or .CSV file)
Algorithm Steps:
1) Read the original data set X from file.
2) Consider only the numeric attributes of data set X and call the result Xnum.
3) Select a slot of m rows (Xm) from Xnum for perturbation and treat it as a matrix. Perform the following steps:
a. If the matrix size is m×n, create an orthogonal matrix Rm×m.
b. Multiply the two matrices: G(X) = Rm×m * Xm×n. The resulting matrix G(X) has the same number of rows and columns as the original slot.
c. Replace Xm×n with G(X)m×n.
d. Select the next m rows (Xm) and repeat step 3.
e. Store the perturbed data set X' in a new file.
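
A compact Java sketch of this algorithm is shown below. The random orthogonal matrix R is obtained by Gram-Schmidt orthonormalization of a random Gaussian matrix (any construction satisfying R^T R = I would do); file handling and the treatment of a final slot with fewer than m rows are omitted.

    import java.util.Random;

    // Sketch of rotation perturbation: each slot of m rows of the numeric part
    // of the data set is replaced by G(X) = R * X, with R a random m x m
    // orthogonal matrix.
    public final class RotationPerturbation {
        private static final Random RNG = new Random();

        // Random orthogonal matrix via Gram-Schmidt on random Gaussian rows.
        static double[][] randomOrthogonal(int m) {
            double[][] r = new double[m][m];
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < m; j++) r[i][j] = RNG.nextGaussian();
                for (int k = 0; k < i; k++) {              // subtract projections on earlier rows
                    double dot = 0.0;
                    for (int j = 0; j < m; j++) dot += r[i][j] * r[k][j];
                    for (int j = 0; j < m; j++) r[i][j] -= dot * r[k][j];
                }
                double norm = 0.0;                         // normalize the row to unit length
                for (int j = 0; j < m; j++) norm += r[i][j] * r[i][j];
                norm = Math.sqrt(norm);
                for (int j = 0; j < m; j++) r[i][j] /= norm;
            }
            return r;
        }

        // Perturb one slot X of m rows and n numeric columns: G(X) = R * X.
        static double[][] perturbSlot(double[][] x, double[][] r) {
            int m = x.length, n = x[0].length;
            double[][] g = new double[m][n];
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < m; k++)
                        g[i][j] += r[i][k] * x[k][j];
            return g;
        }
    }
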
For model creation, the Hoeffding tree induction algorithm [9] has been used. Decision tree learning is one of the most effective classification methods. A decision tree is learned by recursively replacing leaves with test nodes, starting at the root. Each internal node contains a test on an attribute, each branch from a node corresponds to a possible outcome of the test, and each leaf contains a class prediction. In the traditional batch setting, all training data are stored in main memory, and repeatedly reading them from disk when learning complex trees is expensive; the goal here is therefore a decision tree learner that reads each example at most once and uses a small constant time to process it.
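
For completeness, the split decisions of the Hoeffding tree rest on the Hoeffding bound: after n independent observations of a random variable with range R (here the range of the split criterion, not the rotation matrix), the observed mean is, with probability 1 - delta, within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean. A leaf is split once the best attribute outscores the second best by more than epsilon, or once epsilon falls below the tie threshold. The small Java helper below is a sketch of this rule; delta corresponds to the split confidence and tau to the tie threshold used in the experiments.

    // Hoeffding bound used by the Hoeffding tree to decide when a leaf has
    // seen enough examples to split on its best attribute.
    public final class HoeffdingBound {

        // epsilon = sqrt(R^2 * ln(1/delta) / (2n))
        public static double epsilon(double range, double delta, long n) {
            return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
        }

        // Split when the best attribute beats the second best by more than
        // epsilon, or when epsilon has shrunk below the tie threshold tau.
        public static boolean shouldSplit(double bestGain, double secondBestGain,
                                          double epsilon, double tau) {
            return (bestGain - secondBestGain) > epsilon || epsilon < tau;
        }
    }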

6. Experimental setup and Results

We have conducted experiments to evaluate the performance of the data perturbation algorithms. For the experiments we use two data sets. The Agrawal data set is generated using Massive Online Analysis (MOA), an open-source framework for data stream mining [11], with 5K instances and 10 attributes. The Bank Marketing data set [12] is taken from the UCI repository; it relates to the direct marketing campaigns of a Portuguese banking institution and contains 45K instances and 17 attributes. We applied the sliding window perturbation algorithm with window sizes w = 2, w = 3, and w = 4, and the rotation perturbation algorithm with R = 5, R = 10, and R = 20. The WEKA (Waikato Environment for Knowledge Analysis) [10] data mining tool has been integrated with MOA to test the accuracy of the Hoeffding tree algorithm, with the split criterion set to InfoGain, the tie threshold set to 0.05, and the split confidence set to 0. The data perturbation algorithms have been implemented in Java and integrated within the MOA framework. The results in Table 2 and Table 3 show that privacy has been achieved with a little over 2% loss of information.
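
For reference, the classifier configuration described above can be outlined roughly as follows. This is only a sketch: it assumes MOA's moa.classifiers.trees.HoeffdingTree class and its splitCriterionOption, tieThresholdOption, and splitConfidenceOption fields, whose names and packages may differ between MOA releases, so it should not be read as the exact code used for the experiments.

    import moa.classifiers.trees.HoeffdingTree;

    // Rough outline of the Hoeffding tree configuration stated above; class and
    // option names are assumptions about the MOA release in use.
    public final class ExperimentSetup {

        public static HoeffdingTree configuredTree() {
            HoeffdingTree tree = new HoeffdingTree();
            tree.splitCriterionOption.setValueViaCLIString("InfoGainSplitCriterion"); // split criterion = InfoGain
            tree.tieThresholdOption.setValue(0.05);    // tie threshold = 0.05
            tree.splitConfidenceOption.setValue(0.0);  // split confidence = 0
            tree.prepareForUse();
            // The tree is then trained and evaluated on the original and the
            // perturbed streams within the MOA framework.
            return tree;
        }
    }
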
Table 2. SLIDING WINDOW BASED DATA PERTURBATION

  Data set                  Analysis                          Original data set   Perturbed data set (window size)
                                                                                  w = 2     w = 3     w = 4
  ADULT DATA SET            Time taken to build model (sec)   0.09                0.09      0.10      0.11
                            Correctly classified (%)          82.52               80.46     80.49     80.50
  BANK MARKETING DATA SET   Time taken to build model (sec)   0.20                0.17      0.13      0.13
                            Correctly classified (%)          89.04               88.48     82.32     86.12

Table 3. MULTIPLICATIVE DATA PERTURBATION USING ROTATION PERTURBATION

  Data set                  Analysis                          Original data set   Perturbed data set (rotation perturbation)
                                                                                  R = 5     R = 10    R = 20
  ADULT DATA SET            Time taken to build model (sec)   0.09                0.09      0.09      0.11
                            Correctly classified (%)          82.52               81.41     81.42     81.42
  BANK MARKETING DATA SET   Time taken to build model (sec)   0.20                0.14      0.13      0.14
                            Correctly classified (%)          89.04               87.23     87.23     88.40

7. Conclusion

An approach has been discussed for privacy-preserving classification of data streams which consists of two steps: data stream pre-processing and data stream mining. In the data stream pre-processing step, we proposed two data perturbation algorithms: data perturbation using the sliding window concept, and multiplicative data perturbation using rotation perturbation. Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of model-specific data utility preserved, which is often measured by the loss of accuracy for data classification. Using the proposed data perturbation algorithms, we generate different perturbed data sets, and in the second step we apply the Hoeffding tree algorithm to the perturbed data sets. We carried out a set of experiments to generate classification models for the original and perturbed data sets, and the classification results have been evaluated in terms of accuracy. The proposed algorithms can perturb sensitive attributes with numerical values. Two standard data sets have been perturbed and tested against the original classification results. The classification results on the perturbed data sets show that data privacy is achieved with minimal information loss.

References

1. J. J. Kim and W. E. Winkler, Multiplicative noise for masking continuous data, Statistical Research Division, U.S. Bureau of the
Census, 2003.
2. A. Bifet, G. Holmes, R. Kirkby and B. Pfahringer, Data Stream Mining-A Practical approach, 2011.
3. L. Golab and M. T. Ozsu, Data Stream Management Issues -A Survey Technical Report, 2003.
4. V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin and Y. Theodoridis, State-of-the-Art in Privacy Preserving Data Mining, ACM SIGMOD Record, Vol. 33, pp. 50-57, 2004.
5. W. Du and Z. Zhan, Building Decision Tree Classifier on Private Data, Proceedings of IEEE International Conference on Privacy
Security and Data Mining, pp. 1-8, 2002.
6. R. Agrawal and R. Srikant, Privacy-Preserving Data Mining, Proceedings of ACM SIGMOD International Conference on
Management of Data, pp. 439-450, 2000.
7. Y. Xu, K. Wang, A.W.Ch. Fu, R. She and J. Pei, Privacy-Preserving Data Stream Classification, pp. 489-510, Springer, 2008.
8. C. Chao, P. Chen and C. Sun, Privacy-Preserving Classification of Data Streams, Tamkang Journal of Science and Engineering, Vol.
12, No. 3, pp. 321-330, 2009.
9. A. Bifet, G. Holmes, R. Kirkby and B. Pfahringer, Data Stream Mining A Practical Approach, 2011.

10. The Weka Machine Learning Workbench. http://www.cs.waikato.ac.nz/ml/weka.


11. A. Bifet, R. Kirkby, P. Kranen, P. Reutemann, MOA: Massive Online Analysis Manual, Journal of Machine Learning Research
(JMLR), 2010.
12. S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology,
In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Portugal,
October, 2011.
13. K. Liu, H. Kargupta, and J. Ryan, Random projection-based multiplicative data perturbation for privacy preserving distributed data
mining, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 18, No. 1, pp. 92–106, 2006.
14. S. R. M. Oliveira and O. R. Zaïane, Privacy preservation when sharing data for clustering, In Proceedings of the International Workshop on Secure Data Management in a Connected World, Toronto, Canada, pp. 67-82, 2004.
15. K. Chen and L. Liu, “Privacy preserving data classification with rotation perturbation”, In Proceedings of the 5th IEEE International
Conference on Data Mining (ICDM’05), Houston, Texas, pp. 589–592, 2005.
16. K. Liu, C. Giannella, and H. Kargupta, An attacker’s view of distance preserving maps for privacy preserving data mining, In
Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06), Berlin,
Germany, pp. 297–308, 2006.
