A Survey On Cluster Based Outlier Detection Techniques in Data Stream


Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications

Volume 5, Issue 1, June 2016, Page No.96-101


ISSN: 2278-2419

A Survey on Cluster Based Outlier Detection Techniques in Data Stream
S. Anitha 1, Mary Metilda 2
1 Research Scholar, Bharathiar University, Coimbatore, Tamil Nadu, India
2 Asst. Prof., Queen Mary's College, Chennai, Tamil Nadu, India
E-mail: [email protected], [email protected]

Abstract: In recent days, Data Mining (DM) is an emerging area of computational intelligence that provides new techniques, algorithms and tools for processing large volumes of data. Clustering is the most popular data mining technique today; it is used to separate a dataset into groups with high intra-group similarity and low inter-group similarity. Outlier (anomaly) detection is the task of finding small groups of data objects that are different when compared with the rest of the data, and it is an essential part of mining in data streams. A Data Stream (DS) is a continuous arrival of high-speed data items, and mining such streams plays an important role in fields such as telecommunication services, e-commerce, tracking of customer behaviour and medical analysis. Detecting outliers over data streams is an active research area. This survey presents an overview of the fundamental outlier detection approaches and of the various types of outlier detection methods for data streams.

Keywords: Clustering, Outlier Detection, Anomaly Detection, Data Stream

I. INTRODUCTION

Outlier (or anomaly) detection refers to the process of finding patterns in data that do not conform to expected normal behaviour [2]. It is an essential problem for many applications such as credit card fraud detection, insurance, risk analysis, weather prediction, medical diagnosis, network intrusion detection for cyber security, detection of novelties in images, military surveillance for enemy activities and many other research areas. Hawkins [2] defines an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. An outlier has also been defined as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data". Aggarwal states that "outliers may be considered as noise points lying outside of a set of defined clusters, or outliers may be considered as the points that lie outside of the set of clusters but are also separated from the noise" [1].

Outlier detection is directly involved in a lot of domains. It has been used to detect and remove unwanted data instances from large datasets. These unusual patterns are known as outliers, anomalies, exceptions, aberrations or surprises in different application domains [3]. In data mining, outlier detection methods are typically non-parametric and designed to manage large databases over high-dimensional spaces. Separating out the outliers improves the quality of the data and reduces the impact of wrong values in the research process, so isolating outliers can improve the quality of stored data. Based on the observations, outlier detection algorithms are applied within various machine learning categories; the three fundamental approaches are supervised, semi-supervised and unsupervised outlier detection techniques. In supervised techniques, classification is the essential machine learning concept: the primary aim of the supervised approach is to learn from a set of labelled data instances (training) and then classify an unseen instance into one of the classes (testing). The data portions that fall outside these classes are treated as outliers. Various types of classification algorithms are used for detecting outliers, such as neural networks, Bayesian networks, support vector machines (SVM), decision trees and regression models; these techniques classify a new observation as normal or as an outlier. In semi-supervised outlier detection techniques, some applications have training data only for the normal class or only for the abnormal classes. In this method, a one-class classifier learns a limit around the normal objects and labels any test object outside this limit as an outlier [5]. In both supervised and semi-supervised methods, unseen data objects are sometimes declared as outliers; in such cases, a threshold is required to decide whether a particular data object is an outlier [Chow et al., 1970], and generalisation and rejection schemes have been used to address this problem [Jain et al., 1988].
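To make the one-class idea described above concrete, the following is a minimal sketch written for this survey (not code from any of the surveyed papers); the helper functions and the quantile value are hypothetical. It assumes the normal training objects fit in a NumPy array, learns a centre and a distance limit from them, and labels any test object outside that limit as an outlier.

import numpy as np

def fit_one_class_boundary(normal_data, quantile=0.95):
    # Learn a simple spherical limit around the normal training objects.
    centre = normal_data.mean(axis=0)
    distances = np.linalg.norm(normal_data - centre, axis=1)
    limit = np.quantile(distances, quantile)   # distance limit around the normal class
    return centre, limit

def is_outlier(points, centre, limit):
    # A test object lying outside the learned limit is declared an outlier.
    return np.linalg.norm(points - centre, axis=1) > limit

# Toy usage: normal objects cluster near the origin; one far-away test point.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))
centre, limit = fit_one_class_boundary(normal)
tests = np.array([[0.2, -0.1], [8.0, 8.0]])
print(is_outlier(tests, centre, limit))        # expected: [False  True]

The quantile, like everything else in this sketch, is an assumed illustrative choice; real one-class classifiers (for example one-class SVMs) learn far more flexible boundaries.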

In the unsupervised method, cluster analysis is a popular machine learning technique for grouping similar data objects into clusters. Here the outliers are determined with no prior knowledge of the data: outliers may be detected by clustering, where values which are similar are organised into groups or clusters, and the values that fall outside of the set of clusters may be considered as outliers [Loureiro et al., 2004]. Many data mining algorithms try to minimise the influence of outliers or eliminate them altogether [4].

In cluster-based outlier techniques, normal data objects belong to large and dense clusters, while outliers form very small clusters. Compared with the supervised approaches, unsupervised data mining approaches are more feasible. There are two such clustering methods: density-based and partition-based clustering. The density-based method can produce the outlying objects along with the normal clusters. Partitioning techniques divide the objects into multiple partitions, where every single partition is called a cluster; the objects within a single cluster have similar characteristics, whereas the objects of different clusters have dissimilar characteristics in terms of the dataset attributes. A distance measure over the feature space is used to identify the similarity or dissimilarity of patterns between data objects. k-Means, k-Medoids and CLARANS are partitioning algorithms [4]. The partition-based clustering method is used for distance-based outlier detection. There are many techniques that are used to represent the outliers. Two steps are involved in outlier detection when the user analyses the data: first, outliers around the data set are identified using a set of inliers; second, data requests are analysed and identified as outliers when their attributes differ from the attributes of the inliers [4].

Applications of cluster analysis include economic data classification, pattern recognition, image processing, text mining, etc. A single algorithm is not efficient enough to tackle problems from different areas. In this research work, some algorithms are presented that are based on the distance between two objects. The purpose of the study is to minimise the distance of each and every object from the centre of the cluster to which the object belongs. A clustering technique can group the data into a number of clusters; it reduces the size of the database, which in turn reduces computation time. For each cluster the user can give a certain radius to find outliers in data streams. A wide range of streaming data, such as network flows and sensor data, is being generated, and analysing and mining streams of data are interesting challenges for developers [1, 2]. Due to the dynamic nature of evolving data streams, the roles of outliers and clusters are often exchanged; consequently, new clusters often emerge while old clusters fade out, and the task becomes more complex when noise exists. Generally, outliers do not belong to any cluster or belong to very small clusters, but they can be forced to belong to a cluster from whose other members they are very different. This paper is organised as follows: Chapter II describes the different methods and approaches for outlier detection; Chapter III discusses the various clustering methods in outlier detection techniques; Chapter IV provides a compact survey of existing outlier detection techniques using k-Means and k-Medoids in the partitioning cluster method; finally, Chapter V presents the conclusion and further enhancements.

II. METHODS AND APPROACHES OF OUTLIER DETECTION

In recent days many approaches are used to detect outliers over streaming data, such as statistical distribution based, distance-based, depth-based and density-based outlier detection. In the statistical distribution based approach, many tests are performed for single attributes. As in one-dimensional procedures, the distribution mean (measuring the location) and the variance-covariance (measuring the shape) are the two most commonly used statistics for data analysis in the presence of outliers, as noted by Rousseeuw and Leroy in 1987; the use of robust estimates of the multidimensional distribution parameters can often improve the performance of outlier detection [9]. A statistical outlier detection model generates a distribution for the given data set. The data are represented as multidimensional data where some attributes are discrete variables (e.g. IP address) while others are continuous (time, duration, source bytes, etc.). Kenji Yamanishi et al. in [6] proposed a Gaussian mixture model for continuous data; here a Gaussian mixture model takes the form of a linear combination of a finite number of Gaussian distributions. In the statistical method the data distribution may be unknown, yet the method requires knowledge about parameters of the data set, such as the data distribution. The distance-based approach was constructed to overcome the problems arising from the statistical approach.

Distance-based methods were originally proposed by Knorr and Ng [10]. The notion of outliers studied there is defined as follows: an object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O. The term DB(p, D)-outlier is used as shorthand notation for a Distance-Based outlier detected using the parameters p and D [3], where the values of p and D are decided by the user. This notion is suitable for situations where the observed distribution does not fit any standard distribution, and it is well defined for k-dimensional datasets for any value of k [Mahalanobis, 1936]. Silvia Cateni and Valentina Colla considered various distance metrics to be used for the outlying degree.

Figure 1: Example of outliers.

For example, the Mahalanobis distance is defined as in equation (1), where X is a data vector, μ is the centre of mass of the data set and C is the covariance matrix:

DM(X) = √( (X − μ)^T C^(−1) (X − μ) )                (1)

The distance between each point and the centre of mass is given by the Mahalanobis distance; if the covariance matrix is the identity matrix, the Mahalanobis distance becomes the Euclidean distance. Data points located far from the centre of mass are declared as outliers [30]. Unlike static data, streaming data are not of fixed length, and data streams may be time series and multi-dimensional [12]. Distance-based outlier detection for streaming data can be handled using dynamic cluster maintenance. The distance-based approach, however, operates on the whole data, cannot give the number of clusters and, even though the computation time increases, gives only one value as the most expected outlier. As Figure 1 illustrates, in a given set of n data points, the objects which differ significantly from the other data are called outliers.
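As a concrete illustration of the two distance-based notions above, the following is a minimal sketch written for this survey (not code from Knorr and Ng or from [30]); it assumes the data fit in memory as a NumPy array. The first function implements the DB(p, D)-outlier test, and the second computes the Mahalanobis distance of equation (1) for every point.

import numpy as np

def db_outliers(data, p, D):
    # DB(p, D)-outliers: points for which at least a fraction p of the
    # objects in the data set lie at a distance greater than D.
    n = len(data)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        dist = np.linalg.norm(data - data[i], axis=1)
        flags[i] = np.count_nonzero(dist > D) / n >= p
    return flags

def mahalanobis_distance(data):
    # Mahalanobis distance of every point from the centre of mass (equation 1).
    mu = data.mean(axis=0)
    C_inv = np.linalg.pinv(np.cov(data, rowvar=False))   # pseudo-inverse for stability
    diff = data - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, C_inv, diff))

# Toy usage: 200 points around the origin plus one distant point.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[10.0, 10.0]]])
print(np.where(db_outliers(X, p=0.95, D=4.0))[0])   # the distant point is flagged
print(mahalanobis_distance(X)[-1])                  # large distance for that point

The values of p and D are, as the text notes, decided by the user; the ones above are chosen only for the toy data.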
Bakar, Zuriana Abu, and Rosmayati Mohemad discussed the performance of control chart, linear regression and Manhattan distance techniques for outlier detection in data mining. Experimental results showed that the outlier detection technique using the control chart is better than the technique modelled from linear regression, because the number of outlier data points detected by the control chart is smaller and better than with linear regression. Further experimental studies showed that the Manhattan distance technique is the best when compared with the other techniques (the distance-based and statistical-based approaches) as the threshold values increase [15].

Depth-based methods are data-driven and avoid strong distributional assumptions. They provide visualization of the data set via depth, bounded for a low-dimensional input space; however, most of the current depth-based methods do not scale up with the dimensionality of the input space [13]. Chen, Yixin et al. presented a novel statistical depth, the kernelized spatial depth (KSD). By choosing a proper kernel, the KSD can capture the local structure of a data set where the spatial depth fails; they demonstrated this with half-moon data and ring-shaped data. Based on the KSD, they proposed a novel outlier detection algorithm, by which an observation with a depth value less than a threshold is declared an outlier, evaluated on synthetic data and data sets from real applications. The proposed outlier detector was compared with existing methods, and the KSD-based outlier detection demonstrated competitive performance on all data sets tested. Dang, Xin, and Serfling carried out research work in 2006 based on depth functions, which order multidimensional data points by "outlyingness" measures and generate contours following the shape of the data set; such multivariate outlier detection is nonparametric and, with typical choices of depth function, robust. For depth-based outlier identifiers, masking and swamping breakdown points are defined, and the values of these robustness measures are derived for three depth functions: the spatial, the projection and the generalized Tukey depth [13][31].

The density-based method can be seen as a non-parametric approach, where clusters are regarded as areas of high density (relying on some unknown density distribution), as described by Sander, Jörg, Martin Ester et al. in 1998. Unlike parametric approaches, which try to approximate the unknown density distribution generating the data by mixtures of k densities (e.g., Gaussian distributions), density-based clustering methods do not require the number of clusters as input and do not make any specific assumptions concerning the nature of the density distribution. As a result, density-based methods do not readily provide models or otherwise compressed descriptions for the discovered clusters. A computationally efficient method for density-based clustering on static data sets is, e.g., DBSCAN [Ester, Martin, Hans-Peter Kriegel, 1996]. In the density-based approach, outlier detection is done by comparing the density around a particular data point with the density around its neighbours; the data points having a low density are declared as outliers. Density-based models require the careful setting of several parameters, have quadratic time complexity and may rule out outliers that lie close to some non-outlier patterns of low density [17].

Algorithm k-Means (k, D)
1  choose k data points as the initial centroids (cluster centres)
2  repeat
3      for each data point x ∈ D do
4          compute the distance from x to each centroid;
5          assign x to the closest centroid   // a centroid represents a cluster
6      end for
7      recompute the centroids using the current cluster memberships
8  until the stopping criterion is met;

Figure 2: k-Means clustering algorithm.
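The pseudocode of Figure 2 can be turned into a short runnable sketch. The version below is an illustration written for this survey (not code from the surveyed papers): it implements the same loop with NumPy and then applies the idea, mentioned in the introduction, of giving each cluster a certain radius, so that points lying farther than that radius from their assigned centroid are reported as outliers.

import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    # Plain k-Means following Figure 2: assign each point to the closest
    # centroid, recompute the centroids, repeat until they stop moving.
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stopping criterion
            break
        centroids = new_centroids
    return centroids, labels

def radius_outliers(data, centroids, labels, radius):
    # Points lying farther than the given radius from their own centroid.
    return np.linalg.norm(data - centroids[labels], axis=1) > radius

# Toy usage: two dense groups plus one isolated point.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2)),
               [[10.0, -10.0]]])
centroids, labels = k_means(X, k=2)
print(np.where(radius_outliers(X, centroids, labels, radius=3.0))[0])
# Expected: the isolated point (index 200); the exact result can vary with
# the randomly chosen initial centroids.

The radius is an assumed illustrative parameter; in practice it would be tuned per cluster, for example from the distribution of distances within each cluster.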
Markus M. Breunig, Hans-Peter Kriegel et al. proposed the local outlier factor (LOF) to measure local outliers; the LOF of a point is a ratio of the local density of that point to the local density of its nearest neighbours, and a data point whose LOF value is high is declared an outlier. [Sander, Jörg, Martin Ester et al., 1998] In this research, the clustering algorithm GDBSCAN generalizes the density-based algorithm DBSCAN (Ester et al., 1996) in two important ways: GDBSCAN can cluster point objects as well as spatially extended objects, according to both their spatial and their non-spatial attributes. A performance evaluation, analytical as well as experimental, showed the effectiveness and efficiency of GDBSCAN on large spatial databases.

III. OUTLIER DETECTION USING CLUSTERING METHODS

Clustering algorithms are classified into the following types: partitioned clustering, hierarchical clustering, density-based clustering and grid-based clustering [4]. Clustering is a process of partitioning data sets into sub-classes known as clusters; it gives us the natural grouping of the data set. It is an unsupervised classification, which means it has no predefined classes. This paper presents the various partitioning techniques in clustering algorithms and their individual advantages. Outlier detection is one of the challenging areas in data streams. In CluStream [1], the algorithm continuously maintains a fixed number of micro-clusters. Such an approach is especially risky when the data stream contains noise, because a lot of new micro-clusters will be created for the outliers and many existing micro-clusters will be deleted or merged; ideally, the streaming algorithm should provide some mechanism to distinguish the seeds of new clusters from the outliers. The discovery of the patterns hidden in streaming data imposes a great challenge for cluster analysis. Cao et al. proposed a new algorithm, DenStream, for clustering an evolving data stream; the method finds clusters of arbitrary shape in data streams and is insensitive to noise. The structures of p-micro-clusters and o-micro-clusters maintain sufficient information for clustering, and a novel pruning strategy is designed to reduce the memory utilisation. The results of the research work, carried out on a number of synthetic and real data sets, show the efficiency of DenStream in discovering clusters of arbitrary shape in data streams [16]. HPStream [1] introduces the concept of a projected cluster to data streams; however, it cannot be used to discover clusters of arbitrary orientations in data streams. The discovery of the patterns hidden in streaming data is a worthwhile effort for cluster analysis, the aim of clustering being to group the streaming data into meaningful classes [16].

Irene Ntoutsi, Arthur Zimek et al. developed a density-based projected clustering algorithm, HDDStream, for high-dimensional data streams. This work summarizes both the data points and the dimensions where these points are grouped together, and maintains these summaries online as new points arrive over time and old points expire due to ageing. The results illustrated the effectiveness and the efficiency of HDDStream. The Forest Cover Type dataset from the UCI KDD Archive contains data on different forest cover types, comprising 581,012 records; the challenge in this dataset is to predict the correct cover type from cartographic variables. The problem is defined by 54 variables of different types: 10 quantitative variables, 4 binary wilderness-area attributes and 40 binary soil-type variables, and the class attribute contains 7 different forest cover types. In this research, the clustering quality of HDDStream was superior to the clustering quality of the canonical competitor [28].

S. D. Pachgade and S. S. Dhande constructed a hybrid approach for outlier detection. Due to the reduction in the size of the dataset, the computation time is reduced. A threshold value is then taken from the user, and the outliers are calculated according to the given threshold value for each cluster in order to get the outliers within a cluster. The hybrid approach takes less computation time, but it still needs to be implemented on more complex datasets. Experiments were conducted in Matlab 7.8.0 (R2009a) on various data sets collected from the UCI machine learning repository, which provides various types of datasets that can be used for clustering, classification and regression; each dataset has multiple attributes and instances, and the data file formats are .data, .xls, .txt and .csv. These data are useful for finding cluster-based outliers [19]. "Density-based clustering for real-time stream data" by Chen, Yixin, and Li Tu constructs an algorithm, D-Stream, for clustering stream data using a density-based approach. The algorithm uses an online component which maps every input record into a grid and an offline component which computes the grid density and clusters the grids depending on their density; the algorithm can find clusters of arbitrary shape. The researchers compared the quality of the clustering results of D-Stream with those of CluStream [13]. Due to the non-convexity of the synthetic data sets, CluStream cannot get a correct result, so its quality cannot be compared to that of D-Stream; therefore, they compared only the sum of squared distances (SSQ) of the two algorithms on the network intrusion data from KDD CUP-99. Computations are made to detect and remove the sporadic grids in order to dramatically improve the space and time efficiency without affecting the clustering results for high-speed data stream clustering. Both algorithms were tested on KDD CUP-99; D-Stream is 3.5 to 11 times faster than CluStream and scales with better results.

Frank Rehm and Frank Klawonn et al. discussed an algorithm to calculate the noise distance in noise clustering based on the preservation of the hyper-volume of the feature space. They applied noise clustering (NC) to FCM; other clustering algorithms, such as GK, GG and other prototype-based clustering algorithms, can be adapted as well. The aim of the study is not only to reduce the influence of outliers, but also to clearly identify them [21].

S. Vijayarani et al. discussed research work in which two partitioning clustering algorithms, CLARANS and ECLARANS (Enhanced CLARANS), are used for detecting the outliers in data streams. Two performance factors, clustering accuracy and outlier detection accuracy, are used for observation. By examining the computational results, it is observed that the proposed ECLARANS clustering algorithm is more accurate and more efficient than the existing CLARANS algorithm. Vijayarani, S., and P. Jothi analysed the clustering and outlier performance of BIRCH with CLARANS and BIRCH with k-Means clustering algorithms for detecting outliers. They used two biological data sets, Pima Indian diabetes and breast cancer (Wisconsin). From that research, the clustering and outlier detection accuracy is more efficient in BIRCH with CLARANS clustering than in BIRCH with k-Means clustering [24]. S. Vijayarani and P. Jothi also compared two clustering algorithms, CURE with k-Means and CURE with CLARANS, for finding the outliers in data streams; various types of data sets and two performance factors, clustering accuracy and outlier detection accuracy, were used for the analysis. From that research work, the proposed CURE with CLARANS clustering algorithm is more accurate than the existing algorithm, CURE with k-Means.

IV. OUTLIER DETECTION BY k-MEANS AND k-MEDOIDS

In this paper, the well-known partitioning based methods k-Means and k-Medoids are compared. The k-Means algorithm is a centroid-based technique: it takes an input parameter k and partitions a set of n objects into k clusters. The similarity between clusters is measured with regard to the mean value of the objects. The random selection of k objects, which represent the cluster means or centres, is the first step of the algorithm; the other objects are then assigned to the most similar cluster. The survey given here explores the behaviour of these two methods. Elahi et al. proposed an efficient clustering-based outlier detection algorithm for dynamic data streams. This research work depends on a clustering-based approach that splits the stream into chunks and clusters each chunk, with a stable number of clusters, using k-Means. It also retains the candidate outliers and the mean value of every cluster for the next fixed number of stream chunks, instead of keeping only the summary information that is utilised in clustering the data stream, so as to ensure that the discovered candidate outliers are real outliers; it is better to decide the outliers among data stream objects by using the mean values of the current chunk of the stream together with the mean values of the clusters of the previous chunk [22] (a rough illustrative sketch of such a chunk-based scheme is given below).

Rajendra Pamula and Jatindra Kumar Deka et al. carried out the research work titled "An Outlier Detection Method based on Clustering", which uses a clustering method to capture outliers. The k-Means clustering algorithm is applied to divide the data set into clusters. The points which lie near the centroid of a cluster are not probable candidates for outliers, so such points are pruned out of each cluster, and a distance-based outlier score is then calculated for the remaining points. The computations needed to calculate the outlier score are reduced considerably due to the pruning of some points. The results demonstrate that even though the number of computations is smaller, the proposed method performs better than the existing methods [23].
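To show how a chunk-based scheme of this kind can be wired together, here is a rough sketch in the spirit of the approach of Elahi et al. described above; it is a simplification written for this survey (not their published algorithm) and it assumes scikit-learn's KMeans is available. Each chunk is clustered with k-Means; points that are far from every centroid, or that fall into very small clusters, become candidate outliers, and candidates that remain far from the centroids of the next chunk are declared real outliers. The threshold and minimum cluster size are assumed illustrative parameters, not values from the paper.

import numpy as np
from sklearn.cluster import KMeans

def stream_outliers(chunks, k=3, threshold=5.0, min_size=5):
    # Process the stream chunk by chunk, carrying candidate outliers forward.
    candidates = np.empty((0, chunks[0].shape[1]))
    confirmed = []
    for chunk in chunks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(chunk)
        centroids = km.cluster_centers_

        # Re-check candidates from the previous chunk against the new centroids.
        if len(candidates):
            d = np.linalg.norm(candidates[:, None, :] - centroids[None, :, :],
                               axis=2).min(axis=1)
            confirmed.extend(candidates[d > threshold])   # still isolated: real outliers
            # candidates within the threshold are absorbed by a cluster and dropped
            candidates = np.empty((0, chunk.shape[1]))

        # Points far from every centroid, or in tiny clusters, become candidates.
        d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
        tiny = np.bincount(km.labels_, minlength=k) < min_size
        candidates = np.vstack([candidates, chunk[(d > threshold) | tiny[km.labels_]]])
    return np.array(confirmed), candidates

# Toy usage: two chunks of points around the origin, each with one injected outlier.
rng = np.random.default_rng(3)
chunk1 = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[12.0, 12.0]]])
chunk2 = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[-11.0, 9.0]]])
confirmed, pending = stream_outliers([chunk1, chunk2])
print(confirmed)   # the outlier injected into chunk 1, confirmed against chunk 2
print(pending)     # the outlier injected into chunk 2, still awaiting confirmation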
Christopher, T. and T. Divya analysed the performance of the CURE with k-Means and CURE with CLARANS clustering algorithms. From the experimental results it is observed that the outlier detection accuracy is more efficient in CURE with CLARANS clustering when compared to CURE with k-Means clustering. Neeraj Bansal and Amit Chugh compared the results of different clustering techniques in terms of time complexity and proposed a new solution by adding fuzziness to already existing clustering [25]. Aruna Bhat explored a novel technique for face recognition by performing classification of the face images using an unsupervised learning approach through k-Medoids clustering. The Partitioning Around Medoids (PAM) algorithm has been used for performing the k-Medoids clustering of the data. The accuracy of k-Medoids clustering using PAM was observed to be higher than that of k-Means clustering for all the data sets where outliers and noise were present, and the PAM algorithm recognised the faces with different expressions more accurately [26]. Rani, Deevi Radha, et al. proposed a weighted k-Means that assigns weights to the variables for dynamic data streams. Because the k-Means process cannot select the variables automatically for clustering and is not efficient for large data sets, weights are assigned to the variables. In this process they considered three variables, identified the initial cluster centroids, assigned initial weights of 0.42, 0.64 and 0.13 to the variables, and evaluated the value of the objective function before assigning the weights. The clustering process is then recomputed with the newly arriving data, which become inliers, so that useful information is not lost, and this is carried out until the user-specified threshold value is reached. The experimental results show that weighted k-Means is more efficient for detecting outliers [29]. Table 1 summarises the various methods proposed by different researchers for outlier detection in data mining clusters and data streams.

Table 1: Comparison of Various Articles

Paper Ref. No | Methods Used | Data Sets / Applications Used | Results (Accuracy / Sensitivity)
13, 31 | D-Stream | Network intrusion detection stream data (MIT Lincoln Laboratory) | Accuracy 96.5, Sensitivity --
16 | DenStream | Network intrusion detection set and charitable donation data set (KDD CUP'99) | Accuracy 94%, Sensitivity 95%
17 | DBSCAN | SEQUOIA 2000 benchmark data | Suitable for large spatial databases
19 | Modified and hybrid clustering approach | Medical data set, WDBC (diagnosis) | Results are visualized, less computation time
21 | FCM | Benchmark data set and weather data set | Outliers are clearly identified
23 | k-Means | Medical data set | The number of computations is less, with better performance
24 | BIRCH with k-Means and BIRCH with CLARANS | Pima Indian diabetes dataset | Accuracy is more efficient in BIRCH with CLARANS
25 | CURE with k-Means and CURE with CLARANS | Breast Cancer Wisconsin and Pima Indian data set (768 instances and 8 attributes) | More accuracy in CURE with CLARANS
26 | k-Means and (PAM) k-Medoids | Eigenfaces | (PAM) k-Medoids produces better results
28 | HDDStream | Forest cover type data from UCI | Detects drastic changes in the underlying stream

V. CONCLUSION

Outlier detection in data streams has become a subject of dynamic research in areas of computer science such as distributed systems, database systems and data mining, and a lot of research work has been carried out in this field to develop efficient clustering algorithms for data streams. In this paper, popular clustering-based outlier detection algorithms are surveyed and discussed. In the statistical method, the data distribution may be unknown. Streaming data are not of fixed length like static data, and distance-based outlier detection may be useful for working with time series and multi-dimensional streaming data. In the density-based approach, accuracy is guaranteed for text data and image data. These methods can be organised by many criteria, one of them being whether they work directly with data streams. Most clustering algorithms are not capable of finding outliers in a data stream. In addition, this paper discusses the partition-based clustering methods k-Means and (PAM) k-Medoids for data streams. k-Means is computationally expensive but is most useful for dynamic data streams, and CLARANS is the best algorithm for high-dimensional data; (PAM) k-Medoids, however, is only used for small sets of data items and has less accuracy when compared to k-Means. The future work is to develop an effective clustering algorithm for detecting outliers in data streams, considering the merits and demerits of the surveyed methodologies.
References

[1] Aggarwal, Charu C., Jiawei Han, Jianyong Wang, and Philip S. Yu. "A framework for projected clustering of high dimensional data streams." In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Vol. 30, pp. 852-863, 2004.
[2] Hawkins, Douglas M. Identification of Outliers. Vol. 11. London: Chapman and Hall, 1980.
[3] Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys 41, no. 3 (2009).
[4] Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann Publishers, San Francisco, CA, USA, 2006.
[5] Chawla, Sanjay, and Pei Sun. "Outlier detection: Principles, techniques and applications." In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 2006.
[6] Yamanishi, Kenji, Jun-Ichi Takeuchi, Graham Williams, and Peter Milne. "On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms." Data Mining and Knowledge Discovery 8, no. 3 (2004): 275-300.
[7] Sivaram, Saveetha. "An efficient algorithm for outlier detections." Global Journal of Advance Engineering and Technologies, Vol. 2, pp. 35-40, January 2013.
[8] Yamanishi, J. T. K., and Y. Maruyama. "Data mining for security." NEC Journal of Advanced Technology 2, no. 1 (2005): 63.
[9] Leroy, Annick M., and Peter J. Rousseeuw. Robust Regression and Outlier Detection. Wiley Series in Probability and Mathematical Statistics. New York: Wiley, 1987.
[10] Knorr, Edwin M., and Raymond T. Ng. "Algorithms for mining distance-based outliers in large datasets." In Proceedings of the International Conference on Very Large Data Bases, pp. 392-403, 1998.
[11] He, Zengyou, Xiaofei Xu, and Shengchun Deng. "Discovering cluster-based local outliers." Pattern Recognition Letters 24, no. 9 (2003): 1641-1650.
[12] Bu, Yingyi, Lei Chen, Ada Wai-Chee Fu, and Dawei Liu. "Efficient anomaly monitoring over moving object trajectory streams." In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 159-168. ACM, 2009.
[13] Chen, Yixin, Xin Dang, Hanxiang Peng, and Henry L. Bart. "Outlier detection with the kernelized spatial depth function." IEEE Transactions on Pattern Analysis and Machine Intelligence 31, no. 2 (2009): 288-305.
[14] Dang, Xin, and Robert Serfling. "Nonparametric depth-based multivariate outlier identifiers and robustness properties." Submitted for journal publication, 2006.
[15] Bakar, Zuriana Abu, Rosmayati Mohemad, Akbar Ahmad, and Mustafa Mat Deris. "A comparative study for outlier detection techniques in data mining." In IEEE Conference on Cybernetics and Intelligent Systems, pp. 1-6. IEEE, 2006.
[16] Cao, Feng, Martin Ester, Weining Qian, and Aoying Zhou. "Density-based clustering over an evolving data stream with noise." In SDM, Vol. 6, pp. 328-339, 2006.
[17] Sander, Jörg, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. "Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications." Data Mining and Knowledge Discovery 2, no. 2 (1998): 169-194.
[18] Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. "A density-based algorithm for discovering clusters in large spatial databases with noise." In KDD, Vol. 96, no. 34, pp. 226-231, 1996.
[19] Pachgade, S. D., and S. S. Dhande. "Outlier detection over data set using cluster-based and distance-based approach." International Journal of Advanced Research in Computer Science and Software Engineering 2, no. 6 (2012): 12-16.
[20] Chen, Yixin, and Li Tu. "Density-based clustering for real-time stream data." In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133-142. ACM, 2007.
[21] Rehm, Frank, Frank Klawonn, and Rudolf Kruse. "A novel approach to noise clustering for outlier detection." Soft Computing 11, no. 5 (2007): 489-494.
[22] Elahi, Manzoor, Kun Li, Wasif Nisar, Xinjie Lv, and Hongan Wang. "Efficient clustering-based outlier detection algorithm for dynamic data stream." In Fifth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'08), Vol. 5, pp. 298-304. IEEE, 2008.
[23] Pamula, Rajendra, Jatindra Kumar Deka, and Sukumar Nandi. "An outlier detection method based on clustering." In Second International Conference on Emerging Applications of Information Technology (EAIT), pp. 253-256. IEEE, 2011.
[24] Vijayarani, S., and P. Jothi. "An efficient clustering algorithm for outlier detection in data streams." International Journal of Advanced Research in Computer and Communication Engineering 2, no. 9 (2013): 3657-3665.
[25] Christopher, T., and T. Divya. "A study of clustering based algorithm for outlier detection in data streams." In Proceedings of the UGC Sponsored National Conference on Advanced Networking and Applications, 2015.
[26] Bhat, Aruna. "K-Medoids clustering using partitioning around medoids for performing face recognition." Int. J. Soft Comp. Mat. Cont 3, no. 3 (2014): 1-12.
[27] Singh, Shalini S., and N. C. Chauhan. "K-Means v/s K-Medoids: A comparative study." In National Conference on Recent Trends in Engineering & Technology, Vol. 13, 2011.
[28] Ntoutsi, Irene, Arthur Zimek, Themis Palpanas, Peer Kröger, and Hans-Peter Kriegel. "Density-based projected clustering over high dimensional data streams." In SDM, pp. 987-998, 2012.
[29] Rani, Deevi Radha, Navya Dhulipala, Tejaswi Pinniboyina, and Padmini Chattu. "Outlier detection for dynamic data streams using weighted k-Means." International Journal of Engineering Science and Technology 1, no. 3 (2011): 7484-7490.
[30] Mahalanobis, Prasanta Chandra. "On the generalized distance in statistics." Proceedings of the National Institute of Sciences (Calcutta) 2 (1936): 49-55.
[31] Dang, Xin, and Robert Serfling. "Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties." Journal of Statistical Planning and Inference 140, no. 1 (2010): 198-213.