Anomaly Detection On Data Streams With H
Abstract: Classification consists of assigning a class label to each unclassified case; supervised and unsupervised
classification methods are used to assign class labels. Classification is performed in two steps: learning or training (model
construction) and testing (model usage). The learning process is used to identify the class patterns from the labeled
transactions. In the testing phase, unlabeled transactions are assigned class values with reference to the learned class
patterns. An outlier is an observation that deviates so much from other observations as to arouse suspicion. Distance-based
outlier detection methods are used to identify records that differ from the rest of the data set. This outlier detection
process is also referred to as anomaly detection.
Batch-mode anomaly detection schemes are not suitable for large-scale data, since they require high
computational and memory resources. Principal component analysis (PCA) is an unsupervised dimension
reduction method that determines the principal directions of the data distribution. The online oversampling principal
component analysis (osPCA) algorithm is used to detect outliers from a large amount of data in an online fashion.
The oversampling-based principal component analysis (osPCA) method is enhanced to handle high dimensional
data values. The learning process is improved to manage dimensionality differences, the system is tuned to handle data
with a multi-cluster structure, and anomaly detection is extended to streaming data values.
I. INTRODUCTION
In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which
do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind
of problem such as bank fraud, a structural defect, medical problems or finding errors in text. Anomalies are also referred to
as outliers, novelties, noise, deviations and exceptions.
In particular, in the context of abuse and network intrusion detection, the interesting objects are often
not rare objects but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of
an outlier as a rare object, and many outlier detection methods will fail on such data unless it has been aggregated
appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect
anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by
looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require
a data set that has been labeled as "normal" and "abnormal" and involve training a classifier. Semi-supervised anomaly
detection techniques construct a model representing normal behavior from a given normal training data set, and then test
the likelihood that a test instance was generated by the learnt model.
Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault
detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances. It is often
used in preprocessing to remove anomalous data from the dataset. In supervised learning, removing the anomalous data
from the dataset often results in a statistically significant increase in accuracy.
Copyright @ IJIRCCE www.ijircce.com 3829
ISSN(Online): 2320-9801
ISSN (Print): 2320-9798
In the past, many outlier detection methods have been proposed [4], [3], [5]. Typically, these existing approaches
can be divided into three categories: distribution-, distance-, and density-based methods. Statistical approaches assume that
the data follows some standard or predetermined distribution, and this type of approach aims to find the outliers which
deviate from such distributions. However, most distribution models are assumed univariate, and thus the lack of robustness
for multidimensional data is a concern. Moreover, since these methods are typically implemented in the original data space
directly, their solution models might suffer from the noise present in the data. Furthermore, the assumption or the prior
knowledge of the data distribution is not easily determined for practical problems.
For distance-based methods [1], [6], the distances between each data point of interest and its neighbors are
calculated. If the result is above some predetermined threshold, the target instance will be considered as an outlier. While
no prior knowledge on data distribution is needed, these approaches might encounter problems when the data distribution is
complex. In such cases, this type of approach will result in determining improper neighbors, and thus outliers cannot be
correctly identified.
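A distance-based score of this kind can be sketched as follows; this is a generic k-th-nearest-neighbour variant, and the choice of k and of the decision threshold is left to the user:

```python
import numpy as np

def knn_distance_scores(X, k=3):
    """Distance-based outlier score: the distance from each point to its
    k-th nearest neighbour. Points whose score exceeds a predetermined
    threshold are treated as outliers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude self-distances
    return np.sort(D, axis=1)[:, k - 1]  # k-th smallest distance per point

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [20., 20.]])
scores = knn_distance_scores(X, k=2)
print(scores.argmax())   # the isolated point (index 4) has the largest score
```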
To alleviate the aforementioned problem, density-based methods are proposed [3]. One of the representatives of
this type of approach is to use a density-based local outlier factor (LOF) to measure the outlierness of each data instance.
Based on the local density of each data instance, the LOF determines the degree of outlierness, which provides suspicious
ranking scores for all samples. The most important property of the LOF is the ability to estimate local data structure via
density estimation. This allows users to identify outliers which are sheltered under a global data structure. However, it is
worth noting that the estimation of local data density for each instance is very computationally expensive, especially when
the size of the data set is large.
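The LOF idea can be sketched compactly; this is an unoptimised O(n^2) version written for illustration, whereas the computational expense noted above is exactly why practical implementations rely on indexing structures:

```python
import numpy as np

def lof_scores(X, k=3):
    """Local Outlier Factor: ratio of the neighbours' local density to a
    point's own density; scores well above 1 indicate outliers."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbour
    knn = np.argsort(D, axis=1)[:, :k]     # k nearest neighbours per point
    k_dist = D[np.arange(n), knn[:, -1]]   # distance to the k-th neighbour
    # reachability distance to each neighbour o: max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)         # local reachability density
    return lrd[knn].mean(axis=1) / lrd     # average neighbour density / own

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [0.5, 0.5], [10., 10.]])
scores = lof_scores(X, k=3)
print(scores.argmax())   # index 5: the distant point gets the largest LOF
```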
Besides the above work, some anomaly detection approaches are recently proposed [4], [5]. Among them, the
angle-based outlier detection (ABOD) method [4] is very unique. Simply speaking, ABOD calculates the variation of the
angles between each target instance and the remaining data points, since it is observed that an outlier will produce a smaller
angle variance than the normal ones do. It is not surprising that the major concern of ABOD is the computational complexity
due to the huge number of instance pairs to be considered. Consequently, a fast ABOD algorithm is proposed to generate an
approximation of the original ABOD solution. The difference between the standard and the fast ABOD approaches is that
the latter only considers the variance of the angles between the target instance and its k nearest neighbors. However, the
search of the nearest neighbors still prohibits its extension to large-scale problems, since the user will need to keep all data
instances to calculate the required angle information.
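The angle-variance observation can be sketched as follows; note this is a simplified, unweighted variant (the original ABOD additionally weights each pair by the inverse distances), kept minimal to show why an outlier produces a small angle variance:

```python
import numpy as np

def abod_scores(X):
    """Simplified angle-based outlier score: the variance of the cosines
    of angles spanned by a point and all pairs of other points. A small
    variance suggests an outlier, since the remaining data lies on one
    side of it. (The original ABOD also weights pairs by distance.)"""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]   # vectors to all other points
        norms = np.linalg.norm(diffs, axis=1)
        cos = (diffs @ diffs.T) / np.outer(norms, norms)
        iu = np.triu_indices(n - 1, k=1)         # each pair counted once
        scores[i] = cos[iu].var()
    return scores

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [0.5, 0.5], [20., 20.]])
print(abod_scores(X).argmin())   # index 5: the outlier sees a narrow angle spread
```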
It is worth noting that the above methods are typically implemented in batch mode, and thus they cannot be easily
extended to anomaly detection problems with streaming data or online settings. While some online or incremental-based
anomaly detection methods have been recently proposed [7], [2], we found that their computational cost or memory
requirements might not always satisfy online detection scenarios. For example, while the incremental LOF in [7] is able to
update the LOFs when receiving a new target instance, this incremental method needs to maintain a preferred (or filtered)
data subset. Thus, the memory requirement for the incremental LOF is O(np) [7], [2], where n and p are the size and
dimensionality of the data subset of interest, respectively.
Anomaly (or outlier) detection aims to identify a small group of instances which deviate remarkably from the
existing data. A well-known definition of “outlier”: “an observation which deviates so much from other observations as to
arouse suspicions that it was generated by a different mechanism,” which gives the general idea of an outlier and motivates
many anomaly detection methods. Practically, anomaly detection can be found in applications such as homeland security,
credit card fraud detection, intrusion and insider threat detection in cyber-security, fault detection, or malignant diagnosis.
However, this leave-one-out (LOO) anomaly detection procedure with an oversampling strategy will markedly increase the computational cost.
    max_U  Σ_i U^T (x_i − μ)(x_i − μ)^T U,   subject to U^T U = I,

where μ is the mean of the data and U is a matrix consisting of the k dominant eigenvectors. From this formulation, one can
see that standard PCA can be viewed as the task of determining a subspace where the projected data has the largest variation.
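The oversampling idea behind osPCA can be sketched in numpy as follows. This is only a sketch of the general description above, not the paper's exact online update: the oversampling ratio r and the 1 − |cosine| outlierness score are assumptions chosen to illustrate how duplicating a target instance perturbs the dominant eigenvector:

```python
import numpy as np

def dominant_direction(X):
    """First principal direction: eigenvector of the largest eigenvalue
    of the covariance matrix of the mean-centred data."""
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    return vecs[:, -1]   # eigh returns eigenvalues in ascending order

def ospca_score(X, i, r=0.1):
    """Outlierness of X[i]: oversample the target instance (duplicate it
    about r*n times), recompute the principal direction, and measure how
    far it deviates from the original one (1 - |cosine similarity|)."""
    u = dominant_direction(X)
    dup = np.repeat(X[i:i + 1], max(1, int(r * len(X))), axis=0)
    u_over = dominant_direction(np.vstack([X, dup]))
    return 1.0 - abs(u @ u_over)

# points along a line plus one instance sitting far off the line
X = np.array([[float(i), 0.0] for i in range(20)] + [[10.0, 8.0]])
print(ospca_score(X, 20) > ospca_score(X, 0))  # oversampling the outlier
                                               # rotates the direction more
```

A normal instance barely moves the principal direction when duplicated, while an outlier drags it noticeably; this is what makes the deviation usable as an outlierness score.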
Data Receiver
The data receiver listens on a TCP port to collect data from remote nodes. Received data values are updated into the
database. Missing data elements are assigned using aggregation functions. Different attribute types are handled in the data
collection process.
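The aggregation step for missing elements might look like the sketch below; the column-mean rule and the use of `None` to mark a missing numeric attribute are assumptions made for illustration:

```python
def fill_missing(records):
    """Replace missing (None) entries with the mean of the observed
    values in the same column -- a simple aggregation function."""
    columns = list(zip(*records))
    means = []
    for col in columns:
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[v if v is not None else means[j] for j, v in enumerate(row)]
            for row in records]

print(fill_missing([[1.0, None], [3.0, 4.0], [5.0, 6.0]]))
# the missing second attribute is replaced by the column mean 5.0
```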
[Figure: System architecture - the data sender uploads data to the data receiver; after data cleaning and clustering, labeled data is used to update the class patterns, while unlabeled data passes through similarity analysis for anomaly detection.]
Learning Process
The learning process is performed to identify the class patterns. Labeled transactions are used in the learning process,
which is initiated on the clustered data values. A multi-cluster structure is adopted for the learning process.
Anomaly Detection
Anomaly detection is applied on streaming data values using the learned patterns. Dimensionality variations are
considered in the detection process, and patterns are compared using similarity measures.
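The similarity comparison might be sketched as follows, using cosine similarity against the learned patterns; the 0.9 cut-off is an assumed threshold, not one taken from the system:

```python
import numpy as np

def is_anomaly(x, patterns, threshold=0.9):
    """Flag x as anomalous when its best cosine similarity against
    every learned pattern falls below the (assumed) threshold."""
    sims = [float(x @ p) / (np.linalg.norm(x) * np.linalg.norm(p))
            for p in patterns]
    return max(sims) < threshold

patterns = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(is_anomaly(np.array([2.0, 0.1]), patterns))   # False: matches a pattern
print(is_anomaly(np.array([1.0, 1.0]), patterns))   # True: matches none well
```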
V. CONCLUSION
Anomaly detection methods are used to detect anomalous data values in a data collection. The principal component
analysis model is used to handle the dimensionality reduction process. The system enhances the oversampling-based
principal component analysis (osPCA) model to support data with a multi-cluster structure and to discover
anomalies in high dimensional data values from data streams. A high dimensional data classification model is supported by
the system, which performs classification on data streams. False positive and false negative errors are reduced,
and classification accuracy is improved by the enhancement of the oversampling-based principal component analysis
method.
REFERENCES
[1] Angiulli F., Basta S., and Pizzuti C., "Distance-Based Detection and Prediction of Outliers," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 145-160, 2006.
[2] Ahmed T., "Online Anomaly Detection using KDE," Proc. IEEE Conf. Global Telecomm., 2009.
[3] Jin W., Tung A. K. H., Han J., and Wang W., "Ranking Outliers Using Symmetric Neighborhood Relationship," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2006.
[4] Kriegel H.-P., Schubert M., and Zimek A., "Angle-Based Outlier Detection in High-Dimensional Data," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2008.
[5] Kriegel H.-P., Kröger P., Schubert E., and Zimek A., "Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2009.
[6] Khoa N. L. D. and Chawla S., "Robust Outlier Detection Using Commute Time and Eigenspace Embedding," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2010.
[7] Rawat S., Pujari A. K., and Gulati V. P., "On the Use of Singular Value Decomposition for a Fast Intrusion Detection System," Electronic Notes in Theoretical Computer Science, vol. 142, no. 3, pp. 215-228, 2006.
[8] Pokrajac D., Lazarevic A., and Latecki L., "Incremental Local Outlier Detection for Data Streams," Proc. IEEE Symp. Computational Intelligence and Data Mining, 2007.
[9] Yeh Y.-R., Lee Z.-Y., and Lee Y.-J., "Anomaly Detection via Oversampling Principal Component Analysis," Proc. First KES Int'l Symp. Intelligent Decision Technologies, pp. 449-458, 2009.