
International Journal of Research p-ISSN: 2348-6848

e-ISSN: 2348-795X
Available at https://edupediapublications.org/journals
Volume 03 Issue 10
June 2016

Reverse Accessible in Local Outlier Factor Density Based Recognition

N V S K Vijaya Lakshmi K¹ & David Raju Kuppala²
¹Assistant Professor, Dept. of IT, Sir C R Reddy College of Engineering, Eluru, Andhra Pradesh.
²Assistant Professor, Dept. of CSE, K L University, Vaddeswaram, Guntur, Andhra Pradesh.

Abstract: Recent data mining research has shown that, as dimensionality increases, datasets exhibit hubs and anti-hubs: hubs are points that frequently occur in k-nearest-neighbor (kNN) lists, while anti-hubs are points that appear in kNN lists only rarely. This paper develops and compares unsupervised outlier detection models, focusing on the development and analysis of two detection methods, the Local Outlier Factor (LOF) and the Local Distance-Based Outlier Factor (LDOF), which improve on previous systems with respect to speed, complexity, and efficiency. Classification algorithms are used to find relevant features and classify instances according to the chosen criteria in data mining, but these techniques suffer as the complexity, size, and variety of datasets increase. The proposed incremental LOF algorithm achieves detection performance equivalent to the iterated static LOF algorithm while requiring significantly less computation time. In addition, the incremental LOF algorithm adapts dynamically as data points change, which matters in applications whose data profiles change over time. Finally, we give a broad comparison of a number of outlier factor models.
Index Terms: clustering-based, density-based and model-based approaches; nearest neighbour; outlier detection; discrimination; outliers; data mining; clustering; neural network.

1. INTRODUCTION

Outlier detection, or anomaly detection, means detecting data patterns that do not conform to, or lie distant from, the other observations [5]. Outliers can have many anomalous causes: they arise from changes in system behavior, fraudulent behavior, human error, instrument error, or simply natural deviations in populations, and they may carry critical, actionable information in fraud detection, intrusion detection, and medical diagnosis. Data mining is a non-trivial method of identifying valid, novel, potentially useful, and ultimately understandable patterns [1]. It has become an important tool for converting data into information and is widely used in fraud detection, marketing, and scientific discovery; it refers to extracting hidden, interesting patterns from large datasets and databases [2]. Mining uncovers the patterns of the data and can be carried out on a sample of the data, but the mining process will fail if the samples are not a good representation of the larger body of data. Automated identification of suspicious behavior and objects [3] based on information extracted from video streams is currently an active research area; other potential applications include traffic control and surveillance of commercial and residential
Available online: http://internationaljournalofresearch.org/ Page | 147



buildings. These tasks are characterized by the need for real-time processing and by dynamic, non-stationary, and often noisy environments. Hence there is a need for incremental outlier detection that can adapt to novel behavior and provide timely identification of unusual events [4]. The task of identifying outliers can be classified as supervised, semi-supervised, or unsupervised, depending on the availability of labels for outliers and/or normal instances. Supervised outlier detection techniques have an explicit notion of normal and outlier behavior, so precise models can be built; the drawback is that precisely labeled training data are required. Among these categories, unsupervised methods are the most widely applied, because the other categories require accurate and representative labels that are often prohibitively expensive to obtain [5]. Unsupervised methods include distance-based methods, which mainly rely on a measure of distance or similarity to detect outliers. The formulation of outlier detection depends on various factors, such as the input data type and distribution, the availability of data, and the resource constraints introduced by the application domain. Detecting unexpected entries in databases ultimately uncovers errors, frauds, or valid but unexpected entries [6]. With so many applications, precise detection of outliers becomes a must. Many outlier detection methods have been proposed to date; we categorize and review some of the existing methods in the following sections. An overview of the techniques to be discussed is given in the figure below:

Fig 1: Modes of operation of outlier detection techniques

2. Related Work

The classification and recognition of individual characteristics and behaviors constitute a preliminary step and an important objective in the behavioral sciences. Current statistical methods do not always give satisfactory results [3]. To improve performance in this area, we present a methodology based on one of the principles of artificial neural networks: the back-propagation gradient. In classification tasks, a data set that conforms to a certain representation or classification model is considered. If one were to perturb a few data instances by making small changes to some of their attribute values, the original classification model representing the data set would change; likewise, if one were to remove those data instances, the original model could change significantly [7]. The magnitude of the changes to the original model provides clues to the criticality of such data instances, as more critical data instances tend to impact the model more significantly than comparatively noncritical ones. The hubness phenomenon has recently been observed in several application areas, including audio and image data.
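Hubness can be made concrete through the k-occurrence count Nk(x), the number of times a point x appears among the k nearest neighbors of the other points: hubs have unusually large counts, anti-hubs unusually small ones. The following is a minimal pure-Python sketch (the toy points and the choice of k are ours, not from the paper):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def k_occurrences(points, k):
    """For each point x, count N_k(x): how many times x appears
    in the k-nearest-neighbor lists of the other points."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # indices of all other points, ordered by distance to point i
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        for j in others[:k]:  # the k nearest neighbors of i
            counts[j] += 1
    return counts

# Toy data: a tight cluster plus one distant point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
nk = k_occurrences(pts, k=2)
print(nk)  # → [2, 3, 2, 3, 0]
```

The isolated point never appears in anyone's kNN list (Nk = 0), making it an anti-hub; in high dimensions such skew of the Nk distribution becomes pronounced.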


We briefly mention hubness in the context of graph construction for semi-supervised learning. There have also been attempts to avoid the influence of hub points in 1-NN time-series classification, apparently without clear awareness of the existence of the phenomenon (Islam et al., 2008), and to account for possible skewness of the distribution of N1 in reverse nearest-neighbor search [8], where Nk(x) denotes the number of times point x occurs among the k nearest neighbors of every other point in the data set. None of these papers, however, really analyzes the causes of hubness or generalizes it to other applications. The outlier detection algorithms proposed initially determine outliers once all data records (samples) are present in the dataset; we refer to these as static outlier detection algorithms. In contrast, incremental outlier detection techniques identify outliers as soon as a new data record appears in the dataset. Incremental outlier detection has also been used within the more general framework of activity monitoring [18]. In addition, [19] proposed broad requirements that incremental algorithms need to meet, and [21] used on-line discounting distributional learning of a Gaussian mixture model with scoring based on the estimated probability density function. [8] proposes an outlier ranking based on an object's deviation in a set of relevant subspace projections; it excludes irrelevant projections showing no clear difference between outliers and the relevant objects, and finds objects that deviate in multiple relevant subspaces. The study in [9] distinguishes three problems caused by the "curse of dimensionality" in the context of data mining, searching, and indexing applications: poor discrimination of distances caused by concentration, and the presence of irrelevant and redundant attributes, all of which impair the usability of traditional similarity and distance measures. A parameter-free outlier detection algorithm [10] computes the Ordered Distance Difference Outlier Factor: it formulates a new outlier score for each instance by considering the differences of ordered distances, and then uses this value to compute the outlier score.

3. Density Based approaches

Distance-based approaches are known to face the local density problem created by the varying degrees of cluster density that exist in a dataset. To solve this problem, density-based approaches have been proposed. Their basic idea is that the density around an outlier differs markedly from the density around its neighbors [14]: the density of an object's neighborhood is compared with that of its neighbors' neighborhoods, and if there is a significant difference between the densities, the object can be considered an outlier. Several outlier detection methods implementing this idea have been developed recently [11], estimating the density around an object in different ways. [15] developed the local outlier factor (LOF), which is among the most commonly used methods in outlier detection; LOF has inspired variants such as the local correlation integral (LOCI) [16], the local distance-based outlier factor (LDOF) [17], and local outlier probabilities (LoOP) [18]. Below we review some density-based outlier detection techniques. Many outlier methods have been proposed to date; the existing methods can be broadly classified as distribution (statistical)-based, clustering-based, density-based, and model-based approaches [13]. Statistical approaches [12] assume that the data follows some standard or predetermined distribution, and this type of approach aims to find the outliers which do not follow such


distributions. The methods in this category always assume that typical examples follow a particular data distribution. Nevertheless, we cannot always have this kind of a priori distribution information in practice, particularly for high-dimensional real data sets [13].

Fig -2: Classification of Outlier Detection

A. NEURAL NETWORK METHODS:
Neural network approaches are usually non-parametric and model-based; they suit hidden patterns well and are capable of learning large, complex class boundaries. The entire data set has to be traversed several times to allow the network to settle and model the data correctly. Neural networks are comparatively less susceptible to the curse of dimensionality than statistical methods. They come in two types: supervised and unsupervised neural methods [16]. Supervised neural networks use the classification of the data to drive the learning process; if this classification is unavailable, the network is known as an unsupervised neural network. Unsupervised neural networks contain nodes which compete to represent portions of the data set. As with perceptron-based neural networks, decision trees, or k-means, they require a training dataset to allow the network to learn. They autonomously cluster the input vectors through node placement, so that the underlying data distribution can be modeled and the normal/abnormal classes differentiated [18]. They assume that related vectors have common feature values and rely on identifying these features and their values to topologically model the data distribution. The neural network uses the class to adjust the weights and thresholds so that it can correctly classify the whole data set. These methods are also used to detect noise and novel data [19]. The neural network is thus a crucial methodology that plays an important role in outlier detection.

Fig. 3. Structure of a Neural Network

4. METHODOLOGY OF OUTLIER DETECTION ALGORITHMS

Clustering and outlier detection are among the major tasks on high-dimensional data. Clustering approaches are supported by outlier detection in new optimistic approaches; the thrust of the new optimistic approach is to apply a nearest-neighbor-based clustering method and detect outliers in high-dimensional data.
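The competitive node placement described above can be sketched in a few lines. This is a generic competitive-learning illustration, not the network used in the paper; the node count, learning rate, and toy data are our own assumptions:

```python
from math import dist

def competitive_fit(data, n_nodes=2, lr=0.2, epochs=20):
    """Competitive learning sketch: for each input vector, the closest
    ("winning") node is pulled toward it, so the nodes come to
    represent separate portions of the data set."""
    # simple deterministic init: spread the starting nodes over the data
    step = max(1, len(data) // n_nodes)
    nodes = [list(data[i * step]) for i in range(n_nodes)]
    for _ in range(epochs):
        for x in data:
            w = min(nodes, key=lambda node: dist(node, x))  # winner
            for d in range(len(w)):
                w[d] += lr * (x[d] - w[d])  # move the winner toward x
    return nodes

# Two well-separated groups; each node settles near one group's center.
data = [(0.0, 0.1), (0.1, 0.0), (9.9, 10.0), (10.0, 9.9)]
nodes = competitive_fit(data)
```

A new vector whose distance to its nearest node is large relative to the training data can then be flagged as abnormal, which is the normal/abnormal differentiation mentioned above.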


A. Local outlier factor (LOF):
LOF compares the local density of an instance with the densities of its neighborhood instances and then assigns an anomaly score to the instance. For a data instance to be considered normal rather than an outlier, its LOF score, the ratio of the average local density of the instance's k nearest neighbors to the local density of the instance itself, should be close to one. To find the local density of a data instance, we find the radius of a small hypersphere centered at the instance; the local density is computed by dividing k, the number of nearest neighbors, by the volume of this hypersphere. Each object is thus assigned a degree of being an outlier, known as its local outlier factor, and this degree determines how isolated the object is with respect to its surrounding neighborhood [20]. Instances lying in dense regions are normal if their local density is similar to that of their neighbors; instances are outliers if their local density is lower than that of their nearest neighbors. LOF is most reliable when used in a top-n manner, hence the name top-n LOF: the instances with the highest LOF values are considered outliers.

B. Local distance based outlier factor (LDOF):
The local distance-based outlier factor measures an object's outlierness in scattered datasets. It uses the relative location of an object with respect to its neighbors to determine the object's degree of deviation from its neighborhood instances; here a scattered neighborhood is considered. The higher the deviation degree a data instance has, the more likely it is an outlier [21]. The algorithm calculates the local distance-based outlier factor for each object, then sorts and ranks the objects by LDOF value; the first n objects with the highest LDOF values are considered outliers.
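As an illustration, both factors can be sketched in plain Python. This is a toy implementation for small datasets; the sample points and the choice of k are ours, and the paper's own implementation is in Java:

```python
from math import dist

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i]."""
    return sorted((j for j in range(len(points)) if j != i),
                  key=lambda j: dist(points[i], points[j]))[:k]

def lof_scores(points, k=2):
    """LOF: ratio of the average local reachability density (lrd) of a
    point's k nearest neighbors to the point's own lrd; scores well
    above 1 indicate outliers."""
    n = len(points)
    nb = [knn(points, i, k) for i in range(n)]
    # k-distance of a point = distance to its k-th nearest neighbor
    kdist = [dist(points[i], points[nb[i][-1]]) for i in range(n)]

    def reach(i, j):  # reachability distance from i to its neighbor j
        return max(kdist[j], dist(points[i], points[j]))

    lrd = [k / sum(reach(i, j) for j in nb[i]) for i in range(n)]
    return [sum(lrd[j] for j in nb[i]) / (k * lrd[i]) for i in range(n)]

def ldof(points, i, k=2):
    """LDOF: average distance to the k nearest neighbors divided by the
    average pairwise distance among those neighbors."""
    nb = knn(points, i, k)
    d_to_nb = sum(dist(points[i], points[j]) for j in nb) / k
    pairs = [dist(points[a], points[b]) for a in nb for b in nb if a < b]
    return d_to_nb * len(pairs) / sum(pairs)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]
scores = lof_scores(pts, k=2)
# the distant point (index 4) receives by far the largest LOF and LDOF
```

For the cluster points both scores stay near 1, while the isolated point scores several times higher, which is exactly the top-n ranking behavior described above.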


Fig. 4. Proposed System Architecture

5. NOVEL BOUNDARY BASED CLASSIFICATION APPROACH (NBBC)

The proposed novel boundary-based classification, including the imputation methods and ordinal classification methods, is explained in this section, together with a description of the WDBC dataset. WDBC dataset: the Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains various attributes, namely a diagnosis, an ID number, and real-valued features. There are ten real-valued features, namely radius, area, perimeter, smoothness, texture, compactness, concave points, concavity, symmetry, and fractal dimension, computed from a digitized image of a breast mass [8]. For each class, an additional classification representation is built, or trained, by first deriving a data set which is a partition of the original training data set; the new data set is derived by relabeling a subset of the original data records into two new classes.

A. The Greedy Algorithm

Overview: our greedy algorithm takes the number of desired outliers (say k) as input and selects points as outliers in a greedy manner. Initially the set of outliers (denoted OS) is empty and all points are marked as non-outliers; k scans over the dataset are needed to select k points as outliers [3]. In each scan, each point labeled as a non-outlier is temporarily removed from the dataset as an outlier, and the entropy objective is re-evaluated. The point that achieves the maximal entropy impact, i.e. the maximal decrease in entropy caused by removing it, is selected as the outlier of the current scan and added to OS. The algorithm terminates when the size of OS reaches k. In the initialization phase of the greedy algorithm, each record is labeled as a non-outlier, and hash tables for the attributes are constructed and updated (steps 01-04). In the greedy procedure we scan over the dataset, reading each record t that is labeled as a non-outlier; its label is changed to outlier and the changed entropy value is computed. A record that achieves the maximal entropy impact is selected as the outlier of the current


scan and added to the set of outliers (steps 05-13). In this algorithm the key step is computing the changed value of entropy. Using a hashing technique, we can determine the frequency of an attribute value in the corresponding hash table in O(1) expected time. Hence we can determine the decreased entropy value in O(m) expected time, since the changed values depend only on the attribute values of the record to be temporarily removed [17].
One of the simplest methods for showing that a greedy algorithm is correct is to use a "greedy stays ahead" argument. This style of proof works by showing that, according to some measure, the greedy algorithm is always at least as far ahead as the optimal solution during each iteration. Once this is established, the fact can be used to show that the greedy algorithm must be optimal.
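The scan-and-re-evaluate loop described above can be sketched as follows. This is a simplified illustration over categorical records; the helper names and toy data are ours, and Counter plays the role of the per-attribute hash tables:

```python
from collections import Counter
from math import log2

def total_entropy(freqs, n):
    """Sum of per-attribute entropies, given value-frequency tables."""
    h = 0.0
    for table in freqs:
        for c in table.values():
            p = c / n
            h -= p * log2(p)
    return h

def greedy_outliers(records, k):
    """Select k records whose removal causes the largest drop in the
    dataset's total attribute entropy (one record per scan)."""
    outliers, rows = [], list(records)
    # one Counter per attribute gives O(1) frequency lookups
    freqs = [Counter(r[a] for r in rows) for a in range(len(rows[0]))]
    for _ in range(k):
        best, best_h = None, None
        for r in rows:
            for a, v in enumerate(r):  # temporarily remove r
                freqs[a][v] -= 1
                if freqs[a][v] == 0:
                    del freqs[a][v]
            h = total_entropy(freqs, len(rows) - 1)  # re-evaluate
            if best_h is None or h < best_h:
                best, best_h = r, h
            for a, v in enumerate(r):  # put r back
                freqs[a][v] += 1
        rows.remove(best)  # remove the chosen outlier for real
        for a, v in enumerate(best):
            freqs[a][v] -= 1
            if freqs[a][v] == 0:
                del freqs[a][v]
        outliers.append(best)
    return outliers

data = [("a", "x"), ("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
print(greedy_outliers(data, 1))  # → [('b', 'y')]
```

Removing the rare record leaves a perfectly uniform dataset (entropy zero), so it is the greedy pick; each candidate's re-evaluation touches only that record's own attribute values, matching the O(m) bound above.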

Fig No 5 Greedy Algorithm Example

For the comparable fractional problem, however, the greedy strategy, which takes item 1 first, does yield an optimal solution. Taking item 1 does not work in the 0-1 problem because the thief is unable to fill his knapsack to capacity, and the empty space lowers the effective value per pound of his load. In the 0-1 problem, when we consider an item for inclusion in the knapsack, we must compare the solution to the subproblem in which the item is included with the solution to the subproblem in which the item is excluded before we can make the choice. The problem formulated in this way gives rise to many overlapping subproblems, a hallmark of dynamic programming.

6. EXPERIMENTAL SETUP AND RESULTS

System requirements: our algorithms were run on a high-dimensional dataset, the Cover Type dataset from the UCI Machine Learning Repository. The experiments were performed on an Intel Core i5 CPU at 2.53 GHz with 4 GB of RAM, running Windows 7. The algorithms were implemented in Java to process data instances in high-dimensional data. Results: Figure 6 shows the data insertion time


required in five datasets, and Figure 7 shows a comparative study of the outlier detection rate of the existing and proposed algorithms.

Fig 6: Data insertion time required

Figure 7: Outlier detected

We studied outlier-detection methods and the hubness phenomenon, extending previous examinations of (anti)hubness to large values of k and exploring the relationship between hubness and data sparsity. Based on this analysis, we formulated the IQR, Greedy, and AntiHub methods for semi-supervised and unsupervised outlier detection, discussed their properties, and proposed a derived method which improves speed and accuracy, reducing the false positive and false negative rates and improving the efficiency of density-based outlier detection.

7. CONCLUSIONS

Outlier detection is very important and has applications in a wide variety of fields, so it becomes important to learn how to detect outliers. The main objective of this paper is to review various outlier detection techniques and to study how they are categorized. We can conclude that the methods used for outlier detection are application specific. The training and testing algorithms are used for training and testing the class; restricting the search to near the class boundaries saves computation time in identifying such nuggets. Results from the evaluation on the real-world WDBC data sets revealed that the proposed approach achieves better performance than the existing classification algorithms. We proposed a derived method which improves speed and accuracy, reducing the false positive and false negative rates and improving the efficiency of density-based outlier detection. Future implementations lie in machine learning techniques such as supervised and semi-supervised methods.

8. FUTURE WORK

Future work on deleting data records from the database is needed. More specifically, it would be interesting to design an algorithm with exponential decay of weights, where the most recent data records have the highest influence on the local density estimation. In addition, an extension of the proposed methodology to create incremental versions of other emerging outlier detection algorithms, such as the Connectivity-based Outlier Factor (COF), is also worth considering. Additional real-life data sets will be used to evaluate the proposed algorithm, and ROC curves will be applied to quantify the algorithm's performance.


9. REFERENCES

[1] Jun Wang, "A Knowledge Network Constructed by Integrating Classification, Thesaurus, and Metadata in Digital Library," Intl. Inform. & Libr. Rev., vol. 35, issue 3, 2003, pp. 383-397.

[2] K. Ord, "Outliers in statistical data: V. Barnett and T. Lewis, 1994, 3rd edition (John Wiley & Sons, Chichester), 584 pp., ISBN 0-471-93094-6," International Journal of Forecasting, 12(1):175-176, 1996.

[3] S. Chawla, D. Hand, and V. Dhar, "Outlier detection special issue," Data Min. Knowl. Discov., 20(2):189-190, 2010.

[4] Nilam Upasani and Hari Om, "Evolving fuzzy min-max neural network for outlier detection," in International Conference on Advanced Computing Technologies and Applications (ICACTA-2015), Elsevier.

[5] N., Zadrozny, B., and Langford, J., "Outlier detection by active learning," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 2006, pp. 504-509.

[6] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: a survey," ACM Comput. Surv., 41 (2009), 15.

[7] Milos Radovanovic, Alexandros Nanopoulos, and Mirjana Ivanovic, "Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection," IEEE Transactions on Knowledge and Data Engineering, 2014.

[8] Karanjit Singh and Shuchita Upadhyaya, "Outlier Detection: Applications and Techniques," IJCSI International Journal of Computer Science Issues, vol. 9, issue 1, no. 3, January 2012.

[9] Dasgupta, D. and Majumdar, N., "Outlier detection in multidimensional data using negative selection algorithm," in Proceedings of the IEEE Conference on Evolutionary Computation, Hawaii, 2002, pp. 1039-1044.

[10] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in Proc. 7th Int. Conf. on Database Theory (ICDT), 1999, pp. 217-235.

[11] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, p. 15, 2009.

[12] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," SIGMOD Rec., vol. 29, no. 2, pp. 93-104, 2000.

[13] K. Zhang, M. Hutter, and H. Jin, "A new local distance-based outlier detection approach for scattered real-world data," in Proc. 13th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), 2009, pp. 813-822.

[14] W. Jin, A. K. H. Tung, J. Han, and W. Wang, "Ranking outliers using symmetric neighborhood relationship," in Proc. 10th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD), 2006, pp. 577-593.


[15] C. Lijun, L. Xiyin, Z. Tiejun, Z. Zhongping, and L. Aiyong, "A data stream outlier detection algorithm based on reverse k nearest neighbors," pp. 236-239, 2010.

[16] Shu-Ching Chen, Mei-Ling Shyu, Chengcui Zhang, and Rangasami L. Kashyap, "Video Scene Change Detection Method Using Unsupervised Segmentation and Object Tracking," in Proc. ICME, 2001.

[17] Y. Tao, D. Papadias, and X. Lian, "Reverse kNN search in arbitrary dimensionality," in Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada, September 2004.

[18] Amit Singh, Hakan Ferhatosmanoglu, and Ali Tosun, "High Dimensional Reverse Nearest Neighbor Queries," in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM'03), New Orleans, LA, November 2003.

[19] Barnett, V. and Lewis, T., Outliers in Statistical Data, 3rd edition, John Wiley & Sons, 1994.

[20] Huber, P., Robust Statistics, Wiley, New York, 1974.

[21] Grubbs, F. E., "Procedures for detecting outlying observations in samples," Technometrics, 11, 1969.

Authors Profile:

N V S K Vijaya Lakshmi K is working as an Assistant Professor in the Dept. of IT, Sir C R Reddy College of Engineering, Eluru, Andhra Pradesh. She has 5 years of teaching experience.

David Raju Kuppala is working as an Assistant Professor in the Dept. of CSE, K L University, Vaddeswaram, Guntur, Andhra Pradesh. He has 3 years of teaching experience.