Outlier Detection using Isolation Forest and Local Outlier Factor

Zhangyu Cheng
School of Computer Science and Technology, Wuhan University of Technology
Wuhan, China
[email protected]

Chengming Zou∗
Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology
Wuhan, China
[email protected]

Jianwei Dong
Information Center, People’s Hospital of Ningxia Hui Autonomous Region
Ningxia, China
[email protected]
ABSTRACT
Outlier detection, also known as anomaly detection, is one of the hot issues in the field of data mining. As well-known outlier detection algorithms, Isolation Forest (iForest) and Local Outlier Factor (LOF) have been widely used. However, iForest is only sensitive to global outliers and is weak in dealing with local outliers. Although LOF performs well in local outlier detection, it has high time complexity. To overcome the weaknesses of iForest and LOF, a two-layer progressive ensemble method for outlier detection is proposed. It can accurately detect outliers in complex datasets with low time complexity. The method first utilizes the low-complexity iForest to quickly scan the dataset, prunes the apparently normal data, and generates an outlier candidate set. To further improve the pruning accuracy, the outlier coefficient is introduced to design a pruning threshold setting method based on the outlier degree of the data. Then LOF is applied to further distinguish the outlier candidate set and obtain more accurate outliers. The proposed ensemble method takes advantage of both algorithms and concentrates valuable computing resources on the key stage. Finally, a large number of experiments are carried out to verify the ensemble method. The results show that, compared with existing methods, the ensemble method can significantly improve the outlier detection rate and greatly reduce the time complexity.

CCS CONCEPTS
• Computer systems organization → Security and privacy; Intrusion; anomaly detection and malware mitigation;

KEYWORDS
Outlier detection (OD), isolation forest, local outlier factor, ensemble method

ACM Reference Format:
Zhangyu Cheng, Chengming Zou, and Jianwei Dong. 2019. Outlier Detection using Isolation Forest and Local Outlier Factor. In Proceedings of International Conference on Research in Adaptive and Convergent Systems, Chongqing, China, September 24–27, 2019 (RACS ’19), 8 pages. https://doi.org/10.1145/3338840.3355641

∗Corresponding author.

1 INTRODUCTION
Outlier detection is the identification of objects, events or observations which do not conform to an expected pattern or to other items in a dataset. As one of the important tasks of data mining, outlier detection is widely used in network intrusion detection, medical diagnosis, industrial system fault detection, flood prediction and intelligent transportation systems [7].

Many existing outlier detection methods fall into the following categories: distribution-based, distance-based, density-based, and clustering-based methods. Specifically, the distribution-based method [1] needs the distribution model of the data under test in advance; it depends on the global distribution of the dataset and is not applicable to datasets with uneven distribution. The distance-based approach [13] requires users to select reasonable distance and scale parameters, and it is less efficient on high-dimensional datasets. In the clustering method [18], outliers are a by-product rather than the target of clustering, so the abnormal points cannot be accurately analyzed. The above outlier detection methods all adopt global anomaly standards to process data objects, which perform poorly on datasets with uneven distribution. In practical applications, the distribution of data tends to be skewed, and there is a lack of indicators that can classify the data. Even if labeled datasets are available, their applicability to outlier detection tasks is often unknown.

The density-based local outlier detection method can effectively solve the above problems by quantifying the degree to which a data point is an outlier through its local density. Local Outlier Factor [2] calculates a relative density measure of each data point with respect to its surrounding points, called the lof value, which describes the degree to which the point is an outlier. Since this method needs to calculate the lof value of all data points, the computational cost is very high, which makes it difficult to apply to outlier detection on large-scale data. Actually, it is not necessary to calculate the
lof value of all data points, since there are only few outliers in the dataset.

To address these problems, the contributions of this paper are as follows:

1) A two-layer progressive ensemble method for outlier detection is proposed to overcome the weaknesses of iForest and LOF.
2) The outlier coefficient is introduced and a filtering threshold setting method based on the outlier degree of the data is designed. Together they ensure the effectiveness of the pruning strategy.
3) Experiments on real-world and synthetic datasets demonstrate that our ensemble method outperforms other methods in outlier detection rate while greatly reducing the time complexity.

The remainder of the paper is organized as follows: Section 2 introduces the related work on outlier detection. Section 3 details the outlier detection algorithm. Section 4 discusses the datasets, the metrics for performance evaluation and the experimental results compared with other methods, and Section 5 concludes the paper.

2 RELATED WORKS
Recently, outlier detection in the field of data mining has been introduced to help detect unknown anomalous behavior or potential attacks. Shalmoli Gupta et al. [6] proposed a K-means clustering algorithm based on local search: if swapping a non-center for a current center improves the objective, the local step is taken. Xian Teng et al. [16] proposed a unified outlier detection framework that not only warns of current system anomalies, but also provides local outlier structure information in the context of space and time. Liu Z et al. [11] proposed an integrated approach to detect anomalies in large-scale system logs: the K-prototype is used to obtain clusters and filter out obviously normal events, and k-NN is used to classify the accurate anomalies. Raihan Ul Islam et al. [8] proposed a new belief rule-based association rule (BRBAR) that can resolve uncertainties associated with sensor data.

The local outlier factor is a popular density-based algorithm. Due to its high time complexity, LOF is not suitable for large-scale high-dimensional datasets. Therefore, Jialing Tang and Henry Y.T. Ngan [15] proposed a density-based bounded LOF method (BLOF), which uses LOF to detect anomalies in a dataset after principal component analysis (PCA). Yizhou Yan et al. [19] proposed a local outlier detection algorithm based on LOF upper bound pruning (Top-n LOF, TOLF) that quickly prunes most data points from the dataset, which greatly improves the detection efficiency. To improve the accuracy of LOF, the spectral angle local outlier factor (SALOF) algorithm was applied by Bing Tu et al. [17] to improve the accuracy of supervised classification.

In recent years, the iForest proposed by Liu FT et al. [10] has attracted attention from industry and academia due to its low time complexity and high accuracy. Guillaume Staerman et al. [14] used Isolation Forest to detect anomalies in functional data: by randomly dividing the functional space, they address the problems that the functional space is equipped with different topologies and that anomalous curves are characterized by different modes. Liefa Liao and Bin Luo [9] introduced dimension entropy as the basis for selecting isolation attributes and isolation points in the training process, called E-iForest.

3 PROPOSED ALGORITHM
3.1 Workflow of The Proposed Method
Inspired by the related work, we prune the dataset instead of using the original dataset as the data source, which can greatly reduce the amount of data that needs to be processed. To solve the problem that existing outlier detection algorithms are sensitive to global outlier points and have high time complexity, an integrated method based on iForest and LOF is proposed, and a mining-pruning-detection framework is applied to improve the detection accuracy and efficiency. Firstly, iForest is used to calculate the anomaly score of each data point in the forest. Then, the apparently normal data are pruned to obtain the outlier candidate set. Finally, LOF is applied to calculate the lof values of the data objects in the set to further distinguish the outlier candidate set.

Fig.1 shows the overall workflow of the method, which mainly includes the following three steps (a code sketch follows the list):

1) iForest: Based on the raw dataset, iForest is applied to construct an isolation forest. Then calculate the average path length of each data point by traversing each tree in the forest, and obtain the anomaly score.
2) Pruning: Prune off some normal data points according to the pruning threshold to obtain the outlier candidate set.
3) LOF: Calculate the lof value of each data point in the outlier candidate set and select the first n points with the highest lof values as the target outliers.

Figure 1: Workflow of the proposed method.
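As a concrete illustration of this workflow, the following is a minimal Python sketch that chains the three steps, using scikit-learn's IsolationForest and LocalOutlierFactor as stand-ins for the paper's own implementations. The function name detect_outliers, the parameter defaults, and the choice to compute LOF neighborhoods within the candidate set are our assumptions, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def detect_outliers(X: np.ndarray, theta: float, n: int, k: int = 20) -> np.ndarray:
    """IF-LOF sketch: mine candidates with iForest, prune, then refine with LOF."""
    # Step 1 (iForest): score_samples returns negated anomaly scores
    # (lower = more anomalous), so flip the sign.
    forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
    scores = -forest.score_samples(X)               # higher = more anomalous

    # Step 2 (Pruning): keep the theta fraction of points with the
    # highest anomaly scores as the outlier candidate set.
    n_keep = max(n, int(np.ceil(theta * len(X))))
    candidates = np.argsort(scores)[::-1][:n_keep]

    # Step 3 (LOF): compute lof values on the candidate set only and
    # return the n candidates with the highest lof values.
    lof = LocalOutlierFactor(n_neighbors=min(k, n_keep - 1))
    lof.fit(X[candidates])
    lof_values = -lof.negative_outlier_factor_      # plain lof values
    return candidates[np.argsort(lof_values)[::-1][:n]]
```

Pruning before LOF is what delivers the speedup: LOF's expensive neighborhood queries run only over the candidate set instead of over the full dataset.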
3.2 Isolation Forest: Outlier Candidate Mining
The Isolation Forest (iForest) is applied to initially process the dataset, aiming at mining outlier candidates. It is an ensemble-based unsupervised outlier detection method with linear time complexity and high precision. The forest consists of a group of binary trees constructed by recursively splitting on randomly selected attributes of the dataset. Then, each tree in the forest is traversed to calculate the anomaly score of each data point. The isolation tree's construction algorithm is defined as the iTree(X, e, h) function. Here, X represents the input dataset, e represents the current tree height, and h represents the height limit. The steps of the iForest's construction algorithm are as follows:
Algorithm 1 iForest(X, t, s)
Input: X - input dataset, t - number of trees, s - subsampling size
Output: a set of t iTrees
1: Initialize Forest
2: set height limit l = ceiling(log2(s))
3: for i = 1 to t do
4:   X' <- sample(X, s)
5:   Forest <- Forest ∪ iTree(X', 0, l)
6: end for
7: return Forest
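Algorithm 1 only builds the forest; the anomaly score used in Step 1 of Section 3.1 is the standard one defined by Liu FT et al. [10], restated here for reference:

\[ s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649, \]

where h(x) is the path length of point x in an iTree, E(h(x)) is its average over all trees in the forest, n is the subsampling size s, and c(n) normalizes by the average path length of an unsuccessful binary search tree query. Scores close to 1 indicate likely outliers, while scores well below 0.5 indicate normal points.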
3.3 Pruning: Outlier Candidate Selection
The purpose of the pruning strategy is to prune out the apparently normal data points while preserving the outlier candidate set for further processing. Existing algorithms cannot accurately set a threshold to determine whether a certain point should be put into the candidate set, because the proportion of outliers is unknown. According to practical experience, outliers generally increase the dispersion of a dataset. Therefore, this paper defines the outlier coefficient to measure the degree of dispersion of the dataset, and obtains the pruning threshold by calculation.

Specify a dataset D = {d_1, d_2, ..., d_n}. Here, n is the sample number of D, d_i is an attribute in D, and d_i = {x_1, x_2, ..., x_n}, where x_j is a certain data value of the attribute d_i. The outlier coefficient of the attribute is defined as:

\[ f(d_i) = \frac{\sqrt{\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2}}{\bar{x}} = \sqrt{\frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n\bar{x}^2}} \tag{1} \]

Here, x̄ is the mean of the attribute d_i, and f(d_i) is used to measure the degree of dispersion of the attribute d_i. Calculate the outlier coefficient of each attribute in the dataset to get the outlier coefficient vector D_f of the dataset, which is recorded as:

\[ D_f = (f(d_1), f(d_2), \ldots, f(d_n)) \tag{2} \]

Through the outlier coefficient vector, the pollution amount of the dataset, that is, the trim threshold θ_D, can be calculated. In the following, θ_D represents the proportion of outliers in the dataset, Top_m(D_f) refers to the m largest dispersion coefficients after sorting D_f, and α is an adjustment factor. α and m depend on a comprehensive consideration of the size and distribution of the dataset:

\[ \theta_D = \frac{\alpha \sum \mathrm{Top}_m(D_f)}{m} \tag{3} \]

Therefore, we set a different threshold for the different characteristics of each dataset. According to the anomaly score of each point calculated by iForest, the 1 − θ_D fraction of the dataset's points is pruned, and the remaining data points constitute the outlier candidate set; a sketch of this computation follows.
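Below is a minimal NumPy sketch of Eqs. (1)–(3) and the pruning step, under our reading that θ_D is the mean of the m largest attribute coefficients scaled by α; the function names and the np.abs guard on the mean are illustrative additions, not the authors' code.

```python
import numpy as np

def pruning_threshold(X: np.ndarray, m: int = 3, alpha: float = 1.0) -> float:
    """Estimate the contamination proportion theta_D via Eqs. (1)-(3)."""
    mean = X.mean(axis=0)                       # per-attribute mean x_bar
    # Eq. (1): outlier coefficient of each attribute, i.e. the coefficient
    # of variation std/mean (abs guards against negative attribute means).
    f = X.std(axis=0) / np.abs(mean)
    top_m = np.sort(f)[-m:]                     # the m largest coefficients
    return alpha * top_m.sum() / m              # Eq. (3)

def prune(scores: np.ndarray, theta: float) -> np.ndarray:
    """Keep the theta fraction of points with the highest anomaly scores."""
    n_keep = int(np.ceil(theta * len(scores)))
    return np.argsort(scores)[::-1][:n_keep]    # candidate-set indices
```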
3.4 LOF: Accurate Outlier Detection
Local Outlier Factor (LOF) is a density-based outlier detection algorithm that finds outliers by calculating the local deviation of a given data point, which makes it suitable for outlier detection on unevenly distributed datasets. Whether a point is an outlier is judged based on the density between each data point and its neighbor points: the lower the density of the point, the more likely it is to be identified as an outlier. Some settings of LOF are as follows:

Definition 1. d(p, q): the distance from point p to point q.

Definition 2. k-distance: sort the distances from point p to the other data points; the distance from point p to the k-th nearest data point is recorded as k-dist(p).

Definition 3. k nearest neighbors: the set of data points whose distance to point p is not greater than k-dist(p), recorded as N_k(p).

Definition 4. reachability distance:

\[ \mathit{reach\text{-}dist}_k(p, r) = \max\{k\text{-}\mathit{dist}(r),\; d(p, r)\} \tag{4} \]

Definition 5. local reachability density (lrd): the reciprocal of the mean reachability distance between the data point p and its k nearest neighbors, defined as:

\[ \mathit{lrd}(p) = 1 \Big/ \frac{\sum_{r \in N_k(p)} \mathit{reach\text{-}dist}_k(p, r)}{|N_k(p)|} \tag{5} \]

Definition 6. local outlier factor (lof): the average of the ratio of the local reachability density of the points in p's neighborhood to the local reachability density of point p, defined as:

\[ \mathit{lof}(p) = \frac{\sum_{t \in N_k(p)} \mathit{lrd}(t)/\mathit{lrd}(p)}{|N_k(p)|} \tag{6} \]

The steps of the Local Outlier Factor algorithm are as follows.
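The algorithm listing itself is not reproduced here; as a stand-in, the following is a minimal brute-force NumPy sketch that follows Definitions 1–6 and Eqs. (4)–(6) directly (our illustration: O(n²) memory, no spatial indexing, ties in Definition 3 ignored).

```python
import numpy as np

def lof_scores(X: np.ndarray, k: int) -> np.ndarray:
    """Brute-force LOF values per Definitions 1-6; illustration only."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Def. 1: d(p, q)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]          # Def. 3: N_k(p)
    k_dist = d[np.arange(n), knn[:, -1]]        # Def. 2: k-dist(p)
    # Eq. (4): reach-dist_k(p, r) = max{k-dist(r), d(p, r)} for r in N_k(p)
    reach = np.maximum(k_dist[knn], np.take_along_axis(d, knn, axis=1))
    lrd = 1.0 / reach.mean(axis=1)              # Eq. (5): local reachability density
    return lrd[knn].mean(axis=1) / lrd          # Eq. (6): lof(p)
```

In the proposed method this computation is applied only to the outlier candidate set, and the n points with the highest lof values are reported as outliers.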
4 EXPERIMENTS
In this section, we empirically evaluate the effectiveness of the proposed method on both synthetic and real-world datasets. Specifically, the experimental results are analyzed from three aspects: pruning efficiency, accuracy metric and time cost. For comparison, we also implemented iForest (IF) [10], traditional LOF [2], KMeans-LOF (K-LOF) [12], and R1SVM [5], and compare them with the proposed algorithm, iForest-LOF (IF-LOF).
Figure 2: Illustration of TP & FP & TN & FN.

Figure 3: The confusion matrix set of synthetic datasets.

Figure 5: The Accuracy & F-Measure of synthetic datasets and real-world datasets.
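Figures 2, 3 and 5 report results in terms of TP, FP, TN, FN counts, Accuracy and F-Measure. For reference, the standard definitions (our restatement, treating outliers as the positive class) are:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{F-Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \]

\[ \text{with} \quad \mathrm{Precision} = \frac{TP}{TP + FP} \quad \text{and} \quad \mathrm{Recall} = \frac{TP}{TP + FN}. \]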
4.2.3 Time Cost. The time cost refers to the time it takes to
perform outlier detection on a standard hardware/software
system, including the time of data preprocessing and the
computation time of the detection.
As shown in Fig.6, K-LOF is the least efficient at every dataset scale. Even a slight increase in the amount of data causes the efficiency of LOF to drop sharply. R1SVM is the most efficient on small-scale datasets, but as the dataset grows larger, its advantage over IF and IF-LOF gradually shrinks. Both IF and IF-LOF demonstrate their efficiency on large-scale datasets. Although the time cost of IF-LOF is slightly higher than that of IF, the two remain very close in actual computation time.
Fig.7 shows the pruning time and detection computation time of IF-LOF, IF, LOF, K-LOF, and R1SVM on six real-world datasets. In Fig.7, R1SVM is the most efficient except on the Skin dataset, but its accuracy is too low. Although IF-LOF is not as efficient as IF, their processing times are very close, and on some datasets IF-LOF is far more efficient than LOF.

Figure 7: Evaluation of time cost with real-world datasets.
In summary, the above experimental results show that the integrated method IF-LOF performs better than IF, LOF, K-LOF and R1SVM. It utilizes different algorithms to achieve a good balance between accuracy and computational complexity, resulting in better outlier detection with lower time complexity.

5 CONCLUSIONS
This paper proposes an integrated algorithm of iForest and LOF to perform outlier detection on multiple datasets. Firstly, iForest is used to construct binary trees to form a forest, and the anomaly score of each data point in the forest is calculated. Secondly, according to the pruning strategy, the apparently normal samples are filtered out to obtain the outlier candidate set. Finally, the data objects in the set that correspond to the top lof values are determined to be the outliers. This method reduces the time complexity of LOF by avoiding calculating the lof value of all data objects in the raw dataset, and it overcomes the weakness of iForest in dealing with local outliers. To verify the effect of the proposed integrated algorithm, we conduct comparative experiments on six synthetic datasets and six real-world datasets, and evaluate the outlier detection algorithm from three aspects: pruning efficiency, accuracy metric and time cost. The experimental results confirm the accuracy and effectiveness of the proposed integrated algorithm, which is superior to IF, LOF, K-LOF and R1SVM.

ACKNOWLEDGMENTS

REFERENCES
[1] Jorge Edmundo Alpuche Aviles, Maria Isabel Cordero Marcos,
David Sasaki, Keith Sutherland, Bill Kane, and Esa Kuusela. 2018.
Creation of knowledge-based planning models intended for large
scale distribution: Minimizing the effect of outlier plans. Journal
of applied clinical medical physics 19, 3 (2018), 215–226.
[2] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg
Sander. 2000. LOF: identifying density-based local outliers. In
ACM SIGMOD Record, Vol. 29. ACM, 93–104.
[3] D Dua and E Karra Taniskidou. 2017. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2017).
[4] Jakub Dvořák and Petr Savický. 2007. Softening splits in decision
trees using simulated annealing. In International Conference on
Adaptive and Natural Computing Algorithms. Springer, 721–729.
[5] Sarah Erfani, Mahsa Baktashmotlagh, Sutharshan Rajasegarar, Shanika Karunasekera, and Chris Leckie. 2015. R1SVM: A randomised nonlinear approach to large-scale anomaly detection. In Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).
[6] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and
Sergei Vassilvitskii. 2017. Local search methods for k-means with
outliers. Proceedings of the VLDB Endowment 10, 7 (2017),
757–768.
[7] Riyaz Ahamed Ariyaluran Habeeb, Fariza Nasaruddin, Abdullah
Gani, Ibrahim Abaker Targio Hashem, Ejaz Ahmed, and Muham-
mad Imran. 2018. Real-time big data processing for anomaly
detection: a survey. International Journal of Information Man-
agement (2018).
[8] Raihan Ul Islam, Mohammad Shahadat Hossain, and Karl Ander-
sson. 2018. A novel anomaly detection algorithm for sensor data
under uncertainty. Soft Computing 22, 5 (2018), 1623–1639.
[9] Liefa Liao and Bin Luo. 2018. Entropy Isolation Forest Based on
Dimension Entropy for Anomaly Detection. In International Sym-
posium on Intelligence Computation and Applications. Springer,
365–376.
[10] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2012. Isolation-
based anomaly detection. ACM Transactions on Knowledge
Discovery from Data (TKDD) 6, 1 (2012), 3.
[11] Zhaoli Liu, Tao Qin, Xiaohong Guan, Hezhi Jiang, and Chenxu
Wang. 2018. An integrated method for anomaly detection from
massive system logs. IEEE Access 6 (2018), 30602–30611.
[12] Khaled Ali Othman, Md Nasir Sulaiman, Norwati Mustapha, and
Nurfadhlina Mohd Sharef. 2017. Local Outlier Factor in Rough
K-Means Clustering. PERTANIKA JOURNAL OF SCIENCE
AND TECHNOLOGY 25 (2017), 211–222.
[13] Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. 2018.
Learning representations of ultrahigh-dimensional data for random
distance-based outlier detection. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery &
Data Mining. ACM, 2041–2050.
[14] Guillaume Staerman, Pavlo Mozharovskyi, Stephan Clémençon,
and Florence d’Alché Buc. 2019. Functional Isolation Forest.
arXiv preprint arXiv:1904.04573 (2019).
[15] Jialing Tang and Henry YT Ngan. 2016. Traffic outlier detection
by density-based bounded local outlier factors. Information
Technology in Industry 4, 1 (2016), 6.
[16] Xian Teng, Muheng Yan, Ali Mert Ertugrul, and Yu-Ru Lin.
2018. Deep into Hypersphere: Robust and Unsupervised Anomaly
Discovery in Dynamic Networks. In IJCAI. 2724–2730.
[17] Bing Tu, Chengle Zhou, Wenlan Kuang, Longyuan Guo, and Xianfeng Ou. 2018. Hyperspectral imagery noisy label detection by spectral angle local outlier factor. IEEE Geoscience and Remote Sensing Letters 15, 9 (2018), 1417–1421.
[18] Prabha Verma, Prashant Singh, and RDS Yadava. 2017. Fuzzy c-means clustering based outlier detection for SAW electronic nose. In 2017 2nd International Conference for Convergence in Technology (I2CT). IEEE, 513–519.
[19] Yizhou Yan, Lei Cao, and Elke A Rundensteiner. 2017. Scalable top-n local outlier detection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.