Outlier Detection using Isolation Forest and Local Outlier Factor

Zhangyu Cheng
School of Computer Science and Technology, Wuhan University of Technology
Wuhan, China
[email protected]

Chengming Zou∗
Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology
Wuhan, China
[email protected]

Jianwei Dong
Information Center, People's Hospital of Ningxia Hui Autonomous Region
Ningxia, China
[email protected]
ABSTRACT
Outlier detection, also known as anomaly detection, is one of the hot topics in the field of data mining. As well-known outlier detection algorithms, Isolation Forest (iForest) and Local Outlier Factor (LOF) have been widely used. However, iForest is only sensitive to global outliers and is weak in dealing with local outliers, while LOF performs well in local outlier detection but has high time complexity. To overcome the weaknesses of iForest and LOF, a two-layer progressive ensemble method for outlier detection is proposed. It can accurately detect outliers in complex datasets with low time complexity. The method first utilizes iForest, which has low complexity, to quickly scan the dataset, prune the apparently normal data, and generate an outlier candidate set. To further improve the pruning accuracy, the outlier coefficient is introduced to design a pruning threshold setting method based on the outlier degree of the data. LOF is then applied to further distinguish the outlier candidate set and obtain more accurate outliers. The proposed ensemble method takes advantage of both algorithms and concentrates valuable computing resources on the key stage. Finally, a large number of experiments are carried out to verify the ensemble method. The results show that, compared with existing methods, the ensemble method can significantly improve the outlier detection rate and greatly reduce the time complexity.

CCS CONCEPTS
• Computer systems organization → Security and privacy; Intrusion/anomaly detection and malware mitigation.

KEYWORDS
Outlier detection (OD), isolation forest, local outlier factor, ensemble method

ACM Reference Format:
Zhangyu Cheng, Chengming Zou, and Jianwei Dong. 2019. Outlier Detection using Isolation Forest and Local Outlier Factor. In Proceedings of the International Conference on Research in Adaptive and Convergent Systems (RACS '19), September 24-27, 2019, Chongqing, China. ACM, 8 pages. https://doi.org/10.1145/3338840.3355641

∗ Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
RACS '19, September 24-27, 2019, Chongqing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6843-8/19/09...$15.00
https://doi.org/10.1145/3338840.3355641

1 INTRODUCTION
Outlier detection is the identification of objects, events or observations which do not conform to an expected pattern or to other items in a dataset. As one of the important tasks of data mining, outlier detection is widely used in network intrusion detection, medical diagnosis, industrial system fault detection, flood prediction and intelligent transportation systems[7].
Many existing outlier detection methods fall into the following categories: distribution-based methods, distance-based methods, density-based methods, and clustering-based methods. Specifically, the distribution-based[1] methods need to obtain the distribution model of the data to be tested in advance; they depend on the global distribution of the dataset and are not applicable to datasets with uneven distribution. The distance-based[13] approaches require users to select reasonable distance and scale parameters and are less efficient on high-dimensional datasets. In the clustering-based methods[18], outliers are only a by-product of the clustering result, so the abnormal points cannot be accurately analyzed. These methods all adopt a global anomaly standard to process data objects, which does not perform well on datasets with uneven distribution. In practical applications, the distribution of data tends to be skewed, and there is a lack of indicators that can classify the data. Even if tagged datasets are available, their applicability to outlier detection tasks is often unknown.
The density-based local outlier detection method can effectively solve the above problems by quantifying the degree of outlierness of data points through local density. Local Outlier Factor[2] calculates the relative density of each data point with respect to its surrounding points, called the lof value, which is used to describe the degree of outlierness in the data. Since this method needs to calculate the lof value of all data points, its computational cost is very high, which makes it difficult to apply to outlier detection on large-scale data. Actually, it is not necessary to calculate the lof value of all data points, since there are only a few outliers in the dataset.
To address these problems, the contributions of this paper are as follows:

1) A two-layer progressive ensemble method for outlier detection is proposed to overcome the weaknesses of iForest and LOF.
2) The outlier coefficient is introduced and a filtering threshold setting method based on the outlier degree of the data is designed. Together they ensure the effectiveness of the pruning strategy.
3) Experiments on real-world and synthetic datasets demonstrate that our ensemble method outperforms other methods in outlier detection rate while greatly reducing the time complexity.

The remainder of the paper is organized as follows: Section 2 introduces the related work on outlier detection. Section 3 details the outlier detection algorithm. Section 4 discusses the datasets, the metrics for performance evaluation, and the experimental results compared with other methods, and Section 5 concludes the paper.

2 RELATED WORKS
Recently, outlier detection in the field of data mining has been introduced to help detect unknown anomalous behavior or potential attacks. Shalmoli Gupta et al.[6] proposed a K-means clustering algorithm based on local search: if exchanging a non-center point with a current center improves the objective, the local step is made. Xian Teng et al.[16] proposed a unified outlier detection framework that not only warns of current system anomalies, but also provides local outlier structure information in the context of space and time. Liu et al.[11] proposed an integrated approach to detect anomalies in large-scale system logs: K-prototype clustering is used to obtain clusters and filter out obviously normal events, and k-NN is used to identify the accurate anomalies. Raihan Ul Islam et al.[8] proposed a new belief-rule-based association rule (BRBAR) that can resolve uncertainties associated with sensor data.
The local outlier factor is a popular density-based algorithm. Due to its high time complexity, LOF is not suitable for large-scale, high-dimensional datasets. Therefore, Jialing Tang and Henry Y.T. Ngan[15] proposed a density-based bounded LOF method (BLOF), which uses LOF to detect anomalies in a dataset after principal component analysis (PCA). Yizhou Yan et al.[19] proposed a local outlier detection algorithm based on LOF upper-bound pruning (Top-n LOF, TOLF) to quickly prune most data points from the dataset, which greatly improves the detection efficiency. To improve the accuracy of LOF, the spectral angle local outlier factor (SALOF) algorithm was applied by Bing Tu et al.[17] to improve the accuracy of supervised classification.
In recent years, the iForest proposed by Liu et al.[10] has attracted attention from industry and academia due to its low time complexity and high accuracy. Guillaume Staerman et al.[14] used isolation forests to detect anomalies in functional data: by randomly dividing the functional space, they address the problems that the functional space is equipped with different topologies and that anomalous curves are characterized by different modes. Liefa Liao and Bin Luo[9] introduced dimension entropy as the basis for selecting isolation attributes and isolation points in the training process, called E-iForest.

3 PROPOSED ALGORITHM
3.1 Workflow of The Proposed Method
Inspired by the related work, we prune the dataset instead of using the original dataset as the data source, which greatly reduces the amount of data that needs to be processed. In order to solve the problem that existing outlier detection algorithms are sensitive only to global outliers and have high time complexity, an integrated method based on iForest and LOF is proposed, and a mining - pruning - detection framework is applied to improve the detection accuracy and efficiency. Firstly, iForest is used to calculate the anomaly score of each data point in the forest. Then, the apparently normal data are pruned to obtain the outlier candidate set. Finally, LOF is applied to calculate the lof values of the data objects in the set to further distinguish the outlier candidates.
Fig.1 shows the overall workflow of the method, which mainly includes the following three steps (a code sketch of the workflow is given after the figure):

1) iForest: Based on the raw dataset, iForest is applied to construct an isolation forest. Then the average path length of each data point is calculated by traversing each tree in the forest, and the anomaly score is obtained.
2) Pruning: Some normal data points are pruned off according to the pruning threshold to obtain the outlier candidate set.
3) LOF: The lof value of each data point in the outlier candidate set is calculated, and the first n points with the highest lof values are selected as the target outliers.

Figure 1: Workflow of the proposed method.
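For illustration, the workflow can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes scikit-learn and NumPy, substitutes the library versions of iForest and LOF for the procedures detailed in Sections 3.2-3.4, and all parameter values (theta, n_outliers, k) are placeholders.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def if_lof(X, theta=0.1, n_outliers=50, k=20):
    # Step 1 (iForest): score every point; after negation, higher = more anomalous.
    iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
    iforest.fit(X)
    scores = -iforest.score_samples(X)

    # Step 2 (Pruning): keep only the theta fraction with the highest anomaly
    # scores as the outlier candidate set; the rest are treated as normal.
    n_keep = max(int(np.ceil(theta * len(X))), n_outliers)
    candidates = np.argsort(scores)[-n_keep:]

    # Step 3 (LOF): rank the candidates by lof value and return the top n.
    lof = LocalOutlierFactor(n_neighbors=min(k, len(candidates) - 1))
    lof.fit(X[candidates])
    lof_values = -lof.negative_outlier_factor_     # larger = more outlying
    return candidates[np.argsort(lof_values)[-n_outliers:]]

Here theta plays the role of the pruning threshold derived in Section 3.3, so that the expensive LOF step only ever sees the candidate set.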


3.2 Isolation Forest: Outlier Candidate Mining
The Isolation Forest (iForest) is applied to initially process the dataset, aiming at mining outlier candidates. It is an ensemble-based unsupervised outlier detection method with linear time complexity and high precision. The forest consists of a group of binary trees constructed from randomly selected attributes of the dataset. Each tree in the forest is then traversed to calculate the anomaly score of each data point. The isolation tree's construction algorithm is defined as the function iTree(X, e, h), where X is the input dataset, e the current tree height, and h the height limit. The steps of the iForest construction algorithm are as follows:

Algorithm 1 iForest(X, t, s)
Input: X - input dataset, t - number of trees, s - subsampling size.
Output: a set of t iTrees.
1: initialize Forest
2: set height limit l = ceiling(log2 s)
3: for i = 1 to t do
4:    X' <- sample(X, s)
5:    Forest <- Forest ∪ iTree(X', 0, l)
6: end for
7: return Forest
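For concreteness, below is a small Python rendering of Algorithm 1 (assuming NumPy). The interior of iTree and the anomaly score are not spelled out in the text above; the versions here follow the standard isolation-forest formulation of Liu et al.[10] (random attribute, random split value, score 2^(-E(h(x))/c(s))), and all names and defaults are illustrative.

import numpy as np

def itree(X, e, h):
    # Grow one isolation tree: stop at the height limit or when X cannot split.
    if e >= h or len(X) <= 1:
        return {"size": len(X)}
    q = np.random.randint(X.shape[1])                     # random attribute
    p = np.random.uniform(X[:, q].min(), X[:, q].max())   # random split value
    return {"q": q, "p": p,
            "left":  itree(X[X[:, q] <  p], e + 1, h),
            "right": itree(X[X[:, q] >= p], e + 1, h)}

def iforest(X, t=100, s=256):
    # Algorithm 1: t isolation trees, each built on a subsample of size s.
    forest, l = [], int(np.ceil(np.log2(s)))
    for _ in range(t):
        sample = X[np.random.choice(len(X), min(s, len(X)), replace=False)]
        forest.append(itree(sample, 0, l))
    return forest

def c(n):
    # Average path length of an unsuccessful BST search (normalisation term from [10]).
    return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n if n > 2 else (1 if n == 2 else 0)

def path_length(x, node, e=0):
    # Depth at which point x is isolated in one tree, adjusted for unsplit leaves.
    if "size" in node:
        return e + c(node["size"]) if node["size"] > 1 else e
    branch = "left" if x[node["q"]] < node["p"] else "right"
    return path_length(x, node[branch], e + 1)

def anomaly_score(x, forest, s=256):
    # Average path length over the forest mapped to (0, 1); close to 1 = likely outlier.
    e_h = np.mean([path_length(x, tree) for tree in forest])
    return 2 ** (-e_h / c(s))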
3.3 Pruning: Outlier Candidate Selection
The purpose of the pruning strategy is to prune out the apparently normal data points while preserving the outlier candidate set for further processing. Existing algorithms cannot accurately set a threshold to determine whether a certain point should be put into the candidate set, because the proportion of outliers is unknown. According to practical experience, outliers generally increase the dispersion of a dataset. Therefore, this paper defines the outlier coefficient to measure the degree of dispersion of the dataset and obtains the pruning threshold by calculation.
Specify a dataset D = {d_1, d_2, ..., d_n}. Here, n is the sample number of D, d_i is an attribute in D, and d_i = {x_1, x_2, ..., x_n}, where x_j is a data value of the attribute d_i. The outlier coefficient of the attribute is defined as:

    f(d_i) = \frac{\sqrt{\sum_j (x_j - \bar{x})^2 / n}}{\bar{x}} = \sqrt{\frac{\sum_j (x_j - \bar{x})^2}{n \bar{x}^2}}    (1)

Here, \bar{x} is the mean of the attribute d_i, and f(d_i) is used to measure the degree of dispersion of the attribute d_i. The outlier coefficient of each attribute in the dataset is calculated to obtain the outlier coefficient vector D_f of the dataset, recorded as:

    D_f = (f(d_1), f(d_2), ..., f(d_n))    (2)

Through the outlier coefficient vector, the pollution amount of the dataset, that is, the trim threshold \theta_D, can be calculated. In the following, \theta_D represents the proportion of outliers in the dataset. Here, Top_m refers to the m values with the largest dispersion coefficients after sorting, and \alpha is an adjustment factor; \alpha and m depend on a comprehensive consideration of the size and distribution of the dataset.

    \theta_D = \frac{\alpha \sum \mathrm{Top}_m(D_f)}{m}    (3)

Therefore, we set different thresholds for the different characteristics of each dataset. According to the anomaly score of each point calculated by iForest, the 1 - \theta_D fraction of data points with the lowest scores is pruned, and the remaining data points constitute the outlier candidate set.
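The threshold computation and the pruning step can be sketched as follows (assuming NumPy). Eq. (3) is read here as alpha times the mean of the m largest outlier coefficients, which is one way to make it concrete; the function and parameter names are illustrative, not the authors' code.

import numpy as np

def pruning_threshold(X, m, alpha):
    # Outlier coefficient of each attribute, Eq. (1): dispersion relative to the mean.
    mean = X.mean(axis=0)
    f = np.sqrt(((X - mean) ** 2).sum(axis=0) / len(X)) / (np.abs(mean) + 1e-12)
    top_m = np.sort(f)[::-1][:m]          # m largest dispersion coefficients
    return alpha * top_m.sum() / m        # Eq. (3): estimated outlier proportion

def prune(scores, theta):
    # Keep the theta fraction of points with the highest iForest anomaly scores
    # as the outlier candidate set; the remaining 1 - theta fraction is pruned.
    n_keep = max(1, int(np.ceil(theta * len(scores))))
    return np.argsort(scores)[::-1][:n_keep]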


3.4 LOF: Accurate Outlier Detection
Local Outlier Factor (LOF) is a density-based outlier detection algorithm that finds outliers by calculating the local deviation of a given data point; it is suitable for outlier detection on unevenly distributed datasets. Whether a point is an outlier is judged from the density between the data point and its neighbor points: the lower the density of the point, the more likely it is to be identified as an outlier. The main definitions of LOF are as follows:

Definition 1. d(p, q): the distance from point p to point q.
Definition 2. k-distance: sort the distances from point p to the other data points; the distance from point p to the k-th nearest data point is recorded as k-dist(p).
Definition 3. k nearest neighbors: the set of data points whose distance to point p is not greater than k-dist(p), recorded as N_k(p).
Definition 4. reachability distance:

    reach-dist_k(p, r) = \max\{k\text{-}dist(r), d(p, r)\}    (4)

Definition 5. local reachability density (lrd): the reciprocal of the mean reachability distance between the data point p and its k nearest neighbors, defined as:

    lrd(p) = 1 \Big/ \left( \frac{\sum_{r \in N_k(p)} reach\text{-}dist_k(p, r)}{|N_k(p)|} \right)    (5)

Definition 6. local outlier factor (lof): the average of the ratio of the local reachability density of the neighbors of point p to the local reachability density of point p, defined as:

    lof(p) = \frac{\sum_{t \in N_k(p)} \frac{lrd(t)}{lrd(p)}}{|N_k(p)|}    (6)

The steps of the Local Outlier Factor algorithm are shown as follows:

Algorithm 2 LOF(k, m, D)
Input: k - number of nearest neighbors, m - number of outliers, D - outlier candidate dataset.
Output: top-m outliers.
1: for j = 1 to len(D) do
2:    compute k-dist(p)
3:    compute N_k(p)
4: end for
5: calculate reach-dist_k(p, r) and lrd(p)
6: calculate lof(p)
7: sort the lof values of all points in descending order
8: return the m data objects with the largest lof values, which are the outliers
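Below is a direct, unoptimized Python rendering of Definitions 1-6 and Algorithm 2 (a sketch assuming NumPy). Pairwise distances are computed in O(n^2) for clarity, which is tolerable only because the input is the pruned candidate set rather than the full dataset; names are illustrative.

import numpy as np

def lof_scores(D, k):
    # Pairwise distances between all points of the candidate set D.
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :k]                    # N_k(p)
    k_dist = np.take_along_axis(dist, knn, axis=1)[:, -1]    # k-dist(p)

    # reach-dist_k(p, r) = max{k-dist(r), d(p, r)} for every r in N_k(p), Eq. (4)
    reach = np.maximum(k_dist[knn], np.take_along_axis(dist, knn, axis=1))
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)                 # Eq. (5)

    # lof(p): average ratio of the neighbours' lrd to lrd(p), Eq. (6)
    return lrd[knn].mean(axis=1) / lrd

def top_m_outliers(D, k, m):
    # Algorithm 2: rank candidates by lof value and return the m largest.
    return np.argsort(lof_scores(D, k))[::-1][:m]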

4 EXPERIMENTS
In this section, we empirically evaluate the effectiveness of the proposed method on both synthetic and real-world datasets. Specifically, the experimental results are analyzed from three aspects: pruning efficiency, accuracy metrics and time cost. We also implemented iForest (IF)[10], traditional LOF[2], KMeans-LOF (K-LOF)[12], and R1SVM[5], and compared them with the proposed algorithm iForest-LOF (IF-LOF).

4.1 Datasets
4.1.1 Synthetic Datasets. Six real-world datasets are selected from the UCI machine learning repository[3] to construct synthetic datasets, the details of which are shown in columns 1-3 of Table 1. The category attribute is removed from the selected datasets to make the synthetic datasets more authentic. Since the datasets have no real anomaly labels, a random shift is used to preprocess the data: all data points are treated as normal objects, and outliers are generated with the following standard contamination procedure: randomly select a certain proportion of the data points and then move the values of the selected data attributes by 3 standard deviations (a code sketch of this procedure is given after Table 1). Column 4 of Table 1 shows the number and proportion of the outliers generated. For brevity, we write EMGPA for EMG Physical Action, EEGES for EEG Eye State, and MGT for Magic Gamma Telescope[4].

Table 1: Information of Synthetic Datasets

Name    Instances  Attributes  Outliers (%)
Yeast   1484       8           58 (3.9%)
EMGPA   10000      8           392 (3.92%)
EEGES   14980      15          589 (3.93%)
MGT     19020      10          744 (3.91%)
Avila   20867      10          823 (3.94%)
KEGG    53413      23          2102 (3.93%)
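The contamination step can be sketched as follows (assuming NumPy). The paper does not state which attributes of a selected point are shifted, so this sketch shifts every attribute of each selected point upward by 3 column standard deviations; the names and the default proportion are illustrative.

import numpy as np

def contaminate(X, proportion=0.039, rng=np.random.default_rng(0)):
    # Randomly pick a proportion of points and shift them by 3 standard
    # deviations per attribute; returns the shifted data and outlier labels.
    X = X.copy()
    n = len(X)
    idx = rng.choice(n, size=int(np.ceil(proportion * n)), replace=False)
    X[idx] += 3.0 * X.std(axis=0)       # assumed: shift every attribute upward
    y = np.zeros(n, dtype=int)
    y[idx] = 1                          # 1 marks a generated outlier
    return X, y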
4.1.2 Real-world Datasets. In this subsection, six different real-world datasets are used to demonstrate the application of this method. The datasets used are all freely accessible from the Outlier Detection Datasets collection and are shown in Table 2.

Table 2: Information of Real-world Datasets

Name       Instances  Attributes  Outliers (%)
Satellite  6435       36          2036 (32%)
Mnist      7603       100         700 (9.2%)
Shuttle    49097      9           3511 (7%)
ALOI       50000      27          1508 (3.016%)
Smtp       95156      3           30 (0.03%)
Skin       245057     3           50859 (2.075%)

4.2 Experimental Results
4.2.1 Pruning Efficiency. Pruning is a common technique in machine learning that improves the generalization ability of a model by removing data points in high-density areas and reducing over-fitting. Here it has the advantage of reducing the time and space complexity of the LOF stage. However, the accuracy of the model can be lost due to the pruning of data points with low contribution.
The goal of pruning is to prune away as many normal data points as possible while preserving all anomalous data points, so as to reduce the calculation of unnecessary lof values. The Pruning Number (PN) is the percentage of pruned data points, defined as the ratio of the number of pruned data points to the total number of data points. Given a high PN, the larger the Pruning Precision (PP), computed as PP = TP/(TP + FP), the better. Here, True Positive (TP) and False Positive (FP) are explained in Section 4.2.2.
To demonstrate the pruning efficiency of IF-LOF more intuitively, we perform pruning experiments on the 12 selected datasets and compare IF-LOF with K-LOF; the results are reported in Table 3 and Table 4.

Table 3: Effectiveness of Pruning Strategies on Synthetic Datasets

                      Pruning Precision      Pruning Number (%)
Synthetic datasets    IF-LOF    K-LOF        IF-LOF    K-LOF
Yeast                 0.5       0.0892       92.18%    56.20%
EMGPA                 0.5       0.0907       92.16%    56.80%
EEGES                 0.3932    0.0943       90.00%    58.28%
MGT                   0.5       0.0888       92.19%    56.01%
Avila                 0.3943    0.0826       90.00%    52.24%
KEGG                  0.5       0.1008       92.13%    60.95%

Table 4: Effectiveness of Pruning Strategies on Real-world Datasets

                      Pruning Precision      Pruning Number (%)
Real-world datasets   IF-LOF    K-LOF        IF-LOF    K-LOF
Satellite             0.3784    0.3669       40.00%    17.54%
Mnist                 0.1715    0.1415       50.01%    55.86%
Shuttle               0.3569    0.1531       80.00%    53.64%
ALOI                  0.0431    0.0480       40.00%    68.43%
Smtp                  0.0028    0.0007       92.01%    59.47%
KEGG                  0.3375    0.2829       40.25%    26.64%

As shown in Table 3, IF-LOF achieves both higher PP and higher PN than K-LOF on all six synthetic datasets. As shown in Table 4, apart from the ALOI dataset, the PP and PN of IF-LOF are generally higher than those of K-LOF on the remaining five real-world datasets; on the Mnist dataset, although the pruning number of IF-LOF is smaller than that of K-LOF, its PP is higher.
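PP and PN can be computed directly from the ground-truth labels and the retained candidate set. The sketch below interprets "classified as an anomaly" at the pruning stage as "retained in the candidate set"; it assumes NumPy, and the names are illustrative.

import numpy as np

def pruning_metrics(y_true, candidate_idx):
    # y_true: 1 for a true outlier, 0 for normal; candidate_idx: indices kept
    # after pruning. PN is the fraction of points pruned away; PP is the
    # fraction of retained candidates that are true outliers, TP / (TP + FP).
    n = len(y_true)
    kept = np.zeros(n, dtype=bool)
    kept[candidate_idx] = True
    tp = np.sum(y_true[kept] == 1)
    fp = np.sum(y_true[kept] == 0)
    pn = 1.0 - kept.sum() / n
    pp = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return pp, pn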


4.2.2 Accuracy Metric. Since the datasets used have ground truth, four criteria, namely accuracy, recall, precision, and F-Measure, are selected to measure the performance of all experimental methods.

Figure 2: Illustration of TP & FP & TN & FN.

In Fig.2, True Positive (TP) is the number of anomalies that are correctly classified as anomalies. True Negative (TN) is the number of normal events that are correctly classified as normal events. False Positive (FP) is the number of normal events that are wrongly classified as anomalies. False Negative (FN) is the number of anomalies that are wrongly classified as normal events.
Precision is the percentage of the reported anomalies that are correctly identified, denoted by:

    Precision = \frac{TP}{TP + FP}    (7)

Recall is the percentage of the real anomalies which are detected, expressed by:

    Recall = \frac{TP}{TP + FN}    (8)

Accuracy is the total proportion of all the correct predictions, which can be expressed as:

    Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (9)

F-Measure is the weighted harmonic mean of precision and recall, which can be given by:

    F\text{-}Measure = \frac{2TP}{2TP + FP + FN}    (10)
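For completeness, the four criteria of Eqs. (7)-(10) can be computed from the confusion-matrix counts as follows (a straightforward sketch assuming NumPy arrays of 0/1 labels):

import numpy as np

def detection_metrics(y_true, y_pred):
    # y_true / y_pred: 1 marks an outlier, 0 a normal point.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, accuracy, f_measure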
In general, the higher the Precision, Recall, and F-Measure, the better the outlier detection algorithm works. However, Precision and Recall are mutually constrained. In extreme cases, if only one outlier is detected, the Precision is 100% while the Recall is very low; if all data points are reported as outliers, the Recall is 100% while the Precision is very low. In Fig.3, although K-LOF has an FP value of 0 on the MGT dataset, which means its Precision equals TP/(TP + FP) = TP/TP = 1, its recall is lower than that of IF-LOF; therefore, after calculation, the F-Measure of IF-LOF is higher than that of K-LOF. On the Smtp dataset in Fig.4, the FP of IF is too large, resulting in a very low Precision and a relatively high Recall.

Figure 3: The confusion matrix set of synthetic datasets.

Figure 4: The confusion matrix set of real-world datasets.

As for the Accuracy, when the dataset is unevenly distributed or the normal and abnormal samples are unbalanced, the accuracy value will be large and the evaluation of the model is not comprehensive enough. In Fig.3 and Fig.4, the total number of TPs and TNs of R1SVM is large on all datasets, which results in a high Accuracy while the F-Measure is very low. Since the selected datasets contain a large proportion of normal samples, the accurately predicted normal samples are the majority, while the precision on the outliers is small. Therefore, Accuracy and F-Measure are used together as the evaluation indicators of the model, which helps to measure the effect of the experiments reasonably.

Table 5 shows the Accuracy and F-Measure of the five comparative experiments on the synthetic datasets. Apart from being slightly less effective than IF on the small dataset Yeast, IF-LOF has a better overall effect on the other five larger datasets. For example, on MGT and KEGG, the F-Measure of IF-LOF is 30% higher, and its performance improvement is more significant. Since IF-LOF filters out the apparently normal data points by pruning, the influence of all samples on the calculation of lof values is reduced, making the integrated method much higher than LOF in both Accuracy and F-Measure.


Figure 5: The Accuracy & F-Measure of synthetic datasets and real-world datasets.

Table 5: The Accuracy Metric of Synthetic Datasets (Accuracy, F-Measure)

IF-LOF IF LOF K-LOF R1SVM


Yeast 0.9892 0.8621 0.9919 0.8983 0.9542 0.4237 0.9838 0.7818 0.9427 0.2735
EMGPA 0.9970 0.9617 0.9926 0.9066 0.9744 0.6735 0.9910 0.8780 0.9315 0.1340
EEGES 0.9999 0.9983 0.9909 0.8846 0.9985 0.9813 0.9997 0.9958 0.9518 0.3871
MGT 0.9999 0.9987 0.9787 0.7301 0.9703 0.6250 0.9962 0.9483 0.9380 0.2154
Avila 0.9957 0.9453 0.9873 0.8382 0.9758 0.6926 0.9806 0.7541 0.9319 0.1403
KEGG 0.9974 0.9666 0.9769 0.7083 0.9240 0.0420 0.9277 0.0814 0.9250 0.0398

Table 6: The Accuracy Metric of Real-world Datasets (Accuracy, F-Measure)

IF-LOF IF LOF K-LOF R1SVM


Satellite 0.8096 0.6497 0.7119 0.5447 0.7778 0.6488 0.7722 0.6400 0.6625 0.4666
Mnist 0.8990 0.4320 0.8765 0.3298 0.8918 0.4126 0.8882 0.3929 0.8548 0.2438
Shuttle 0.9206 0.4447 0.9872 0.9109 0.9036 0.3261 0.9034 0.3245 0.9299 0.5097
ALOI 0.9644 0.4076 0.9418 0.0358 0.9433 0.061 0.9430 0.0544 0.9435 0.0648
Smtp 0.9999 0.7692 0.9702 0.0146 0.9995 0.3333 0.9991 0.2407 0.9994 0.1017
Skin 0.7298 0.3398 0.6870 0.2452 0.6670 0.1976 0.7026 0.2836 0.6715 0.2084


As for Accuracy, K-LOF is only slightly lower than IF-LOF, while its F-Measure on the last four datasets is much lower than that of IF-LOF due to its low pruning efficiency. The randomization process of R1SVM breaks the characteristics of the synthesized datasets, resulting in a relatively low F-Measure.
Table 6 shows the Accuracy and F-Measure of the five comparative experiments on the real-world datasets. Except for Shuttle, IF-LOF performs better than IF on the remaining 5 datasets. This happens because LOF is not well suited to outlier detection on Shuttle, so the F-Measure of IF-LOF, which integrates LOF with IF, is much lower than that of IF alone. However, on ALOI and Smtp, the F-Measure of IF-LOF is much higher than that of IF, and its performance improvement is more significant. Therefore, IF-LOF is superior to the other algorithms overall.
As can be seen in Fig.5, IF-LOF provides very stable and efficient results on the different datasets and produces the highest Accuracy and F-Measure. All algorithms perform better on the synthetic datasets than on the real-world datasets: the synthetic datasets are constructed by randomly adding deviations to real-world data, so the difference between normal and abnormal points is more obvious, which explains this phenomenon; in the real-world datasets this difference is not as significant.

4.2.3 Time Cost. The time cost refers to the time it takes to perform outlier detection on a standard hardware/software system, including the time of data preprocessing and the computation time of the detection.
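As an illustration of how such timings can be collected, a minimal wall-clock wrapper is shown below; the hardware/software configuration of the original experiments is not reproduced here, and the detector argument is any callable built from the sketches above.

import time

def timed(detector, X):
    # Measure the wall-clock time of one outlier-detection run, including any
    # preprocessing performed inside `detector`, which should return the outliers.
    start = time.perf_counter()
    result = detector(X)
    elapsed = time.perf_counter() - start
    return result, elapsed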
Figure 6: Evaluation of time cost with synthetic datasets.

As shown in Fig.6, K-LOF is the least efficient at any dataset scale. Once the amount of data increases slightly, the efficiency of LOF drops sharply. R1SVM is the most efficient on small-scale datasets, but as the dataset grows larger its advantage gradually shrinks compared to IF and IF-LOF. Both IF and IF-LOF demonstrate their efficiency on large-scale datasets. Although the time cost of IF-LOF is slightly higher than that of IF, its actual computation time is far less than that of LOF and K-LOF.

Figure 7: Evaluation of time cost with real-world datasets.

Fig.7 shows the pruning time and detection time of IF-LOF, IF, LOF, K-LOF, and R1SVM on the six real-world datasets. In Fig.7, R1SVM is the most efficient except on the Skin dataset, but its accuracy is too low. Although IF-LOF is not as efficient as IF, their processing times are very close, and on some datasets IF-LOF is much more efficient than LOF.
In summary, the above experimental results show that the integrated method IF-LOF performs better than IF, LOF, K-LOF and R1SVM. It utilizes different algorithms to achieve a good balance of accuracy and computational complexity, resulting in better outlier detection with lower time complexity.

5 CONCLUSIONS
This paper proposes an integrated algorithm of iForest and LOF to perform outlier detection on multiple datasets. Firstly, iForest is used to construct binary trees to form a forest, and the anomaly score of each data point in the forest is calculated. Secondly, according to the pruning strategy, the apparently normal samples are filtered out to obtain the outlier candidate set. Finally, the data objects in the set that correspond to the top lof values are determined to be the outliers. This method reduces the time complexity of LOF by avoiding calculating the lof value of all data objects in the raw dataset, and it overcomes the weakness of iForest in dealing with local outliers.


In order to verify the effect of the proposed integrated algorithm, we conduct comparative experiments on six synthetic datasets and six real-world datasets, and evaluate the outlier detection algorithms from three aspects: pruning efficiency, accuracy metrics and time cost. The experimental results confirm the accuracy and effectiveness of the proposed integrated algorithm, which is superior to IF, LOF, K-LOF and R1SVM.

ACKNOWLEDGMENTS
This work was supported by the National Key R&D Program of China under Grant No.2018YFC0704300.

REFERENCES
[1] Jorge Edmundo Alpuche Aviles, Maria Isabel Cordero Marcos,
David Sasaki, Keith Sutherland, Bill Kane, and Esa Kuusela. 2018.
Creation of knowledge-based planning models intended for large
scale distribution: Minimizing the effect of outlier plans. Journal
of applied clinical medical physics 19, 3 (2018), 215–226.
[2] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg
Sander. 2000. LOF: identifying density-based local outliers. In
ACM sigmod record, Vol. 29. ACM, 93–104.
[3] D. Dua and E. Karra Taniskidou. 2017. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2017).
[4] Jakub Dvořák and Petr Savickỳ. 2007. Softening splits in decision
trees using simulated annealing. In International Conference on
Adaptive and Natural Computing Algorithms. Springer, 721–729.
[5] Sarah Erfani, Mahsa Baktashmotlagh, Sutharshan Rajasegarar,
Shanika Karunasekera, and Chris Leckie. 2015. R1SVM: A ran-
domised nonlinear approach to large-scale anomaly detection.
(2015).
[6] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and
Sergei Vassilvitskii. 2017. Local search methods for k-means with
outliers. Proceedings of the VLDB Endowment 10, 7 (2017),
757–768.
[7] Riyaz Ahamed Ariyaluran Habeeb, Fariza Nasaruddin, Abdullah
Gani, Ibrahim Abaker Targio Hashem, Ejaz Ahmed, and Muham-
mad Imran. 2018. Real-time big data processing for anomaly
detection: a survey. International Journal of Information Man-
agement (2018).
[8] Raihan Ul Islam, Mohammad Shahadat Hossain, and Karl Ander-
sson. 2018. A novel anomaly detection algorithm for sensor data
under uncertainty. Soft Computing 22, 5 (2018), 1623–1639.
[9] Liefa Liao and Bin Luo. 2018. Entropy Isolation Forest Based on
Dimension Entropy for Anomaly Detection. In International Sym-
posium on Intelligence Computation and Applications. Springer,
365–376.
[10] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2012. Isolation-
based anomaly detection. ACM Transactions on Knowledge
Discovery from Data (TKDD) 6, 1 (2012), 3.
[11] Zhaoli Liu, Tao Qin, Xiaohong Guan, Hezhi Jiang, and Chenxu
Wang. 2018. An integrated method for anomaly detection from
massive system logs. IEEE Access 6 (2018), 30602–30611.
[12] Khaled Ali Othman, Md Nasir Sulaiman, Norwati Mustapha, and
Nurfadhlina Mohd Sharef. 2017. Local Outlier Factor in Rough
K-Means Clustering. PERTANIKA JOURNAL OF SCIENCE
AND TECHNOLOGY 25 (2017), 211–222.
[13] Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. 2018.
Learning representations of ultrahigh-dimensional data for random
distance-based outlier detection. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery &
Data Mining. ACM, 2041–2050.
[14] Guillaume Staerman, Pavlo Mozharovskyi, Stephan Clémençon,
and Florence d’Alché Buc. 2019. Functional Isolation Forest.
arXiv preprint arXiv:1904.04573 (2019).
[15] Jialing Tang and Henry YT Ngan. 2016. Traffic outlier detection
by density-based bounded local outlier factors. Information
Technology in Industry 4, 1 (2016), 6.
[16] Xian Teng, Muheng Yan, Ali Mert Ertugrul, and Yu-Ru Lin. 2018. Deep into Hypersphere: Robust and Unsupervised Anomaly Discovery in Dynamic Networks. In IJCAI. 2724–2730.
[17] Bing Tu, Chengle Zhou, Wenlan Kuang, Longyuan Guo, and Xianfeng Ou. 2018. Hyperspectral imagery noisy label detection by spectral angle local outlier factor. IEEE Geoscience and Remote Sensing Letters 15, 9 (2018), 1417–1421.
[18] Prabha Verma, Prashant Singh, and RDS Yadava. 2017. Fuzzy c-means clustering based outlier detection for SAW electronic nose. In 2017 2nd International Conference for Convergence in Technology (I2CT). IEEE, 513–519.
[19] Yizhou Yan, Lei Cao, and Elke A Rundensteiner. 2017. Scalable top-n local outlier detection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
