Developing An Unsupervised Real-Time Anomaly Detection Scheme For Time Series With Multi-Seasonality
THIS IS A PREPRINT VERSION OF THE WORK 10.1109/TKDE.2020.3035685 PUBLISHED IN THE IEEE TKDE BY ©IEEE

Corresponding author: Ligang He. W. Wu, L. He and Y. Su are with the Department of Computer Science, University of Warwick. W. Lin is with the School of Computer Science and Engineering at the South China University of Technology. Y. Cui is with the Research Institute of Worldwide Byte Information Security. C. Maple is with the Warwick Manufacturing Group, University of Warwick. S. Jarvis is with the College of Engineering and Physical Sciences, University of Birmingham.

Abstract—On-line detection of anomalies in time series is a key technique used in various event-sensitive scenarios such as robotic system monitoring, smart sensor networks and data center security. However, the increasing diversity of data sources and the variety of demands make this task more challenging than ever. Firstly, the rapid increase in unlabeled data means supervised learning is becoming less suitable in many cases. Secondly, a large portion of time series data have complex seasonality features. Thirdly, on-line anomaly detection needs to be fast and reliable. In light of this, we have developed a prediction-driven, unsupervised anomaly detection scheme, which adopts a backbone model combining the decomposition and the inference of time series data. Further, we propose a novel metric, Local Trend Inconsistency (LTI), and an efficient detection algorithm that computes LTI in a real-time manner and scores each data point robustly in terms of its probability of being anomalous. We have conducted extensive experimentation to evaluate our algorithm with several datasets from both public repositories and production environments. The experimental results show that our scheme outperforms existing representative anomaly detection algorithms in terms of the commonly used metric, Area Under Curve (AUC), while achieving the desired efficiency.

Index Terms—time series, seasonality, anomaly detection, unsupervised learning

I. INTRODUCTION

Time series data sources have been of interest in a vast variety of areas for many years; the nature of time series data was examined in a seminal study by Yule [1], and the techniques have been applied to areas such as econometric [2] and oceanographic data [3] since the 1930s. However, in an era of hyperconnectivity, big data and machine intelligence, new technical scenarios are emerging such as autonomous driving, edge computing and Internet of Things (IoT). Analysis of such systems poses new challenges to the detection of anomalies in time series data. Further, for a wide range of systems which require 24/7 monitoring services, it has become crucial to have detection techniques that can provide early, reliable reports of anomalies. In cloud data centers, for example, a distributed monitoring system usually collects a variety of log data from the virtual machine level to the cluster level on a regular basis and sends them to a central detection module, which then analyzes the aggregated time series to detect any anomalous events including hardware failures, unavailability of services and cyber attacks. This requires a reliable on-line detector with strong sensitivity and specificity. Otherwise, inefficient detection may cause unnecessary maintenance costs.

Several classes of schemes have been applied to the problem of anomaly detection for time series data. In certain cases decent results can be achieved by traditional methods such as outlier detection [4][8][9][10], pattern (segment) extraction [12][13][14][15] and sequence mapping [18][20][21]. However, we are facing a growing number of new scenarios and applications which produce large volumes of time series data with unprecedented complexity, posing challenges that traditional anomaly detection methods cannot address effectively. First, more and more time series data are being produced without labels since data labeling/annotation is usually very time-consuming and costly. Sometimes it is also unrealistic or impossible to acquire reliable labels when their correctness has to be guaranteed. Second, some applications may produce multi-channel series with complex features such as multi-period seasonality (i.e., multiple seasonal patterns, such as yearly or monthly, within one channel), long periodicity, fairly unpredictable channels and different seasonality between channels. As a result, learning these patterns requires effective seasonality discovery and a strong ability to generalize. Third, the process is commonly required to be fast enough to support instant reporting or alarming once an unexpected situation occurs. The capability of on-line detection is especially important in a wide range of event-sensitive scenarios such as medical and industrial process control systems.

In this paper, we propose a predictive solution to detecting anomalies effectively in time series with complex seasonality. The fundamental idea is to inspect the data samples as they arrive and match the data samples with an ensemble of forecasts made chronologically. Specifically, our solution comprises an augmented forecasting model and a novel detection algorithm that exploits the predictions of local sequences made by the underlying forecasting model. We built a frame-to-sequence Gated Recurrent Unit (GRU) network while extending its input with seasonal terms extracted by decomposing the time series of each sample channel. The integration of the seasonal features can alleviate the negative impact of anomalous samples in the training data since the anomalous samples have minor
impact on the long-term periodic patterns. For these reasons, our prediction framework does not require labels (specifying which data are normal or abnormal) or uncontaminated training data (i.e., our solution tolerates polluted/abnormal training samples).

After predicting local sequences (i.e., the output of the forecasting model), we use a novel method to weight the ensemble of different forecasts based on the reliability of their forecast sources and make it a chronological process to fit the on-line detection. The weight of each forecast is determined dynamically during the process of detection by scoring each forecast source (i.e., the forecast made based on this data source), which reflects how likely the predictions made by a forecast source are to be trustworthy. Based on the above ensemble, we propose a new metric, termed Local Trend Inconsistency (LTI), for measuring the deviation of an actual sequence from the predictions in real-time, and assign an anomaly score to each of the newly arrived data points (which we also call frames) in order to quantify the probability that a frame is anomalous.

We also propose a method to map the LTI value of a frame to its Anomaly Score (AS) by a logistic-shaped function. The mapping further differentiates anomalies and normal data. In order to determine the logistic mapping function, we propose a method to automatically determine the optimal values of the fitting parameters in the logistic mapping function. The AS value of a frame in turn becomes the weight of its impact on the detection of future frames. This makes our LTI metric robust to the anomalous frames in the course of detection and significantly mitigates the potential impact of anomalous samples on the detection results of the future frames. This feature also enables our algorithm to work chronologically without maintaining a large reference database or caching too many historical data frames. To the best of our knowledge, the existing prediction-driven detection schemes do not take into account the reliability of the forecast sources.

The main contributions of our work are as follows:
• We designed a frame-to-sequence forecasting model integrating a GRU network with time series decomposition (using Prophet, an additive time series model developed by Facebook [29]) to enable contamination-tolerant training on multi-seasonal time series data without any labels.
• We propose a new metric termed Local Trend Inconsistency (LTI), and based on this metric we further propose an unsupervised detection algorithm to score the probability of data anomaly. A practical method is also proposed for fitting the scoring function.
• We mathematically present the computation of LTI in the form of matrix operations and prove the possibility of parallelization for further speeding up the detection procedure.
• We conducted extensive experiments to evaluate the proposed scheme on two public datasets from the UCI data repository and a more complex dataset from a production environment. The results show that our solution outperforms the existing algorithms significantly with low detection overhead.

The rest of this paper is organized as follows: Section II discusses a number of studies related to anomaly detection. In Section III, we introduce Local Trend Inconsistency as the key metric in our unsupervised anomaly detection scheme. We then systematically present our unsupervised anomaly detection solution in Section IV, including the backbone model for prediction and a scoring algorithm for anomaly detection. We present and analyze the experimental results in Section V, and finally conclude this paper in Section VI.

II. RELATED WORK

The term anomaly refers to a data point that significantly deviates from the rest of the data, which are assumed to follow some distribution or pattern. There are two main categories of approaches for anomaly detection: novelty detection and outlier detection. While novelty detection (e.g., classification methods [39][40][41][42]) requires the training data to be classified, outlier detection (e.g., clustering, principal component analysis [20] and feature mapping methods [43][44]) does not need prior knowledge of classes (i.e., labels) and thus is also known as unsupervised anomaly detection. The precise terminology and definitions may vary across sources. We use the same taxonomy as Ahmed et al. did in reference [45], whilst in the survey presented by Hodge and Austin [38] unsupervised detection is classified as a subtype of outlier detection. The focus of our work is on unsupervised anomaly detection since we aim to design a more generic scheme and thus do not assume that labels are available.

In the detection of time series anomalies, we are interested in discovering abnormal, unusual or unexpected records. In a time series, an anomaly can be detected within the scope of a single record or as a subsequence/pattern. Many classical algorithms can be applied to detect a single-record anomaly as an outlier, such as the One Class Support Vector Machine (OCSVM) [4], a variant of SVM that exploits a hyperplane to separate normal and anomalous data points. Zhang et al. [5] implemented a network performance anomaly detector using OCSVM with the Radial Basis Function (RBF), which is a commonly used kernel for SVM. Maglaras and Jiang [6] developed an intrusion detection module based on K-OCSVM, the core of which is an algorithm that performs K-means clustering iteratively on detected anomalies. Shang et al. [7] applied Particle Swarm Optimization (PSO) to find the optimal parameters for OCSVM, which they used to detect abnormalities in TCP traffic. In addition, Radovanović et al. [9] investigated the correlation between hub points and outliers, providing useful guidance on using reverse nearest-neighbor counts to detect anomalies. Liu et al. [8] found that anomalies are susceptible to the property of "isolation" and thus proposed Isolation Forest (iForest), an anomaly detection algorithm based on the structure of random forest. Taking advantage of iForest's flexibility, Calheiros et al. [10] adapted it to dynamic failure detection in large-scale data centers.

For anomalous sequence or pattern detection, there are a number of classical methods available such as box modeling [11], symbolic sequence matching [18] and pattern extraction
[14][15]. For example, Huang et al. [19] proposed a scheme to identify the anomalies in VM live migrations by combining the extended Local Outlier Factor (LOF) and Symbolic Aggregate ApproXimation (SAX).

Recent advances in machine learning techniques have inspired prediction-driven solutions for intelligent surveillance and detection systems (e.g., [48][49]). A prediction-driven anomaly detection scheme is often a sliding window-based scheme, in which future data values are predicted and then the predictions are compared against the actual values when the data arrive. This type of anomaly detection scheme has been attracting much attention recently thanks to the remarkable performance of recurrent neural networks (RNNs) in prediction/forecasting tasks. Filonov et al. [33] proposed a fault detection framework that relies on a Long Short Term Memory (LSTM) network to make predictions. The set of predictions along with the measured values of data are then used to compute the error distribution, based on which anomalies are detected. Similar methodologies are used by [34] and [24]. LSTM-AD [34] is also a prediction scheme based on multiple forecasts. In LSTM-AD the abnormality of data samples is evaluated by analyzing the prediction error and the corresponding probability in the context of an estimated Gaussian error distribution obtained from the training data. However, the drawback of LSTM-AD is that it is prone to the contamination of training data. Therefore, when the training data contains both normal and anomalous data, the accuracy of the prediction model is likely to be affected, which consequently makes the anomaly detection less reliable.

Malhotra et al. [23] adopt a different architecture named encoder-decoder, which is based on the notion that only normal sequences can be reconstructed by a well-trained encoder-decoder network. A major limitation of their model is that an unpolluted training set must be provided. As revealed by Pascanu et al. [25], RNNs may struggle in learning complex seasonal patterns in time series, particularly when some channels of the series have long periodicity (e.g., monthly and yearly). A possible solution is to decompose the series before feeding it into the network. Shi et al. [35] proposed a wavelet-BP (Back Propagation) neural network model for predicting wind power. They decompose the input time series into frequency components using the wavelet transform and build a prediction network for each of them. To forecast time series with complex seasonality, De Livera et al. [37] adopt a novel state space modeling framework that incorporates seasonal decomposition methods such as the Fourier representation. A similar model was implemented by Gould et al. [36] to fit hourly and daily patterns in utility loads and traffic flow data.

Ensuring low overhead is essential for real-time anomaly detection. For example, Gu et al. [16] proposed an efficient motif (frequently repeated patterns) discovery framework incorporating an improved SAX indexing method as well as a trivial match skipping algorithm. Their experimental results on the CPU host load series show excellent time efficiency. Zhu et al. [17] propose a new method for locating similar subsequences as well as a parallel approach using GPUs to accelerate Dynamic Time Warping (DTW) for time series pattern discovery. Similarly, parallel algorithms (e.g., [50][51][52]) have been applied to several forms of machine learning models for efficiency boost.

III. LOCAL TREND INCONSISTENCY

In this section, we first introduce a series of basic notions and frequently-used symbols, then define a couple of distance metrics, and finally present the core concept in our anomaly detection scheme, Local Trend Inconsistency (LTI).

In some systems, more than one data collection device is deployed to gather information from multiple variables relating to a common entity simultaneously, which consequently generates multi-variate time series. In this paper we call them multi-channel time series.

Definition 1: A channel is the full-length sequence of a single variable that comprises the feature space of a time series.

For the sake of convenience, we define a frame as follows. This concept of a frame is inspired by, but is more general than, a frame in video processing (since a video clip can be reckoned as a time series of images).

Definition 2: A frame is the data record at a particular point of time in a series. A frame is a vector in a multi-channel time series, or a scalar value in a single-channel time series.

Most previous schemes detect anomalies by analyzing the data items in a time series as separate frames. However, in our approach we attempt to conduct the analysis from the perspective of local sequences.

Definition 3: A local sequence is a fragment of the target time series; a local sequence at frame x is defined as a fragment of the series spanning from a previous frame to frame x.

For clarity, we list all the symbols frequently used in this paper in Table I.

TABLE I
LIST OF SYMBOLS

Symbol            Description
$X$               A time series X
$X(t)$            The t-th frame of time series X
$X^{(c)}$         The c-th channel of time series X
$X^{(c)}(t)$      The c-th component of the t-th frame of time series X
$x^{(i)}$         The i-th feature of frame x
$\hat{x}_k$       The forecast of the frame x predicted by frame k
$S$               An actual local sequence from the target time series
$S_k$             A local sequence predicted by frame k
$S(i)$            The i-th frame in local sequence S
$S(i, j)$         An actual local sequence spanning from frame i to j
$S_k(i, j)$       A local sequence predicted by k spanning from frame i to j

Euclidean Distance and Dynamic Time Warping (DTW) Distance are commonly used to measure the distance between two vectors. However, the scale of Euclidean Distance largely depends on the dimensionality, i.e., vector length. DTW distance can measure sequence similarity, but cannot produce length-independent results. With its relatively high time complexity ($O(n^2 m)$ for m-dimensional sequences of length n), DTW is often applied to sequence-level analysis, in which the target is a sequence of frames or a pattern of varying length. However, our work aims to perform the frame-wise,
on-line detection, i.e., detect whether a frame is anomalous as the frame arrives.

Therefore, in this paper we use a modified form of Euclidean distance, called Dimension-independent Frame Distance (DFDist) as formulated in Eq. (1), to measure the distance between two frames x and y:

$$\mathrm{DFDist}(x, y) = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - y^{(i)}\right)^2 \tag{1}$$

where m is the number of dimensions (i.e., number of channels) and $x^{(i)}$ and $y^{(i)}$ are the i-th components of frame x and frame y, respectively. We do not take the square root of the result. This does not impact the effectiveness of our approach, but makes it easier to handle when we transform all computations into matrix operations at the later stage of the processing. Also, the desired scale (i.e., DFDist ∈ [0, 1]) of the distance still holds for normalized data.

With DFDist, we can further measure the distance between two local sequences of the same length. The desired metric for sequence distance should be independent of the length of the sequences as we want to have a unified scale for any pair of sequences. We formulate the Length-independent Sequence Distance (LSDist) between two sequences $S_X$ and $S_Y$ of the same length in Eq. (2), where L is the length of the two local sequences.

$$\mathrm{LSDist}(S_X, S_Y) = \frac{1}{L} \sum_{i=1}^{L} \mathrm{DFDist}\left(S_X(i), S_Y(i)\right) \tag{2}$$

Although the definition of LSDist already provides a unified scale of distance, the temporal information of the time series data is neglected. Assuming we are detecting the anomaly of the event at time t, we need to compare the local sequence at frame t with a ground truth sequence (assume there is one) to see if anything goes wrong in the latest time window. If we use LSDist as the metric, then every time point is regarded as being equally important. However, this does not comply in practice with the rule of time decay, namely, the most recent data point typically has the greatest reference value and also the greatest impact on what will happen at the next time point. Therefore, we refine LSDist by weighting each term and adding a normalization factor. The Weighted Length-independent Sequence Distance (WLSDist) is defined in Eq. (3), where $d_i$ is the weight of time decay for frame i and $D_L$ is the normalization factor (so that WLSDist remains in the same scale as LSDist).

$$\mathrm{WLSDist}(S_X, S_Y) = \frac{\sum_{i=1}^{L} d_i \cdot \mathrm{DFDist}\left(S_X(i), S_Y(i)\right)}{D_L} \tag{3}$$

Time decay is applied on the basis that the two sequences are chronologically aligned. In this paper, we use exponentially decaying weights, which is similar to the exponential moving average method [46]:

$$d_i = e^{-(L-i)}, \quad i = 1, 2, \ldots, L \tag{4}$$

where i denotes the frame index and $L - i$ is the temporal distance (with i = L being the current frame). Hence, the corresponding normalization factor $D_L$ in Eq. (3) is the summation of a geometric series of length L:

$$D_L = \sum_{i=1}^{L} e^{-(L-i)} = \frac{1 - e^{-L}}{1 - e^{-1}} \tag{5}$$

where L is the sequence length.

Ideally it is easy to identify the anomalies by calculating WLSDist between the target (such as a local sequence or frame) and the ground truth. However, this approach is not feasible if the labels are unavailable (i.e., there is no ground truth). A possible solution is to replace the ground truth with an expectation, which is typically obtained by using time series forecasting methods [22][34]; this is the basic idea of the so-called prediction-driven anomaly detection schemes. However, a critical problem with such a prediction-driven scheme is the reliability of the forecast. On the one hand, prediction error is inevitable. On the other hand, the predictions made based on the historical frames, which may include anomalous frames, can be unreliable. This poses a great challenge for prediction-driven anomaly detection schemes.

In view of the above problems, we propose a novel, reliable prediction scheme, which makes use of multi-source forecasting. Unlike previous studies that use frame-to-frame predictors, our scheme makes a series of forecasts at different time points (i.e., from different sources) by building a frame-to-sequence predictor. The resulting collection of forecasts forms a common expectation from multiple sources for the target. When the target arrives and if it deviates from the common expectation, it is deemed that the target is likely to be an anomaly. This is the underlying principle of our unsupervised anomaly detection.

In order to quantitatively measure how far the target deviates from the collection of expectations obtained from multiple sources, we propose a metric we term the Local Trend Inconsistency (LTI). LTI takes into account the second challenging issue discussed above (i.e., there may exist anomalous frames in history) by weighting the prediction made based on a source (i.e., a frame at a previous time point) with the probability of the source being normal.

For a frame t (by which we refer to the frame arriving at time point t), LTI(t) is formally defined in Eq. (6), where $S(i+1, t)$ is the actual sequence from frame i + 1 to frame t, and $S_i(i+1, t)$ is the sequence of the same span predicted by frame i (i.e., the prediction made when frame i arrives). L is the length of the prediction window, which is a hyper-parameter determining the maximum length of the predicted sequence and also the number of sources that make the predictions (i.e., the number of predictions/expectations) of the same target. $P(i)$ denotes the probability of frame i being normal.

$Z_t$ is the normalization factor for frame t defined as the sum of all the probabilistic weights shown in Eq. (7). $Z_t$ is used to normalize the value of LTI(t) to the range of [0, 1].
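The distance definitions in Eqs. (1) to (5) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: frames are represented as lists of floats (a single-channel frame is a one-element list), sequences are lists of frames, and the function names `df_dist`, `ls_dist`, `decay_weights` and `wls_dist` are hypothetical.

```python
import math


def df_dist(x, y):
    """Eq. (1): mean squared difference between two frames (no square root)."""
    if len(x) != len(y):
        raise ValueError("frames must have the same number of channels")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def ls_dist(sx, sy):
    """Eq. (2): unweighted mean of frame distances over two aligned sequences."""
    if len(sx) != len(sy):
        raise ValueError("sequences must have the same length")
    return sum(df_dist(a, b) for a, b in zip(sx, sy)) / len(sx)


def decay_weights(length):
    """Eq. (4): d_i = exp(-(L - i)) for i = 1..L, so the latest frame has weight 1."""
    return [math.exp(-(length - i)) for i in range(1, length + 1)]


def wls_dist(sx, sy):
    """Eq. (3): time-decay-weighted mean of frame distances, normalized by D_L."""
    if len(sx) != len(sy):
        raise ValueError("sequences must have the same length")
    weights = decay_weights(len(sx))
    d_l = sum(weights)  # Eq. (5): equals (1 - e^-L) / (1 - e^-1)
    return sum(w * df_dist(a, b) for w, a, b in zip(weights, sx, sy)) / d_l
```

With these definitions, two sequences that differ only in their most recent frame receive a larger WLSDist than LSDist, which is the intended effect of the time-decay weighting.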
By formulating the calculation of LTI with matrices, we can see that the calculation can be performed efficiently in parallel. The Degree of Parallelism (DoP) of its calculation can be higher than L. This is because the DoP for calculating the L terms in Eq. (6) can evidently be L (the calculation of every term is independent of the others). The calculation of each term can be further accelerated (including the calculations of WLSDist and DFDist) by parallelizing the matrix multiplication. For example, with a number of L × L processes (i.e., a grid of processes) and exploiting the Scalable Universal Matrix Multiplication Algorithm (SUMMA) [47], we can achieve a roughly $L^2$ speedup in the multiplication of any two matrices with the dimension size of L, which helps reduce the time complexity of computing $N_1 DF$ from $O(L^3)$ to $O(L)$. Further, with the resulting $N_1 DF$ the computation of $N_1 DF\,T$ and $P N_2$ can be performed in parallel as both of them are vector-matrix multiplications requiring only L processes and have time complexity of $O(L^2/L) = O(L)$. Finally, multiplying the resulting matrices of $P N_2$ (dimension = 1 × L) and $N_1 DF\,T$ (dimension = L × 1) consumes $O(L)$. Note that the matrix DF contains L × L entries of frame distance, each of which is calculated using Eq. (1). Therefore, updating DF (when a new frame arrives) is an operation with the complexity of $O(L^2 m / L^2) = O(m)$, where m is the frame dimension. Consequently, the time complexity of computing LTI(t) in parallel is $O(m + L)$ in theory.

IV. ANOMALY DETECTION WITH LTI

Our anomaly detection scheme is based on LTI (Local Trend Inconsistency) as LTI can effectively indicate how significantly the series deviates locally from the common expectation established by multi-source prediction.

As can be seen from Eq. (6), there are still two problems to be solved in calculating LTI. First, a mechanism is required to make reliable predictions of local sequences. Second, we need an algorithm to quantify the probabilistic factors (in matrix P) as they are not known a priori.

In this section, we first introduce the backbone model we build for achieving accurate frame-to-sequence forecasting. The model is designed to learn the complex patterns in multi-seasonal time series with tolerance to pollution in the training data. Then we illustrate how to make use of the predictions (from multiple source frames) to compute LTI. Finally, we propose an anomaly scoring algorithm that uses a scoring function to chronologically calculate the anomaly probability for each frame based on LTI.

A. Prediction Model

To effectively learn and accurately predict local sequences in multi-seasonal time series, we adopt a combinatorial backbone model composed of a decomposition module and an inference module.

A Recurrent Neural Network (RNN) is an ideal network to implement the inference module of our prediction model. RNNs (including variants such as Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU)) are usually applied as end-to-end models (e.g., [26][27]). However, a major limitation of them is the difficulty in learning complex seasonal patterns in multi-seasonal time series. Even though the accuracy may be improved by stacking more hidden layers and increasing the back propagation distance (through time) during training, it could cause prohibitive training cost.

In view of this, we propose to include the seasonal features of the input data explicitly as the input of the neural network. This is achieved by conducting time series decomposition before running the prediction model, which is the purpose of the decomposition module. The resulting seasonal features can be regarded as the outcome of feature engineering. Technically speaking, seasonal features are essentially the "seasonal terms" decomposed from each channel of the target time series. We use Prophet [29], a framework based on the decomposable time series model [28], to extract the channel-wise seasonal terms. Let $X^{(c)}$ denote the c-th channel of time series X, and $X^{(c)}(t)$ the t-th record of the channel. The outcome of time series decomposition for channel c is formulated as below:

$$X^{(c)}(t) = g_c(t) + s_c(t) + h_c(t) + \epsilon \tag{9}$$

where $g_c(t)$ is the trend term that models non-periodic changes, $s_c(t)$ represents the seasonal term that quantifies the seasonal effects, $h_c(t)$ reflects the effects of special occasions such as holidays, and $\epsilon$ is the error term that is not accommodated by the model. For simplicity, in this paper we only consider daily and weekly seasonal terms as additional features for the inference module of our model. Prophet relies on Fourier series to model multi-period seasonality, which enables the flexible approximation of periodic patterns of arbitrary length. The underlying details can be found in [29].

Separating seasonal terms from original frame values and using them as additional features improves the RNN from the following perspectives. First, explicit input of seasonal terms helps reduce the difficulty of learning complex seasonal terms in the RNN. The extracted seasonal terms quantify seasonal effects. Second, the time cost of training is expected to decrease as we can apply Truncated Back Propagation Through Time (TBPTT) with a distance much shorter than the length of periodicity. Besides, the series decomposition process is very efficient, which will be demonstrated later by experiments. The top part of Fig. 2 shows the architecture of our backbone prediction model. In the prediction model, a stacked GRU network is implemented as the inference module, which takes as input the raw features of a frame concatenated with its seasonality features. We demonstrate the effectiveness of this backbone model in Section V-A.

B. Computing LTI based on Predictions

When we calculate Local Trend Inconsistency (LTI) in Eq. (6), we are actually measuring the distance between a local sequence and an ensemble of its predictions by a well trained backbone prediction model. The workflow of our on-line anomaly detection method includes three main steps: i) feed every arriving frame into the prediction model and continuously gather its output of predicting future frames, ii) organize the frame predictions by their sources (i.e., the frames
which made the forecast) and concatenate them into local sequences, and iii) compute LTI of the newly arrived frame according to Eq. (6). Fig. 2 demonstrates the entire process, in which LTI of a frame is converted to a score of abnormality using the algorithms to be introduced later.

C. Anomaly Scoring
In theory, the values of LTI(t) can be directly used to score frame t in terms of its abnormality. However, the range of this metric is application-specific. So we further develop a measure that can represent the probability of data anomaly. Specifically, we define a logistic mapping function to convert the value of LTI(t) to a probabilistic value:

\[ \Phi(x) = \frac{1}{1 + e^{-k(x - x_0)}} \tag{10} \]

where k is the logistic growth rate and x0 the x-value of the function's midpoint.
The left part of Fig. 3 shows the shapes of Φ(·) with different values of k when x0 is set to 0.5. The shape of Φ(·) becomes steeper as k increases. We will introduce how to determine the optimal values of k and x0 later.
Now we define the probabilistic anomaly score of frame t as below:

\[ AS(t) = \Phi(LTI(t)) \tag{11} \]

There are three reasons for using Eq. (10) to map LTI(t) to AS(t). First, we find that the LTI(t) values are clustered together closely (top right of Fig. 3), which means that the difference in LTI(t) values between normal and abnormal frames is not significant. This makes it difficult to differentiate them in practice although we can do so in theory. The right part of Fig. 3 illustrates the situation where we map raw LTI(t) values to AS(t). It can be seen from the figure that the anomaly scores are better dispersed, leaving a clearer divide between normal data and (potential) anomalies. For example, the red line we draw separates out roughly 10 percent of potential anomalies with high scores. Second, as discussed in the previous section, our scheme makes a series of forecasts from different sources for the target, which establishes a common expectation for the target. The challenge is that there may exist anomalous sources, from which the forecast made is unreliable. Thus we have to differentiate the quality of the predictions by specifying large weights (i.e., the P(i) in Eq. (6)) for normal sources and small weights for the sources that are likely to be abnormal. With the function Φ(·) dispersing the LTI(t) values (by mapping them into AS(t)), the difference in impact between normal and abnormal frames is magnified. Last but not least, we find that the actual values of LTI(t) depend on the particular application that our detection scheme is applied to. After mapping, the AS(t) values become less application-dependent, making it possible to set a universal anomaly threshold. This is similar to the scenario of determining the unusual events if the samples follow the normal distribution: the values lying beyond two standard deviations from the mean are often regarded as unusual.

Considering the second reason discussed above, we replace P(i) in Eq. (6) with 1 − AS(i) where i = t − L, t − L + 1, ..., t − 1. Consequently, LTI(t) is reformulated as:

\[ LTI(t) = \frac{1}{Z_t} \sum_{i=t-L}^{t-1} \big(1 - AS(i)\big) \cdot WLSDist\big(S(i+1, t),\, S_i(i+1, t)\big) \tag{12} \]

where Zt is the normalization factor, reformulated as \( Z_t = \sum_{i=t-L}^{t-1} (1 - AS(i)) \), and 1 − AS(i) represents the probability that frame i is normal.
The function Φ(·) contains two parameters, k and x0. The values of these two parameters need to be set before the function can be used to calculate the anomaly score. Since x0 is supposed to be the midpoint of x, we set x0 to be mean(LTI). We set k to c/stdev(LTI) (stdev(LTI) is the standard deviation of LTI, and c is a constant multiplier). The purpose of the mapping function is to disperse the LTI values that are densely clustered. On the one hand, the standard deviation stdev(LTI) can be used to represent how densely the LTI values reside around the mean. The lower the value of stdev(LTI), the more closely the LTI values are clustered. On the other hand, k represents how steep the middle slope of the logistic mapping function is. The greater k is, the steeper the logistic mapping function is. The more densely clustered the LTI values are, the steeper the logistic function needs to be in order to disperse those values. Therefore, for a set of LTI values with lower deviation, a bigger value should be set for k.
Instead of setting the values of k and x0 manually, we propose an automated approach in this work to determine their values. More specifically, we design an iterative algorithm. The algorithm runs on a reference time series which is a portion of the training data. The algorithm is outlined in Algorithm 1.

Algorithm 1: Iterative procedure for unparameterizing Φ(·)
Input : prediction span L, reference series length r, predicted local sequences S_i(i + 1, i + L) for i ∈ [0, r − 1]
Output: k, x0
k ← 1.0, x0 ← 0.5
AS(i) ← 0 for all i ∈ [0, r − 1]
while convergence criterion is not satisfied do
    for t ← L to r − 1 do
        compute LTI(t) via Eq. (12)
        compute AS(t) via Eq. (11)
    end
    k ← c / stdev(LTI), x0 ← mean(LTI)
end

In Algorithm 1, parameters k and x0 are initially set to 1.0 and 0.5, respectively. Note that it does not matter much what the initial values of k and x0 are. When Algorithm 1 is run on the reference time series, LTI for each frame of the reference series is calculated. The values of k and x0 will converge to c/stdev(LTI) and mean(LTI) eventually. In the
Fig. 2. An overview of the proposed prediction-driven anomaly detection framework for the time series, which uses a seasonality augmented GRU network
as the backbone model to support the abnormality scoring based on Local Trend Inconsistency (LTI).
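The scoring stage of this framework, that is, the logistic mapping of Eq. (10), the score-weighted LTI of Eq. (12) and the iterative procedure of Algorithm 1, can be illustrated with a minimal sketch. Several things here are assumptions rather than part of the paper: the WLSDist values are taken as precomputed and stored in a hypothetical array `dist`, where `dist[t, j]` is the distance for source frame i = t − L + j; the multiplier `c = 3.0` is a placeholder; and the convergence test (relative change of k and absolute change of x0 below `tol`) is one possible choice, since the criterion is not specified in the text.

```python
import numpy as np

def logistic(x, k, x0):
    """Eq. (10): map an LTI value into (0, 1). The exponent is clipped
    only to avoid floating-point overflow warnings."""
    z = np.clip(-k * (x - x0), -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(z))

def lti_at(t, dist, scores, L):
    """Eq. (12): distances from the L sources t-L..t-1, weighted by 1 - AS(i)."""
    weights = 1.0 - scores[t - L:t]
    z = weights.sum()
    if z <= 0.0:
        # Assumption: if every source looks anomalous, fall back to a plain mean.
        return float(dist[t].mean())
    return float(np.dot(weights, dist[t]) / z)

def fit_logistic_params(dist, L, c=3.0, tol=1e-4, max_iter=50):
    """Sketch of Algorithm 1 on a reference series of length r = dist.shape[0]."""
    r = dist.shape[0]
    k, x0 = 1.0, 0.5                 # initial values used in Algorithm 1
    scores = np.zeros(r)             # AS(i) <- 0 for all i
    lti = np.zeros(r)
    for _ in range(max_iter):
        for t in range(L, r):
            lti[t] = lti_at(t, dist, scores, L)
            scores[t] = logistic(lti[t], k, x0)   # Eq. (11)
        new_k = c / (lti[L:].std() + 1e-12)       # guard against zero deviation
        new_x0 = lti[L:].mean()
        converged = (abs(new_k - k) <= tol * max(1.0, abs(k))
                     and abs(new_x0 - x0) <= tol)
        k, x0 = new_k, new_x0
        if converged:
            break
    return k, x0, lti, scores
```

Because AS(t) is written back inside the inner loop, the weights used for later frames in the same pass already reflect the scores of earlier frames, which mirrors the sequential order of Algorithm 1.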
We set up our experiments on a machine equipped with a dual-core CPU (model: Intel Core i5-8500, 3.00 GHz), a GPU (model: GTX 1050Ti) and 32GB memory. The inference module of our backbone model is implemented on the PyTorch platform (version: 1.0.1) and the decomposition module is implemented using Prophet (version: 0.4) released by Facebook. We select three datasets for evaluation. CalIt2 and Dodgers Loop Sensor are two public datasets published by the University of California, Irvine (UCI) and available in the UCI machine learning repository. The third dataset is from the private production environment of a cyber-security company, which is the collaborator of this project. This dataset collects the server logs from a number of clusters (owned by other third-party enterprises) on a regular basis. The dataset is referred to as the Server Log dataset in this paper.

CalIt2 Dataset
CalIt2 is a multivariate time series dataset containing 10080 observations of two data streams corresponding to the counts of in-flow and out-flow of a building on the UCI campus. The purpose is to detect the presence of an event such as a conference or seminar held in the building. The timestamps are contained in the dataset. The original data span 15 weeks (2520 hours) and are aggregated half-hourly. We truncated the last 120 hours and converted the remaining 2400 hours of data to an hourly aggregation. The CalIt2 dataset is provided with annotations that label the date, start time and end time of events over the entire period. There are 115 anomalous frames (4.56% contamination ratio) in total. In our experiment, labels are omitted during training (because our prediction model forecasts local sequences of frames) and are only used for evaluating the detection results.

Server Log Dataset
The Server Log dataset is a multi-channel time series with a fixed interval between two consecutive frames. The dataset spans from June 29th to September 4th, 2018 (1620 hours in total). The raw data is provided to us in the form of separate log files, each of which stores the counts of a Linux server event on an hourly basis. The log files record the invocations of five different processes, which include CROND, RSYSLOGD, SESSION, SSHD and SU. Each process represents a channel of observing the server. We pre-processed the data by aggregating all the files to form a five-channel time series. Fig. 4 shows the time series of all five channels.
Currently, the company relies on security technicians to observe the time series and spot potential anomalies, which might be caused by security attacks. The aim of this project is to develop an automated method to spot the potential anomalies and quantify them in real time as the process invocations are being logged in the server. Anomalous events such as external cyber attacks exist in the Server Log dataset, but the labels are not available. We acquired manual annotations for the test set from the technicians in the company. In total, 76 frames are labeled as anomalies in the test set, equivalent to a contamination ratio of 14.6%.

Fig. 4. The Server Log time series dataset

Dodgers Loop Sensor Dataset
Dodgers Loop Sensor is also a public dataset available in the UCI data repository. The data were collected at the Glendale on-ramp for the 101 North freeway in Los Angeles. The sensor is close enough to the stadium for detecting unusual traffic after a Dodgers game, but not so close, or so heavily used by game traffic, that the extra traffic is overly obvious. Traffic observations were taken over 25 weeks (from Apr. 10 to Oct. 01, 2005) with date and timestamps provided for both data records and events (i.e., the start and end time of games). The raw dataset contains 50400 records in total. We pre-processed the data to make it an hourly time series dataset.

A. Evaluating Backbone Model
We trained our prediction model on the datasets separately to evaluate its accuracy as well as the impact of seasonal terms extracted by the decomposition module. We split the datasets into training, validation and test sets. On CalIt2, the first 1900 frames were used for training and the following 500 for testing. On the Server Log dataset, 1100 frames were used for training and 520 for testing. On Dodgers Loop, 3000 records were used for training and 1000 for testing. 300, 300, and 500 frames were used for validation on CalIt2, Server Log and Dodgers Loop, respectively.
The proposed model uses Prophet to implement the decomposition module and a stacked GRU network to implement the prediction module. We extracted daily and weekly terms for each channel. More specifically, for each channel we generated two mapping lists after fitting the data by Prophet. One list contains the readings at each of 24 hours in a day, while the other list includes the readings at each of 7 days in a week. Fig. 5 shows an example of the mapping lists.
The values of seasonal terms are different for CalIt2, Server Log and Dodgers datasets, but the resulting mapping lists share the same format as the example shown in Fig. 5.
Based on the mapping lists and the timestamp field provided in the data we build our prediction network with seasonal features as additional input. Table II shows the network structures adopted for each of the datasets, where L is the maximum length of local sequences as a hyper-parameter. tanh is used as the activation function and Mean Square Error (MSE) loss as the loss function. Dropout is not enabled and
TABLE III
COMPARING GRU+ST (THE PROPOSED BACKBONE MODEL AUGMENTED WITH SEASONAL FEATURES) WITH THE VANILLA GRU IN ACCURACY, WHICH
IS INDICATED BY THE LOWEST TEST MSE (MEAN SQUARE ERROR) ACHIEVED UNDER DIFFERENT TRAINING SETTINGS OF time steps (ts). IN EACH
GROUP OF COMPARISON, BOTH MODELS HAVE CONVERGED AND TRAINED FOR THE SAME NUMBER OF EPOCHS.
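To make the "seasonal features as additional input" of GRU+ST more concrete, the sketch below looks up a daily and a weekly term for each timestamp and appends them to the raw readings. It is only an illustration under stated assumptions: the two mapping lists (`daily_map` with 24 entries, `weekly_map` with 7 entries) are taken as given, whereas the paper obtains them by fitting Prophet; a single shared pair of lists is used instead of one pair per channel; and the function names are hypothetical.

```python
from datetime import datetime, timedelta
import numpy as np

def seasonal_features(timestamps, daily_map, weekly_map):
    """Look up the daily (by hour) and weekly (by weekday) terms per timestamp."""
    daily = np.array([daily_map[ts.hour] for ts in timestamps], dtype=float)
    weekly = np.array([weekly_map[ts.weekday()] for ts in timestamps], dtype=float)
    return np.stack([daily, weekly], axis=1)          # shape: (n_frames, 2)

def augment(values, timestamps, daily_map, weekly_map):
    """Concatenate raw readings with their seasonal terms along the feature axis."""
    values = np.asarray(values, dtype=float).reshape(len(timestamps), -1)
    seasonal = seasonal_features(timestamps, daily_map, weekly_map)
    return np.concatenate([values, seasonal], axis=1)

# Example with placeholder mapping lists (not values from the paper).
start = datetime(2018, 6, 29, 0)
stamps = [start + timedelta(hours=h) for h in range(48)]
daily_map = [float(h) for h in range(24)]
weekly_map = [10.0 * d for d in range(7)]             # Monday = index 0
frames = augment(np.zeros(48), stamps, daily_map, weekly_map)
```

Each row of `frames` then holds the original channel values followed by the two seasonal terms, which is the kind of input a recurrent network could consume frame by frame.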
but fails to spot most of the anomaly frames on the Server Log
dataset. The Piecewise method misses a lot of anomalies, while
the iForest method tends to mistakenly label a large portion
of normal data as anomalies. LSTM-AD produced results
close to those of our method on the Dodgers dataset, but raised a
large number of false alarms on the other two datasets. To give a
more intuitive view, we plot the ROC curves of AD-LTI and
the baseline algorithms on the test data in Fig. 9, Fig. 10 and
Fig. 11.
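For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how ROC points and the corresponding AUC can be obtained from anomaly scores and ground-truth labels. It uses NumPy only, and the function names are illustrative rather than taken from the paper's code; tied scores are treated as a single threshold.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over the scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")        # highest score first
    labels, scores = labels[order], scores[order]
    tp = np.cumsum(labels)                            # true positives so far
    fp = np.cumsum(~labels)                           # false positives so far
    # Keep the last index of each run of equal scores (one point per threshold).
    last = np.r_[np.flatnonzero(np.diff(scores)), len(scores) - 1]
    tpr = np.r_[0.0, tp[last] / labels.sum()]
    fpr = np.r_[0.0, fp[last] / (~labels).sum()]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A curve that hugs the top-left corner yields an AUC close to 1, while a detector that scores frames at random is expected to give about 0.5.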
Fig. 7. Heatmaps of detection decisions made by AD-LTI and baseline algorithms compared with the ground truth on Server Log dataset
Fig. 8. Heatmaps of decision results by AD-LTI and baseline algorithms compared with the ground truth on Dodgers Loop dataset
Fig. 10. ROC curves of anomaly detection algorithms on Server Log dataset
From the ROC curves we can observe that AD-LTI produced the most reliable decisions as its curve is the closest to the top-left corner for all of the three datasets, especially on the Server Log dataset (see Fig. 10), which features complex seasonality in each channel. Detection on the Server Log dataset appears to be harder for the other existing algorithms (the reason is explained later): none of them achieves a high true positive rate at a low false positive rate. We further calculate the corresponding AUC for each algorithm on each dataset. The resulting AUC values are shown in Table IV.
As shown in Table IV, AD-LTI achieves the highest AUC values of 0.93, 0.977 and 0.923 on the CalIt2, Server Log and Dodgers Loop datasets, respectively. On CalIt2, the AUC values of the baseline algorithms are between 0.8 and 0.9 with the only exception of OCSVM when nu is set to 0.05, approximately the actual anomaly rate for CalIt2 (0.046, precisely). This to some degree indicates that OCSVM is sensitive
TABLE V
AUC VALUES AND DETECTION OVERHEADS (IN MS PER FRAME) USING LSTM-AD AND AD-LTI UNDER DIFFERENT SETTINGS OF PROBE LENGTH L.
BOTH METHODS USE MULTIPLE FORECASTS WITH EACH FRAME BEING PREDICTED FOR L TIMES.
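The statement that each frame is predicted L times, and the per-frame overhead reported in the table, can be illustrated with a small streaming sketch. It assumes a hypothetical predictor interface `predict(history, L)` that returns the next L frames; the stand-in `naive_forecast` simply repeats the last observed frame and is not the backbone model of the paper. Timing uses `time.perf_counter`, and the measured figures depend entirely on the predictor plugged in.

```python
import time
from collections import deque
import numpy as np

def naive_forecast(history, L):
    """Stand-in predictor (assumption): repeat the last observed frame L times."""
    return np.repeat(history[-1:], L, axis=0)

def stream_forecasts(series, L, predict=naive_forecast):
    """Yield (t, forecasts, elapsed_ms) for each arriving frame t.

    forecasts[j] is the prediction of frame t made by source frame t - L + j,
    or None while fewer than L sources are available.
    """
    pending = deque(maxlen=L)              # (source index, its L-step forecast)
    for t in range(len(series)):
        start = time.perf_counter()
        if len(pending) == L:
            # Source i predicted frames i+1..i+L, so frame t sits at offset t-i-1.
            forecasts = np.stack([f[t - i - 1] for i, f in pending])
        else:
            forecasts = None
        pending.append((t, predict(series[: t + 1], L)))
        elapsed_ms = (time.perf_counter() - start) * 1e3
        yield t, forecasts, elapsed_ms
```

The stacked `forecasts` array is the ensemble that a distance such as the one in Eq. (12) would compare against the observed frame, and averaging `elapsed_ms` over a run gives a rough per-frame overhead for a given L.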
[2] Frisch, R., & Waugh, F. V. (1933). Partial time regressions as compared with individual trends. Econometrica: Journal of the Econometric Society, 387-401.
[3] Seiwell, H. R. (1949). The principles of time series analyses applied to ocean wave data. Proceedings of the National Academy of Sciences of the United States of America, 35(9), 518.
[4] Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., & Platt, J. C. (2000). Support vector method for novelty detection. Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS’99), pp. 582-588.
[5] Zhang, R., Zhang, S., Lan, Y., & Jiang, J. (2008). Network anomaly detection using one class support vector machine. In Proceedings of the International MultiConference of Engineers and Computer Scientists (Vol. 1).
[6] Maglaras, L. A., & Jiang, J. (2014, August). Ocsvm model combined with k-means recursive clustering for intrusion detection in scada systems. In 10th International conference on heterogeneous networking for quality, reliability, security and robustness (pp. 133-134). IEEE.
[7] Shang, W., Zeng, P., Wan, M., Li, L., & An, P. (2016). Intrusion detection algorithm based on OCSVM in industrial control system. Security and Communication Networks, 9(10), 1040-1049.
[8] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1).
[9] Radovanović, M., Nanopoulos, A., & Ivanović, M. (2014). Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE transactions on knowledge and data engineering, 27(5), 1369-1382.
[10] Calheiros, R. N., Ramamohanarao, K., Buyya, R., Leckie, C., & Versteeg, S. (2017). On the effectiveness of isolation-based anomaly detection in cloud data centers. Concurrency and Computation: Practice and Experience, 29(2017)e4169. doi: 10.1002/cpe.4169
[11] Chan, P. K., & Mahoney, M. V. (2005, November). Modeling multiple time series for anomaly detection. In Fifth IEEE International Conference on Data Mining (ICDM’05) (pp. 8-pp). IEEE.
[12] Ye, L., & Keogh, E. (2009, June). Time series shapelets: a new primitive for data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 947-956). ACM.
[13] Zakaria, J., Mueen, A., & Keogh, E. (2012, December). Clustering time series using unsupervised-shapelets. In 2012 IEEE 12th International Conference on Data Mining (pp. 785-794). IEEE.
[14] Yeh, C. C. M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H. A., ... & Keogh, E. (2018). Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Mining and Knowledge Discovery, 32(1), 83-123.
[15] Hou, L., Kwok, J. T., & Zurada, J. M. (2016, February). Efficient learning of time series shapelets. In 13th AAAI Conference on Artificial Intelligence.
[16] Gu, Z., He, L., Chang, C., Sun, J., Chen, H., & Huang, C. (2017). Developing an efficient pattern discovery method for CPU utilizations of computers. International Journal of Parallel Programming, 45(4), 853-878.
[17] Zhu, H., Gu, Z., Zhao, H., Chen, K., Li, C. T., & He, L. (2018). Developing a pattern discovery method in time series data and its GPU acceleration. Big Data Mining and Analytics, 1(4), 266-283.
[18] Wei, L., Kumar, N., Lolla, V. N., Keogh, E. J., Lonardi, S., & Chotirat (Ann) Ratanamahatana. (2005, June). Assumption-Free Anomaly Detection in Time Series. In SSDBM (Vol. 5, pp. 237-242).
[19] Huang, T., Zhu, Y., Wu, Y., Bressan, S., & Dobbie, G. (2016). Anomaly detection and identification scheme for VM live migration in cloud infrastructure. Future Generation Computer Systems, 56, 736-745.
[20] Hyndman, R. J., Wang, E., & Laptev, N. (2015, November). Large-scale unusual time series detection. In 2015 IEEE international conference on data mining workshop (ICDMW) (pp. 1616-1619). IEEE.
[21] Li, J., Pedrycz, W., & Jamal, I. (2017). Multivariate time series anomaly detection: A framework of Hidden Markov Models. Applied Soft Computing, 60, 229-240.
[22] Chauhan, Sucheta and Vig, Lovekesh. Anomaly detection in ECG time signals via deep long short-term memory networks. In Data Science and Advanced Analytics (DSAA), 2015. 36678 2015. IEEE International Conference on, pp. 1–7. IEEE, 2015.
[23] Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., & Shroff, G. (2016). LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148.
[24] Ahmad, S., Lavin, A., Purdy, S., & Agha, Z. (2017). Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262, 134-147.
[25] Pascanu, R., Mikolov, T., & Bengio, Y. (2013, February). On the difficulty of training recurrent neural networks. In International conference on machine learning (pp. 1310-1318).
[26] Tang, X. (2019). Large-Scale Computing Systems Workload Prediction Using Parallel Improved LSTM Neural Network. IEEE Access, 7, 40525-40533.
[27] Chen, S., Li, B., Cao, J., & Mao, B. (2018). Research on Agricultural Environment Prediction Based on Deep Learning. Procedia computer science, 139, 33-40.
[28] Harvey, A. & Peters, S. (1990), Estimation procedures for structural time series models, Journal of Forecasting, Vol. 9, 89-108.
[29] Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37-45.
[30] Kingma, D. and Ba, J. (2015) Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
[31] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830.
[32] Vallis, O., Hochenbaum, J., & Kejariwal, A. (2014). A novel technique for long-term anomaly detection in the cloud. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14).
[33] P. Filonov, A. Lavrentyev, A. Vorontsov, Multivariate Industrial Time Series with Cyber-Attack Simulation: Fault Detection Using an LSTM-based Predictive Data Model, NIPS Time Series Workshop 2016, Barcelona, Spain, 2016.
[34] Malhotra, P., Vig, L., Shroff, G., & Agarwal, P. (2015, April). Long short term memory networks for anomaly detection in time series. In Proceedings of European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 15’), pp. 89-94.
[35] Shi, H., Yang, J., Ding, M., & Wang, J. (2011). A short-term wind power prediction method based on wavelet decomposition and BP neural network. Automation of Electric Power Systems, 35(16), 44-48.
[36] Gould, P. G., Koehler, A. B., Ord, J. K., Snyder, R. D., Hyndman, R. J., & Vahid-Araghi, F. (2008). Forecasting time series with multiple seasonal patterns. European Journal of Operational Research, 191(1), 207-222.
[37] De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496), 1513-1527.
[38] Hodge, V., & Austin, J. (2004). A survey of outlier detection method-
ologies. Artificial intelligence review, 22(2), 85-126.
[39] Janssens, O., Slavkovikj, V., Vervisch, B., Stockman, K., Loccufier,
M., Verstockt, S., ... & Van Hoecke, S. (2016). Convolutional neural
network based fault detection for rotating machinery. Journal of Sound
and Vibration, 377, 331-345.
[40] Ince, T., Kiranyaz, S., Eren, L., Askar, M., & Gabbouj, M. (2016). Real-
time motor fault detection by 1-D convolutional neural networks. IEEE
Transactions on Industrial Electronics, 63(11), 7067-7075.
[41] Sabokrou, M., Fayyaz, M., Fathy, M., Moayed, Z., & Klette, R. (2018).
Deep-anomaly: Fully convolutional neural network for fast anomaly de-
tection in crowded scenes. Computer Vision and Image Understanding,
172, 88-97.
[42] Zheng, Y., Liu, Q., Chen, E., Ge, Y., & Zhao, J. L. (2014, June).
Time series classification using multi-channels deep convolutional neural
networks. In International Conference on Web-Age Information Man-
agement (pp. 298-310). Springer, Cham.
[43] Rajan, J. J., & Rayner, P. J. (1995). Unsupervised time series classifi-
cation. Signal processing, 46(1), 57-74.
[44] Längkvist, M., Karlsson, L., & Loutfi, A. (2014). A review of unsu-
pervised feature learning and deep learning for time-series modeling.
Pattern Recognition Letters, 42, 11-24.
[45] Ahmed, M., Mahmood, A. N., & Hu, J. (2016). A survey of network
anomaly detection techniques. Journal of Network and Computer Ap-
plications, 60, 19-31.
[46] Holt, C. C. (2004). Forecasting seasonals and trends by exponentially
weighted moving averages. International journal of forecasting, 20(1),
5-10.
[47] Van De Geijn, R. A., & Watts, J. (1997). SUMMA: Scalable universal
matrix multiplication algorithm. Concurrency: Practice and Experience,
9(4), 255-274.
[48] Chen, J., Li, K., Deng, Q., Li, K., & Philip, S. Y. (2019). Distributed
Deep Learning Model for Intelligent Video Surveillance Systems with
Edge Computing. IEEE Transactions on Industrial Informatics.
[49] Chen, J., Li, K., Bilal, K., Metwally, A. A., Li, K., & Yu, P. (2018).
Parallel protein community detection in large-scale PPI networks based
on multi-source learning. IEEE/ACM transactions on computational
biology and bioinformatics.
[50] Chen, J., Li, K., Bilal, K., Li, K., & Philip, S. Y. (2018). A bi-
layered parallel training architecture for large-scale convolutional neural
networks. IEEE transactions on parallel and distributed systems, 30(5),
965-976.
[51] Duan, M., Li, K., Liao, X., & Li, K. (2017). A parallel multiclassifi-
cation algorithm for big data using an extreme learning machine. IEEE
transactions on neural networks and learning systems, 29(6), 2337-2351.
[52] Chen, C., Li, K., Ouyang, A., Tang, Z., & Li, K. (2017). Gpu-accelerated
parallel hierarchical extreme learning machine on flink for big data.
IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(10),
2740-2753.