A Spatiotemporal Deep Learning Approach for Unsupervised Anomaly Detection in Cloud Systems
researchers' attention in recent years. Recent research [3], [12]–[14] discovers that a variational auto-encoder (VAE) [15] shows superiority for the task of anomaly detection. Therefore, we develop an anomaly detector based on the design of a VAE. Beyond the advantages declared in the previous literature, we also demonstrate its effectiveness for training on contaminated data (i.e., data including both normal and abnormal observations) without reliance on labels.

In this article, we propose a topology-aware multivariate time series anomaly detector (TopoMAD), which combines graph neural networks [16], [17], long short-term memory (LSTM) [18], [19], and VAE [15] to perform unsupervised anomaly detection for a cloud system. We evaluate our model using metrics collected from an environment running big data batching systems, such as Hadoop and Spark, with fault injections, and metrics collected from a microservice-based application, where faults are injected occasionally.

The contributions of TopoMAD are summarized as follows.

1) TopoMAD introduces an unsupervised anomaly detection approach which considers the topological information originating from a cloud system. We combine this topological information with metrics collected from the cloud system to construct a graph-based representation for anomaly detection.

2) TopoMAD glues graph neural networks and LSTM together as the basic structure of a VAE to perform anomaly detection on a topological time series. We make use of state-of-the-art graph neural networks, such as the graph convolution network (GCN) [16] and the graph attention network (GAT) [17], to extract information from the metrics organized in a predefined topology, together with LSTM to extract information from sliding windows over time. The spatiotemporal information extracted by graph neural networks together with LSTM helps improve anomaly detection performance in cloud systems.

3) TopoMAD makes use of VAE [15], a stochastic model, to perform anomaly detection for a cloud system in a fully unsupervised way. Instead of making an assumption about the data we use, we train our model on all data we collect, whether normal or abnormal. Moreover, we propose an unsupervised threshold selection method and examine this method based on our model.

4) We open-sourced two data sets,1 including metrics collected from an environment running big data batching systems and metrics collected from a running microservice system. These data sets can be used to reproduce the results of this article or to motivate further research.

1 https://github.com/QAZASDEDC/TopoMAD

This article is organized as follows. The related work is introduced in Section II. In Section III, we present the preliminaries of each main technique in TopoMAD. Section IV outlines and specifies our approach. In Section V, we evaluate our approach using data collected from two application scenarios. Section VI concludes this article.

II. RELATED WORK

A. Unsupervised Anomaly Detectors

Numerous methods for unsupervised anomaly detection have been developed in the past years. In this section, we categorize them into traditional approaches and deep learning-based approaches.

Traditional approaches [20] include statistical approaches, such as Gaussian-based models, classification-based approaches, such as one-class support vector machines (OC-SVMs) [21], nearest neighbor-based approaches, such as the local outlier factor (LOF) [22], and so on. Usually, a traditional anomaly detector is associated with an assumption about the normal data, and its effectiveness degrades when the data cannot fit the corresponding assumption.

Compared with traditional approaches, deep learning-based models are better at modeling complex dependence in data and thus gain a lot of attention. For example, a deep auto-encoder [23] can be used to perform dimension reduction by utilizing its multiple nonlinear transformations. For anomaly detection, the reconstruction error of an observation is used as its anomaly score. For anomaly detection on time series, Donut [3] leverages VAE [15] to model the reconstruction probability [13] of univariate time series and performs anomaly detection based on this measure. LSTM-AD [8] learns a prediction model using stacked LSTM networks, and the prediction error is used as the measure of an anomaly. LSTM-ED [9] proposes an LSTM-based seq2seq auto-encoder that learns to reconstruct normal multivariate time series and uses the reconstruction error to detect anomalies. Deep structured energy-based models (DSEBMs) [10] connect an EBM with a regularized auto-encoder to model the data distribution, and the energy score or the reconstruction error is used to perform anomaly detection; for sequential data, a recurrent formulation of EBMs can be employed. Though some of these approaches have tackled the problem of modeling complex temporal dependence for time series, we still need an approach to effectively model the spatiotemporal dependence of data that are collected continuously from a specific topology.

B. Anomaly Detection in Cloud Systems

In Section II-A, we have given a brief introduction to general unsupervised anomaly detectors. Most of them have been used in cloud systems [3], [24], [25]. In this section, we focus on some other aspects of anomaly detection in cloud systems.

1) Anomaly Types: According to some work [26]–[31], anomalies in cloud systems can be categorized into two types: external impairments and internal application faults. External impairments refer to unexpected overloads, such as a memory hog induced by another co-located application, or infrastructure failures, such as disk failure, network disconnection, OS crash, and so on. Internal application faults denote anomalies caused by the misconfiguration or software bugs of the corresponding running application. Different causes can result in different symptoms. Multiple works [26], [29], [30] have been conducted on the analysis of failure characteristics in cloud systems. For example, Zhou et al. [29] investigate application faults in microservice systems with different root causes and the debugging practices for them.
2) Anomaly Indicators: There are multiple types of indicators which can reveal the state of a cloud system and further help detect and analyze anomalies. Examples of these types include metrics [3], [5], [32], [33], logs [34]–[37], and system calls [38], [39]. The anomaly detection process varies according to the type of indicator. In this article, we focus on anomaly detection in cloud systems with the use of system metrics. System metrics, such as resource utilization and I/O response time, are widely used as anomaly indicators in cloud systems and have proved their effectiveness in plenty of works [5], [28], [32], [40]. There also exist works which succeed in finding new effective anomaly indicators in specific scenarios, such as the queue lengths of microservices used in [40].

3) Temporal Model: Many indicators presented in Section II-B2 can be aggregated as time-series data by applying a sliding window, which helps reveal the system's local state in time. Some work [41] calculates statistical features (such as mean, standard deviation, and gradient) of the data in the sliding window as temporal features. There also exist more complex but more powerful temporal modeling techniques, such as time series decomposition [32], spectral residual [6], and the deep learning-based methods [3], [8]–[10] mentioned in Section II-A.

4) Spatial Model: Copious literature on system performance diagnosis involves constructing a graph-based representation of a system. Some of these works [11], [42]–[44] analyze metrics collected from a system in a pairwise way and construct an invariant graph which takes each metric as a node. On the other hand, some works [27], [45], [46] treat each system component as a node with multiple attributes and employ edges to represent the connectivity between system components. In this article, we prefer the latter way because analyzing metrics pairwise is time consuming and infeasible for real-time anomaly detection when more and more metrics are collected from cloud systems.

5) Threshold Selection: To apply an unsupervised anomaly detector in practice, a corresponding threshold is a necessary and influential hyperparameter, and threshold selection is integral to anomaly detection in cloud systems. Some works [11], [12], [47] have provided practices for threshold selection. For instance, peaks-over-threshold (POT) [12] utilizes extreme value theory [2] to perform automatic threshold selection. As for dynamic threshold calculation, nonparametric dynamic thresholding (NDT) [47] selects a threshold of the form μ(e_s) + zσ(e_s), with μ(e_s) and σ(e_s) the mean and standard deviation of anomaly scores over a sliding window and z chosen from a range between two and ten to maximize a predefined criterion.

C. Graph Neural Networks for Spatiotemporal Modeling

Learning patterns from spatiotemporal graphs is increasingly important in many applications. Many current approaches [48] apply graph neural networks together with RNNs and CNNs to simultaneously consider spatial and temporal relations. For instance, the diffusion convolutional recurrent neural network (RNN) [49] replaces the matrix multiplications in GRU [50] with a diffusion convolution operation to accomplish the task of traffic forecasting. CNN-GCN [51] instead adopts a complete convolutional structure which interleaves 1-D CNN with GCN. ST-GCN [52] extends a temporal flow as graph edges and then extracts both temporal and spatial features using a unified GCN to perform human action recognition. Inspired by these models, we develop TopoMAD to include and extract topological information with the help of graph neural networks to perform unsupervised anomaly detection in cloud systems.

III. PRELIMINARIES

A. Cloud System Topologies

To illustrate what topological information originates from the system, we first define the nodes and edges in a cloud system topology and expound their characteristics.

1) Nodes: We have mentioned in Section II-B that each system component is treated as a node in the cloud system topology. The definition of a system component can vary depending on the granularity of division. Examples of a coarse-grained division might be the role-based (e.g., master or slave) division in Hadoop and the service-based division in a microservice system, while examples of a fine-grained division might be the node-based division in Hadoop and the pod-based division in a Kubernetes-driven microservice system. In a fine-grained topology, several nodes (e.g., pods) that can be grouped into one node (e.g., a service) in a coarse-grained topology can share the same set of edges with similar behaviors.

2) Edges: Multiple relationships between system components can be chosen as an edge. For example, components located on the same physical machine can share an edge, components across which load is balanced can share an edge, components that have the same update setting can share an edge, and so on. Most of these relationships are dynamic. In this article, we define that there is an edge between two system components if they interact with each other. According to this definition, we can construct the topology in advance based upon tracing-based analysis [53], [54] or network traffic correlation-based analysis [27], [55]–[57]. These methods are important complements to our approach.

B. Problem Statement

We list the notations used in this article with their descriptions in Table I. Anomaly detection for a topological multivariate time series is to determine whether an observation X_t is an anomaly or not. We divide this objective into two steps. First, we calculate an anomaly score S(X_t | X_{t-W:t-1}, E) for X_t considering its recent history X_{t-W:t-1} and its topology E. Then, we obtain an anomaly result by comparing S(X_t | X_{t-W:t-1}, E) with a threshold τ, which is selected in an unsupervised manner. If the score is higher than τ, an alert will be triggered.
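The two-step formulation above fixes the interface of the detector. As a minimal illustration (not the TopoMAD implementation itself, which is specified in Section IV), the decision step can be sketched as follows; `anomaly_score` is a hypothetical stand-in for the learned scoring model.

```python
import numpy as np

def detect(anomaly_score, window, topology, tau):
    """Two-step detection for one observation X_t (illustrative only).

    window:   array of shape (W, N, F) holding X_{t-W+1:t}, the last row
              being X_t, for N components with F metrics each.
    topology: the edge set E of the cloud system.
    tau:      threshold selected in an unsupervised manner (Section IV-H).
    """
    score = anomaly_score(window, topology)   # step 1: S(X_t | X_{t-W:t-1}, E)
    return score > tau                        # step 2: trigger an alert if True
```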
TABLE I
NOTATIONS AND DESCRIPTIONS

C. Basics of Graph Neural Networks

Graph neural networks are modern deep learning techniques designed for graph data. They are helpful when we need to model the spatial dependence in metrics collected from different, connected components and to extract information from the topology. In this article, we apply two representative graph neural networks, namely, GCN [16] and GAT [17], as the basic layer in an LSTM cell to capture the spatial dependence in the topology. In the following, we introduce the layerwise propagation rules of these two techniques in detail.

GCN [16] proposes a graph convolution layer which accepts an N × F matrix X^{(l)} containing the features of each node and the adjacency matrix A of the topology as inputs. It propagates forward with the following rule:

X^{(l+1)} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(l)} W^{(l)}.   (1)

Here, \tilde{A} = A + I denotes the adjacency matrix with inserted self-loops, \tilde{D} is its diagonal degree matrix with \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}, and W^{(l)} \in \mathbb{R}^{F \times F'} is a layer-specific trainable weight matrix. An activation function can also be applied to the result X^{(l+1)} to introduce nonlinearity.

GAT [17] is similar to GCN. It utilizes attention mechanisms [58], [59] to fuse the neighboring nodes and then learns a hidden representation for each node in the graph. Given the input X^{(l)} = \{X_1^{(l)}, X_2^{(l)}, \ldots, X_N^{(l)}\}, the new hidden representations are computed as follows:

X_i^{(l+1)} = \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{k} W^{k} X_j^{(l)}   (2)

where the attention coefficients \alpha_{i,j} are computed as

\alpha_{i,j} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top} [W x_i \,\|\, W x_j]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(a^{\top} [W x_i \,\|\, W x_k]\big)\big)}.   (3)

Here, W \in \mathbb{R}^{F' \times F} and a \in \mathbb{R}^{2F'} are trainable parameters, K denotes the number of independent attention mechanisms executed, \mathcal{N}_i denotes the first-order neighbors of node i (including node i itself), and \| denotes concatenation. LeakyReLU(·) is an activation function. Nonlinearity can also be introduced here by applying an activation function to the output.

... update the states of all nodes with LSTM. Peng et al. [61] partition a document graph into two directed acyclic graphs and construct the LSTMs accordingly. In the constructed LSTM, different precedents of a unit are calculated via full parametrization or edge-type embedding and then summed to obtain the output, states, and gates. Jain et al. [62] parameterize a spatiotemporal graph with a factor graph [63] and then represent each factor with an RNN. Compared with these methods, we directly replace the linear layers in LSTM with state-of-the-art graph neural layers to perform spatiotemporal learning. A more detailed description is given in Section IV-D.

E. Basics of Variational Auto-Encoder

VAE [15] is a deep probabilistic graphical model. It is a powerful tool for modeling the relationship between observed variables x and their corresponding latent variables z of reduced dimension. VAE assumes that x can be generated through a process which first samples z from some prior distribution p_θ(z) and then samples x from p_θ(x|z), which is derived from a neural network with parameters θ. Since p_θ(z|x) is intractable, VAE introduces a recognition model q_φ(z|x) to approximate it. The parameters φ and θ can be trained by maximizing the variational lower bound L(θ, φ; x) in (4) on the marginal likelihood with the stochastic gradient variational Bayes estimator [15]

L(\theta, \phi; x) = -D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big].   (4)

A typical choice for p_θ(x|z) or q_φ(z|x) is a multivariate Gaussian with a diagonal covariance structure, whose mean and covariance matrix are produced by neural networks. In general, a standard normal distribution N(0, I) is chosen as the prior of z.

For the purpose of anomaly detection given a specific input x, the reconstruction probability E_{q_φ(z|x)}[log p_θ(x|z)] [13] is adopted and calculated using Monte Carlo integration [64] as follows:

\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}\big)   (5)

where L denotes the number of samples and z^{(l)}, l = 1, 2, \ldots, L, are sampled from q_φ(z|x).
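To make the propagation rule (1) concrete, the following is a minimal dense PyTorch sketch of one GCN step; it illustrates (1) only and is not the implementation used in TopoMAD (practical implementations typically rely on sparse graph libraries for large topologies).

```python
import torch

def gcn_layer(X, A, W):
    """One GCN propagation step, X' = D^-1/2 (A+I) D^-1/2 X W, as in (1).

    X: (N, F) node features; A: (N, N) adjacency matrix; W: (F, F') weights.
    """
    N = A.size(0)
    A_tilde = A + torch.eye(N)                 # add self-loops
    deg = A_tilde.sum(dim=1)                   # degree of each node
    D_inv_sqrt = torch.diag(deg.pow(-0.5))     # D^-1/2
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W

# Example: 5 nodes (e.g., one master and four slaves), 26 metrics -> 16 hidden
X = torch.randn(5, 26)
A = torch.ones(5, 5) - torch.eye(5)            # illustrative fully connected topology
W = torch.randn(26, 16)
H = gcn_layer(X, A, W)                         # shape (5, 16)
```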
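The paragraph above states that TopoMAD replaces the linear layers in LSTM with graph neural layers. Since the actual GraphLSTM is specified in Section IV-D (not reproduced here), the sketch below only illustrates the idea: a standard LSTM gate layout in which the usual linear maps are swapped for the dense GCN rule of (1). The class name, initialization, and interface are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GraphLSTMCell(nn.Module):
    """LSTM cell whose input/hidden transforms are graph convolutions (sketch)."""

    def __init__(self, in_feats, hidden_feats):
        super().__init__()
        # one weight block per gate set, for the input and for the hidden state
        self.W_x = nn.Parameter(torch.randn(in_feats, 4 * hidden_feats) * 0.01)
        self.W_h = nn.Parameter(torch.randn(hidden_feats, 4 * hidden_feats) * 0.01)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_feats))

    @staticmethod
    def gconv(X, A, W):
        # dense GCN step of (1): every node aggregates its neighbors' features
        N = A.size(0)
        A_t = A + torch.eye(N, device=A.device)
        d_inv_sqrt = torch.diag(A_t.sum(dim=1).pow(-0.5))
        return d_inv_sqrt @ A_t @ d_inv_sqrt @ X @ W

    def forward(self, X, A, state):
        h, c = state                                        # each (N, hidden)
        gates = self.gconv(X, A, self.W_x) + self.gconv(h, A, self.W_h) + self.bias
        i, f, o, g = gates.chunk(4, dim=-1)                 # input/forget/output/candidate
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c
```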
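As a concrete reading of (4) and (5), the snippet below estimates the reconstruction probability of an input by Monte Carlo sampling from q_φ(z|x), assuming a diagonal-Gaussian decoder; `encoder` and `decoder` are placeholder callables, not the TopoMAD networks.

```python
import math
import torch

def reconstruction_probability(x, encoder, decoder, L=16):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x|z)] as in (5).

    encoder(x) -> (mu_z, logvar_z) of q_phi(z|x);
    decoder(z) -> (mu_x, logvar_x) of a diagonal-Gaussian p_theta(x|z).
    """
    mu_z, logvar_z = encoder(x)
    log_probs = []
    for _ in range(L):
        eps = torch.randn_like(mu_z)
        z = mu_z + eps * torch.exp(0.5 * logvar_z)        # reparameterization trick
        mu_x, logvar_x = decoder(z)
        # log N(x; mu_x, diag(exp(logvar_x))), summed over all dimensions
        log_p = -0.5 * (logvar_x + (x - mu_x) ** 2 / torch.exp(logvar_x)
                        + math.log(2 * math.pi)).sum()
        log_probs.append(log_p)
    return torch.stack(log_probs).mean()                  # higher = more "normal"
```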
of an epoch if the validation loss does not decrease. During inference, in contrast, the input for the decoder at each time step always comes from its preceding reconstruction (see Fig. 6). Specifically, at time step t - 1, we use the mean value μ(z_t) of p(X_t | z_t) = N(μ(z_t), σ(z_t)) as the input for the decoder.

F. Offline Model Training

Let φ denote the encoder network parameters and θ denote the decoder network parameters. Similar to VAE, our model is trained by optimizing the variational lower bound on the marginal likelihood. Given a topological time series X_{t_0:t}, where t_0 = t - W + 1 denotes the beginning of the sliding window, the loss function is derived as follows:

L(\theta, \phi; X_{t_0:t}) = -D_{KL}\big(q_\phi(z_t | X_{t_0:t}) \,\|\, p_\theta(z_t)\big) + \mathbb{E}_{q_\phi(z_t | X_{t_0:t})}\big[\log p_\theta(X_{t_0:t} | z_t)\big]
\approx -D_{KL}\big(q_\phi(z_t | X_{t_0:t}) \,\|\, \mathcal{N}(0, I)\big) + \frac{1}{L} \sum_{l=1}^{L} \sum_{j=t_0}^{t} \log p_\theta\big(X_j \mid z_t^{(l)}, X_{j+1:t}\big).   (7)

Here, L denotes the number of samples used to estimate the expectation. In this article, we set L = 1 since it has been reported in [15] that one sample is sufficient as long as the minibatch size is large enough. The pseudocode for training the GraphLSTM-VAE model in TopoMAD is shown in Algorithm 1 in Appendix A of the supplementary material.

G. Computing Anomaly Scores

As shown in Section III-E, the reconstruction probability E_{q_φ(z_t|X_{t_0:t})}[log p_θ(X_t | z_t)] can be calculated immediately at time t and used to indicate whether X_t is anomalous. We follow this approach and use its additive inverse as the anomaly score of an observation X_t, so that a higher anomaly score implies a more anomalous observation. The temporary anomaly score at time t is formulated as follows:

tempS_t = -\mathbb{E}_{q_\phi(z_t | X_{t_0:t})}\big[\log p_\theta(X_t | z_t)\big].   (8)

It has been reported in [3] that anomalous observations usually occur continuously, and in practice, it is acceptable to trigger an alert within a short delay. We therefore allow the anomaly score of an observation X_t to be adjusted according to some of its succeeding observations. As a stochastic sequence-to-sequence model, the model in TopoMAD not only learns to reconstruct X_t but also learns to reconstruct the W - 1 observations preceding X_t, which makes it convenient to update the anomaly score of an observation X_t. The final anomaly score S_t for an observation X_t is formulated as follows:

S_t = -\frac{1}{L \cdot D} \sum_{d=0}^{D-1} \sum_{l=1}^{L} \log p_\theta\big(X_t \mid z_{t+d}^{(l)}, X_{t+1:t+d}\big)   (9)

where L denotes the sampling number and D denotes the number of times we calculate or update an anomaly score (i.e., the tolerance of detection delay).

In some cases, a relatively low reconstruction probability of a particular component in the system is enough to draw our attention. Therefore, we also calculate an anomaly score from a component perspective as follows:

S_t = \max_{0 \le i < N} \left( -\frac{1}{L \cdot D} \sum_{d=0}^{D-1} \sum_{l=1}^{L} \log p_\theta\big(X_t^{i} \mid z_{t+d}^{(l)}, X_{t+1:t+d}\big) \right)   (10)

where N denotes the number of components in the system and X_t^i denotes the metrics of component i at time t.

For online anomaly detection, whether to trigger an alert at time t is decided based upon tempS_t, which can be calculated immediately at time t, and the D - 1 consecutive updated anomaly scores preceding it. When one of these anomaly scores is higher than the threshold, an anomaly is detected. Algorithm 2 in Appendix A of the supplementary material shows the pseudocode of our online anomaly detection algorithm.

H. Threshold Selection

An anomaly score can reveal how anomalous an observation is, but in practice, a threshold is still needed to trigger an alert and instruct operators to take actions.

We propose a threshold selection method based on the assumption that anomaly scores of normal data lie in an area with a high density, while anomaly scores of abnormal data lie in an area with a low density, and that the distance between these two areas is relatively large. Specifically, we define the distance between two anomaly score sets S_{<τ} and S_{>τ}, separated by a threshold τ, as follows:

d(S_{<\tau}, S_{>\tau}) = \frac{\min(S_{>\tau}) - \max(S_{<\tau})}{\min(S_{>\tau}) + \max(S_{<\tau}) - 2\min(S_{<\tau})}   (11)

where max(S) denotes the maximal element of S and min(S) denotes the minimal element.

Based upon this assumption, we select, from a range provided by an operator, the threshold which maximizes the distance (11) between the two sets that it cuts from the training data set. The pseudocode of the threshold selection algorithm is given in Algorithm 3 in Appendix A of the supplementary material.

If operators want the selected threshold to adapt to dynamic changes in the system, they can apply the automatic threshold selection method to newly collected data and update the selected threshold routinely to maintain its effectiveness.

V. EXPERIMENT VALIDATION

In this section, we conduct experiments to validate our model and answer the following questions.

1) Can TopoMAD outperform other approaches in anomaly detection for topological multivariate time series collected from a cloud system?
2) How does each component of TopoMAD affect the performance?
3) How effective is the topology?
4) How can we interpret the results of TopoMAD?
5) How robust is TopoMAD?
6) How efficient is TopoMAD?
7) How effective is our unsupervised threshold selection method in TopoMAD? Can it recommend a relatively better threshold?
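For illustration, the delay-tolerant scores (8)–(10) of Section IV-G can be sketched as below. `log_recon_prob` is a hypothetical helper returning the per-component log reconstruction probabilities log p_θ(X_t^i | z_{t+d}^{(l)}, X_{t+1:t+d}); the observation-level score additionally assumes that p_θ factorizes over components, so only the averaging of (9) and the component-wise maximum of (10) are shown.

```python
import numpy as np

def anomaly_scores(log_recon_prob, t, D, L):
    """Sketch of the delay-tolerant scores (9) and (10).

    log_recon_prob(t, d, l) -> (N,) array of per-component log probabilities
    (a hypothetical interface standing in for the trained model).
    """
    acc = np.zeros_like(log_recon_prob(t, 0, 0), dtype=float)
    for d in range(D):
        for l in range(L):
            acc += log_recon_prob(t, d, l)
    avg = acc / (L * D)            # per-component average log-probability
    S_t = -avg.sum()               # (9), assuming p factorizes over components
    S_t_comp = (-avg).max()        # (10), driven by the most anomalous component
    return S_t, S_t_comp
```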
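A minimal sketch of the distance-based threshold search built on (11): candidate thresholds from an operator-provided range are scanned over the training scores and the one maximizing d(S_{<τ}, S_{>τ}) is kept. The grid granularity is an illustrative choice; the paper's exact procedure is Algorithm 3 in the supplementary material.

```python
import numpy as np

def distance(scores, tau):
    """d(S_<tau, S_>tau) from (11); returns -inf if one side is empty."""
    below, above = scores[scores < tau], scores[scores > tau]
    if below.size == 0 or above.size == 0:
        return -np.inf
    return (above.min() - below.max()) / (above.min() + below.max() - 2 * below.min())

def select_threshold(train_scores, tau_low, tau_high, num_candidates=200):
    """Pick the candidate threshold maximizing the set distance (11)."""
    candidates = np.linspace(tau_low, tau_high, num_candidates)
    return max(candidates, key=lambda tau: distance(train_scores, tau))

# Example on synthetic (possibly contaminated) training scores
scores = np.concatenate([np.random.normal(0.0, 1.0, 990),
                         np.random.normal(8.0, 0.5, 10)])
tau = select_threshold(scores, scores.min(), scores.max())
```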
TABLE II
FEATURES OF OUR TWO DATA SETS

A. Data Sets

To demonstrate the effectiveness of our model, we conduct experiments using two data sets, of which one is collected from an environment running a big data batch processing system (MBD), and the other from a microservice-based transaction processing system (MMS). The features of these two data sets are shown in Table II.

We obtain the data set MBD from a cluster containing five nodes with one master and four slaves. During the experiments, we continuously generate multiple random workloads with random parameters using HiBench [67], a big data benchmark published by Intel. At the same time, random faults, including external impairments and application faults, are injected irregularly with random parameters. The injected external impairments include system resource hogs (high CPU/memory/disk-I/O load) and network failures (delay or packet loss), and the application faults are simulated by injecting delays or exceptions into the Hadoop distributed file system (HDFS), causing symptoms similar to HDFS-448 [68] and HDFS-8160 [69]. We monitor and collect 26 metrics per node, including CPU idle, CPU I/O wait, CPU softirq, CPU system, CPU user, disk I/O wait per second, disk I/O in progress per second, disk used percentage, disk read speed, disk write speed, kernel entropy, load1, load5, load15, memory active, memory available percentage, memory cached, memory dirty, memory free, memory used percentage, network bytes received rate, network bytes sent rate, network TCP time wait, number of processes blocked, number of running processes, and the total number of processes. These metrics are selected for two reasons. First, they are representative system metrics used in a series of previous works [12], [27], [32], [43], [46], and we try to cover as many types as we can, ranging from CPU, memory, disk, and network to processes, to gain a thorough insight into the running state of the system. In addition, these metrics are easy to acquire using tools such as Perf, Telegraf [70], Amazon CloudWatch [71], and so on. These metrics are collected for three days and constitute the data set MBD, with anomalies labeled on the basis of the injection log. We refer to Fig. 3 for the topology input of this data set. Other metrics can, of course, also be fed into our algorithm.

With respect to the data set MMS, we adopt Hipster-Shop [72], a web-based e-commerce microservice benchmark where users can browse hipster goods, add them to the cart, and purchase them. This benchmark is deployed in a Kubernetes [73] cluster with 12 VMs. A load generator is included in Hipster-Shop to mimic visits to the website. Moreover, we inject anomalies with random parameters, such as CPU/disk-I/O hog, network delay, and container hang, into some pods during the experiments. Metrics, including CPU usage, memory usage, network receive rate, network transmit rate, pod latency, pod workload, and pod success rate, are collected from each pod and recorded in the data set MMS. Anomalies are then labeled according to the fault injection log. The pod-level topology, which can be used to construct a graph-based representation for the metrics in this data set, is demonstrated in Fig. 8.

Fig. 8. Pod-level topology of the Hipster-Shop, where pods belonging to the same service are plotted with the same color.

B. Performance Metrics

All the anomaly detectors we evaluate in this article can return an anomaly score for each observation, and therefore, we use the average precision (AP) of the abnormal class as the performance metric to compare them. AP is considered more suitable than the area under the receiver operating characteristic curve (AUC) when dealing with a highly skewed data set [74], i.e., when the normal data dominate the whole data set. It can be calculated as follows:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}
AP = \sum_i (R_i - R_{i-1}) P_i, \quad mAP = \frac{\sum_{C} AP_C}{N(\text{Classes})}   (12)

where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively, and P_i and R_i denote the precision and recall at the ith threshold. The mean of AP over all classes is known as mAP.

The F1 score, the harmonic mean of precision and recall, is also widely used as a performance metric for anomaly detection once observations have been classified as normal or abnormal with a threshold. It can be calculated as follows:

F1 = 2 \cdot \frac{P \cdot R}{P + R}.   (13)

We use the F1 score to compare the effectiveness of different threshold selection methods.

C. Experiment Setup

We compare our model in TopoMAD with seven baseline anomaly detectors, including a Gaussian-based anomaly detector, OC-SVM [21], LOF [22], a simple auto-encoder [23], LSTM-AD [8], LSTM-ED [9], and RNN-EBM [10], which have been introduced in Section II-A. Since anomaly detection on metrics with the help of topological information is a rather unprecedented attempt, for the other anomaly detectors, we conduct experiments in the following two ways. The first way is to concatenate all metrics from different components into one huge vector and use this vector as the input of the evaluated detector (the concatenate policy referred to in Section V-D); the second way, the one-by-one policy, trains and applies the detector on each component separately.
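The AP and F1 computations in (12) and (13) match standard library routines; a short example with scikit-learn (assumed available) and toy labels:

```python
from sklearn.metrics import average_precision_score, f1_score

# y_true: 0/1 labels from the fault-injection log; y_score: anomaly scores
y_true = [0, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.9, 0.3, 0.8, 0.7, 0.2]

ap = average_precision_score(y_true, y_score)   # AP of the abnormal class

# F1 requires a hard decision, i.e., a threshold tau on the scores
tau = 0.5
y_pred = [int(s > tau) for s in y_score]
f1 = f1_score(y_true, y_pred)
```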
Fig. 9. AP of TopoMAD and other baseline anomaly detectors on MBD.
Fig. 10. AP of TopoMAD and other baseline anomaly detectors on MMS.
Deep learning-based models are good at modeling complex dependencies embedded in data, but this advantage can turn into a disadvantage when these deep anomaly detectors are trained on contaminated training data. When this problem is not taken into account in unsupervised anomaly detection, attempts to improve model fitting capacity are very likely to backfire. For MBD, LSTM-ED performs the best among all baseline anomaly detectors. For MMS, if we utilize the concatenate policy, RNN-EBM achieves the best result among all baseline deep anomaly detectors, and if we apply the one-by-one policy, a simple auto-encoder is the best among all baseline deep anomaly detectors.

LSTM-AD achieves the worst result among all deep learning-based models evaluated. As a prediction-based model, LSTM-AD relies heavily on the time series being predictable; therefore, it is not surprising that LSTM-AD performs the worst when our data sets do not satisfy this setting. By contrast, the assumption that observations are reconstructable is weaker and more reliable. The other deep learning-based models we evaluate, including the auto-encoder, LSTM-ED, RNN-EBM, and TopoMAD, are based upon this more reliable assumption and thus perform better than LSTM-AD on our two data sets.

LSTM-ED is a reconstruction-based model, and our model in TopoMAD can be seen as its extension with consideration of topological information and contaminated training data. In addition to the lack of these considerations, the suboptimal performance of LSTM-ED is also caused by the discrepancy between the input distributions at the training and testing phases, because the decoder in LSTM-ED uses ground truth observations as input at the training stage but reconstructed observations at the inference stage. To show the comparison between LSTM-ED and TopoMAD more clearly, we provide a case study in Appendix B of the supplementary material.

As shown in Section V-C, we evaluate the baseline anomaly detectors using two training policies, namely, a concatenate policy and a one-by-one policy. For MBD, nearly all baseline models achieve better performance using the concatenate policy (note that deep anomaly detectors with the concatenate policy have higher space and time complexity than TopoMAD and than those with the one-by-one policy; we discuss this more specifically in Section V-I). Nevertheless, no matter which policy they utilize, TopoMAD retains a considerable superiority over them. In Section IV-A, we claim three benefits of TopoMAD, namely, unified feature learning, convenience for end-to-end learning, and the ability to prevent overfitting. These three benefits are based upon the comparison between our model and the other methods trained using the two policies described earlier.

Fig. 14. ROC curves on both data sets, MBD and MMS. (a) MBD. (b) MMS.

Finally, the ROC curves of the anomaly detectors with the top three performances on each data set are displayed in Fig. 14 to show the performance tradeoff of different models more distinctly. From the ROC curves on MBD, we can see that the false positive rate of TopoMAD is lower compared with the other two baseline models under a relatively low threshold. As for MMS, the performance of TopoMAD is similar to LOF and steadily better than RNN-EBM under different thresholds.

Fig. 15. AP of model variants on two data sets.

E. Effects of Major Components

In this section, we answer the question "How does each component of TopoMAD affect the performance?" The results are displayed in Fig. 15. We include the mAP of LSTM-ED here because our model can be viewed as its extension, and we select the best performance gained by LSTM-ED among the two policies mentioned above for display. "GCN ONLY" and "GAT ONLY" denote model variants of TopoMAD obtained by removing its variational component, which degrades our model into a deterministic one. "LINEAR + VAE" denotes a model variant obtained by replacing the GraphLSTM in TopoMAD with a simple LSTM, which loses sight of the topological information. "GCN + VAE" and "GAT + VAE" denote model variants with two different choices of graph neural layers when implementing the GraphLSTM in TopoMAD.

1) Effect of the Inclusion of Graph Neural Networks: The objective of including graph neural networks in TopoMAD is to extract features from the neighbors of each component. We can see in Fig. 15 that "GCN ONLY" and "GAT ONLY" are superior to LSTM-ED, and "GCN + VAE" and "GAT + VAE" outperform or match "LINEAR + VAE" on both data sets. That is, whether in a deterministic model or in a stochastic model, utilizing graph neural networks to explicitly model the spatial dependence among components in a cloud system is beneficial.

2) Effect of VAE: It can be seen in Fig. 15 that "LINEAR + VAE," "GCN + VAE," and "GAT + VAE" all outperform their corresponding deterministic variants. The superiority of
TABLE III
DIFFERENT TOPOLOGY INPUTS' AP ON TWO DATA SETS
TABLE VI
SPACE COMPLEXITY OF DEEP LEARNING RECONSTRUCTION-BASED ANOMALY DETECTION METHODS
selection method can also gain a relatively good threshold. Considering the lack of labeled operational data in reality, we recommend selecting a threshold in an unsupervised way.

K. Discussions

1) Assumptions of Unsupervised Anomaly Detectors: Due to the lack of labeled data for training, the effectiveness of an unsupervised anomaly detector relies heavily on its assumption. As introduced in Section II-A, the assumptions of traditional anomaly detectors include stochastic model-based assumptions, density-based assumptions, classification-based assumptions, and so on. As for the deep learning-based anomaly detectors examined in this article, putting aside their superiority in modeling dependence in data with complex structure, their assumptions are actually quite simple: what they assume is that normal data are easier to generate than abnormal data. This generation process can be further subdivided into reconstruction based or prediction based. The quality of an assumption is determined by the data rather than by the assumption itself; therefore, it is no wonder that a simple traditional model can achieve the same or even superior performance compared with some state-of-the-art deep learning-based models, as visualized in Figs. 9 and 10. Integrating the capacity of deep learning-based models to model dependence in data with complex structure with the diversity of assumptions in traditional models is worth a try, and is denoted as a deep hybrid model in [79].

2) Anomaly Localization: Once an anomaly is detected, system operators usually need to locate its real root causes. To locate an anomaly, operators can take the anomaly scores of all components generated by TopoMAD as the input of anomaly localization algorithms, such as CauseInfer [27] and Microscope [80]. For example, when an anomaly in a microservice system is detected by TopoMAD, operators can locate it using the cause inference algorithm proposed in Microscope [80], which traverses the service topology to find the root cause.

VI. CONCLUSION

Since anomaly detection is essential to operating a cloud system, this article proposes a topology-aware multivariate time series anomaly detector, namely TopoMAD, for unsupervised anomaly detection. To achieve that, we propose a novel neural network architecture by integrating graph neural networks, LSTM, and VAE. Our approach can robustly and effectively model the complex spatiotemporal dependence in contaminated data. The time complexity of TopoMAD is quantitatively analyzed, which shows that TopoMAD is efficient enough to analyze a large-scale cloud system. Moreover, a threshold selection algorithm is proposed to make TopoMAD readily applicable in practice. The experimental results on two real-world data sets demonstrate our model's superior performance over other baseline anomaly detectors.

In future work, we will focus on the following.

1) Research on online learning techniques for TopoMAD.
2) The choice of metrics which are better at exposing the failures of a cloud system.
3) Research on combining some other state-of-the-art deep anomaly detectors to gain better performance.

REFERENCES

[1] Y. Dang, Q. Lin, and P. Huang, "AIOps: Real-world challenges and research innovations," in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng., Companion (ICSE-Companion), May 2019, pp. 4–5.
[2] A. Siffer, P.-A. Fouque, A. Termier, and C. Largouet, "Anomaly detection in streams with extreme value theory," in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2017, pp. 1067–1075.
[3] H. Xu et al., "Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in Web applications," in Proc. World Wide Web Conf. (WWW), 2018, pp. 187–196.
[4] Z. Li, W. Chen, and D. Pei, "Robust and unsupervised KPI anomaly detection based on conditional variational autoencoder," in Proc. IEEE 37th Int. Perform. Comput. Commun. Conf. (IPCCC), Nov. 2018, pp. 1–9.
[5] W. Chen et al., "Unsupervised anomaly detection for intricate KPIs via adversarial training of VAE," in Proc. IEEE INFOCOM Conf. Comput. Commun., Apr. 2019, pp. 1891–1899.
[6] H. Ren et al., "Time-series anomaly detection service at Microsoft," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 3009–3017.
[7] Z. Zheng, J. Zhu, and M. R. Lyu, "Service-generated big data and big data-as-a-service: An overview," in Proc. IEEE Int. Congr. Big Data, Jun. 2013, pp. 403–410.
[8] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, "Long short term memory networks for anomaly detection in time series," in Proc. Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., 2015, pp. 89–94.
[9] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff, "LSTM-based encoder-decoder for multi-sensor anomaly detection," in Proc. Anomaly Detection Workshop, 33rd Int. Conf. Mach. Learn., 2016, pp. 1–5.
[10] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang, "Deep structured energy based models for anomaly detection," in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1100–1109.
[11] C. Zhang et al., "A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 1409–1416.
[12] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, "Robust anomaly detection for multivariate time series through stochastic recurrent neural network," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 2828–2837.
[13] J. An and S. Cho, "Variational autoencoder based anomaly detection using reconstruction probability," Special Lect. IE, vol. 2, no. 1, pp. 1–18, Dec. 2015.
[14] L. Li, J. Yan, H. Wang, and Y. Jin, "Anomaly detection of time series with smoothness-inducing sequential variational auto-encoder," IEEE Trans. Neural Netw. Learn. Syst., early access, Apr. 13, 2020, doi: 10.1109/TNNLS.2020.2980749.
[15] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–14.
[16] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–13.
[17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–12.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[19] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451–2471, Oct. 2000.
[20] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, p. 15, 2009.
[21] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 582–588.
[22] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in Proc. Int. Conf. Manage. Data (SIGMOD), vol. 29, 2000, pp. 93–104.
[23] S. Hawkins, H. He, G. Williams, and R. Baxter, "Outlier detection using replicator neural networks," in Proc. Int. Conf. Data Warehousing Knowl. Discovery. Berlin, Germany: Springer, 2002, pp. 170–180.
[24] S. Fu, J. Liu, and H. Pannu, "A hybrid anomaly detection framework in cloud computing using one-class and two-class support vector machines," in Proc. Int. Conf. Adv. Data Mining Appl. Berlin, Germany: Springer, 2012, pp. 726–738.
[25] T. Huang et al., "An LOF-based adaptive anomaly detection scheme for cloud computing," in Proc. IEEE 37th Annu. Comput. Softw. Appl. Conf. Workshops, Jul. 2013, pp. 206–211.
[26] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proc. 1st ACM Symp. Cloud Comput. (SoCC), 2010, pp. 193–204.
[27] P. Chen, Y. Qi, P. Zheng, and D. Hou, "CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems," in Proc. IEEE INFOCOM Conf. Comput. Commun., Apr. 2014, pp. 1887–1895.
[28] Q. Lin et al., "Predicting node failure in cloud service systems," in Proc. 26th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), 2018, pp. 480–490.
[29] X. Zhou et al., "Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study," IEEE Trans. Softw. Eng., early access, Dec. 18, 2018, doi: 10.1109/TSE.2018.2887384.
[30] X. Zhou et al., "Latent error prediction and fault localization for microservice applications by learning from system trace logs," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., Aug. 2019, pp. 683–694.
[31] H. Liu, S. Lu, M. Musuvathi, and S. Nath, "What bugs cause production cloud incidents," in Proc. Workshop Hot Topics Operating Syst., 2019, pp. 155–162.
[32] O. Vallis, J. Hochenbaum, and A. Kejariwal, "A novel technique for long-term anomaly detection in the cloud," in Proc. 6th USENIX Workshop Hot Topics Cloud Comput. (HotCloud), 2014, p. 15.
[33] C. Wang, V. Talwar, K. Schwan, and P. Ranganathan, "Online detection of utility cloud anomalies using metric distributions," in Proc. IEEE Netw. Oper. Manage. Symp. (NOMS), Apr. 2010, pp. 96–103.
[34] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2017, pp. 1285–1298.
[35] W. Meng et al., "LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 4739–4745, doi: 10.24963/ijcai.2019/658.
[36] X. Zhang et al., "Robust log-based anomaly detection on unstable log data," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), 2019, pp. 807–817.
[37] S. He, J. Zhu, P. He, and M. R. Lyu, "Experience report: System log analysis for anomaly detection," in Proc. IEEE 27th Int. Symp. Softw. Rel. Eng. (ISSRE), Oct. 2016, pp. 207–218.
[38] N. Khadke, M. P. Kasick, S. P. Kavulya, J. Tan, and P. Narasimhan, "Transparent system call based performance debugging for cloud computing," presented at the Workshop Manag. Syst. Automatically Dynamically, 2012.
[39] W. Sha, Y. Zhu, M. Chen, and T. Huang, "Statistical learning for anomaly detection in cloud server systems: A multi-order Markov chain framework," IEEE Trans. Cloud Comput., vol. 6, no. 2, pp. 401–413, Jun. 2018.
[40] Y. Gan et al., "Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2019, pp. 19–33.
[41] M. Ma, W. Lin, D. Pan, and P. Wang, "MS-Rank: Multi-metric and self-adaptive root cause diagnosis for microservice applications," in Proc. IEEE Int. Conf. Web Services (ICWS), Jul. 2019, pp. 60–67.
[42] A. B. Sharma, H. Chen, M. Ding, K. Yoshihira, and G. Jiang, "Fault detection and localization in distributed systems using invariant relationships," in Proc. 43rd Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2013, pp. 1–8.
[43] P. Chen, Y. Qi, X. Li, and L. Su, "An ensemble MIC-based approach for performance diagnosis in big data platform," in Proc. IEEE Int. Conf. Big Data, Oct. 2013, pp. 78–85.
[44] W. Cheng, K. Zhang, H. Chen, G. Jiang, Z. Chen, and W. Wang, "Ranking causal anomalies via temporal and dynamical analysis on vanishing correlations," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 805–814.
[45] Z. Zheng, T. Chao Zhou, M. R. Lyu, and I. King, "Component ranking for fault-tolerant cloud applications," IEEE Trans. Services Comput., vol. 5, no. 4, pp. 540–550, 4th Quart., 2012.
[46] M. Zasadziński, M. Solé, A. Brandon, V. Muntés-Mulero, and D. Carrera, "Next stop 'NoOps': Enabling cross-system diagnostics through graph-based composition of logs and metrics," in Proc. IEEE Int. Conf. Cluster Comput. (CLUSTER), Sep. 2018, pp. 212–222.
[47] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom, "Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2018, pp. 387–395.
[48] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," 2019, arXiv:1901.00596. [Online]. Available: http://arxiv.org/abs/1901.00596
[49] Y. Li, R. Yu, C. Shahabi, and Y. Liu, "Diffusion convolutional recurrent neural network: Data-driven traffic forecasting," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–14.
[50] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555. [Online]. Available: http://arxiv.org/abs/1412.3555
[51] B. Yu, H. Yin, and Z. Zhu, "Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 3634–3640.
[52] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 7444–7452.
[53] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, "Using Magpie for request extraction and workload modelling," in Proc. OSDI, vol. 4, 2004, p. 18.
[54] R. Fonseca, G. Porter, R. H. Katz, and S. Shenker, "X-Trace: A pervasive network tracing framework," in Proc. 4th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2007, pp. 271–284.
[55] X. Chen, M. Zhang, Z. M. Mao, and P. Bahl, "Automating network application dependency discovery: Experiences, limitations, and new solutions," in Proc. OSDI, vol. 8, 2008, pp. 117–130.
[56] P. Barham et al., "Constellation: Automated discovery of service and host dependencies in networked systems," Microsoft Res., Redmond, WA, USA, Tech. Rep. MSR-TR-2008-67, 2008, pp. 1–14.
[57] J. Hwang, G. Liu, S. Zeng, F. Y. Wu, and T. Wood, "Topology discovery and service classification for distributed-aware clouds," in Proc. IEEE Int. Conf. Cloud Eng., Mar. 2014, pp. 385–390.
[58] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
[59] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[60] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, "Semantic object parsing with graph LSTM," in Proc. Eur. Conf. Comput. Vis. Berlin, Germany: Springer, 2016, pp. 125–143.
[61] N. Peng, H. Poon, C. Quirk, K. Toutanova, and W.-T. Yih, "Cross-sentence N-ary relation extraction with graph LSTMs," Trans. Assoc. Comput. Linguistics, vol. 5, pp. 101–115, Dec. 2017.
[62] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, "Structural-RNN: Deep learning on spatio-temporal graphs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5308–5317.
[63] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[64] C. Robert and G. Casella, Monte Carlo Statistical Methods. Berlin, Germany: Springer, 2013.
[65] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[66] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 1171–1179.
[67] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in Proc. IEEE 26th Int. Conf. Data Eng. Workshops (ICDEW), Mar. 2010, pp. 41–51.
[68] (2020). HDFS-448. Accessed: 2020. [Online]. Available: https://issues.apache.org/jira/browse/HDFS-448
[69] (2020). HDFS-8160. Accessed: 2020. [Online]. Available: https://issues.apache.org/jira/browse/HDFS-8160
[70] (2020). Telegraf. Accessed: 2020. [Online]. Available: https://www.influxdata.com/time-series-platform/telegraf/
[71] (2020). Amazon CloudWatch. Accessed: 2020. [Online]. Available: https://aws.amazon.com/cn/cloudwatch/
[72] (2019). Hipster Shop: Cloud-Native Microservices Demo Application. Accessed: 2019. [Online]. Available: https://github.com/GoogleCloudPlatform/microservices-demo
[73] (2019). Kubernetes. Accessed: 2019. [Online]. Available: https://kubernetes.io/
[74] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 233–240, doi: 10.1145/1143844.1143874.
[75] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," 2019, arXiv:1903.02428. [Online]. Available: http://arxiv.org/abs/1903.02428
[76] P. I. Frazier, "A tutorial on Bayesian optimization," 2018, arXiv:1807.02811. [Online]. Available: http://arxiv.org/abs/1807.02811
[77] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2951–2959.
[78] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh, "Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 257–266.
[79] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," 2019, arXiv:1901.03407. [Online]. Available: http://arxiv.org/abs/1901.03407
[80] J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," in Proc. Int. Conf. Service-Oriented Comput. Berlin, Germany: Springer, 2018, pp. 3–20.

Yongfeng Wang received the B.E. degree from the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His current research areas include persistent memory, storage system, and cloud computing.

Guangba Yu is currently pursuing the Ph.D. degree with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His current research areas include distributed system, cloud computing, and AI-driven operations.

Xiaoyun Li is currently pursuing the Ph.D. degree with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. Her current research areas include log analysis and AI-driven operations.

Zibin Zheng (Senior Member, IEEE) received the Ph.D. degree from The Chinese University of Hong Kong, Hong Kong, in 2011. He is currently a Professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His research interests include blockchain, services computing, software engineering, and financial big data.