
Triple: The Interpretable Deep Learning Anomaly Detection Framework based on Trace-Metric-Log of Microservice

Rui Ren (Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; [email protected])
Yang Wang (Institute of Computing Technology, CAS; [email protected])
Fengrui Liu (Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; [email protected])
Zhenyu Li (Institute of Computing Technology, CAS; [email protected])
Gaogang Xie* (Computer Network Information Center, CAS; [email protected])

2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS) | 979-8-3503-9973-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/IWQOS57198.2023.10188773

Abstract—Existing deep-learning-based anomaly detection approaches can simultaneously dig out key information from at most two of the dimensions among traces, metrics and logs. Besides, they only output a simple binary result, which ignores the key artificial statement information in the logs. In this paper, we propose Triple, an interpretable anomaly detection approach based on deep learning for microservice systems. More importantly, Triple aims to help engineers establish trust in the system's decision from key metrics and the artificial statements in logs. Triple leverages a graph representation to describe the complicated dependency relationships in the traces, with the logs and metrics embedded into the node features. Based on the graph representation, Triple trains a Spatial-Temporal Graph Convolutional Network (STGCN) to capture the key information and generates a decision boundary with Deep SVDD, which detects the system's anomalies. In addition, we design an interpreter that translates the simple binary result into a humanly understandable result covering logs, metrics and traces, to facilitate engineers' understanding and handling of the incoming incident. Our work has four aims. First, to the best of our knowledge, we are the first to simultaneously apply the three data sources to anomaly detection in this domain. Second, we design a new anomaly detection method, an STGCN based on SVDD. Third, we design an interpreter so that the decision is not merely a simple binary result; the interpretable result captures the key artificial statement information in the logs and assists engineers in incident troubleshooting. Finally, we design a series of experiments to validate our method's effectiveness on real-world system datasets. Our results show that Triple consistently achieves improvements of 11%-65% over other state-of-the-art models.
Index Terms—microservice, anomaly detection, Spatial-Temporal Graph Convolutional Network, Deep SVDD, interpretability

* Gaogang Xie is the corresponding author of the paper.

I. INTRODUCTION

Large monolithic services have been gradually re-architected into finer-grained modules, which combine hundreds (or even thousands) of loosely-coupled microservices. For large-scale online microservice systems, such as social networking, E-banking, and search engines, engineers need to frequently conduct software changes [12], [13], aiming to fix bugs, deploy new features, adapt to environmental changes, and improve software performance. According to the statistics of Google SRE (Site Reliability Engineering), 70% of failures are caused by configuration and service changes [11], [14]. To allow engineers to react to potential failures in a timely manner, it is essential to detect anomalies automatically at runtime.
With the development of monitoring and collecting tools, Metrics, Traces and Logs have become the three cornerstones of a microservice system's observability. Metrics are numeric measurements instrumented over intervals of time, which help to understand why an application is working in a certain way. Each trace records the calling process of a request through some service instances and their operations [25], [26]. Logs are an important and valuable data source in online service systems, which record detailed information about system running status and user behavior.
Logs, metrics and traces have been widely used in anomaly detection for microservice systems. Existing log anomaly detection methods learn log patterns from normal history data and trigger detection when the online data deviates from the trained model. For a microservice system, a request involves complex invocations among many service instances. These approaches cannot capture the complex structure of the invocation chain, which decreases the detection accuracy. Some researches aim to detect anomalies from trace data. They encode a trace as a sequence of service invocations.
However, traces focus on the inter-service relationships and cannot reflect the service instances' status well enough. On the contrary, methods that only leverage metrics to detect anomalies capture the service instances' status but lose the inter-service relationships.
Recent researches [16], [37] try to combine traces and metrics, or traces and logs, to detect runtime anomalies of microservice systems with deep learning methods. These algorithms construct a graph representation from traces and embed logs or metrics into the graph. However, existing anomaly detection approaches cannot simultaneously capture the key information and relationships from the three data sources: metrics, traces and logs. Furthermore, they ignore the log messages, which describe the behaviors of individual services, and the metrics, which describe the status of the service instances. Therefore, these algorithms cannot detect anomalies in the system well enough.
Besides, many of these researches perform anomaly detection with deep-learning and machine-learning methods, but they only focus on improving the precision of anomaly detection. They are ill-suited to a production system, since it is difficult to establish trust in the system's decision without credible explanations.
In this paper, we propose Triple, an interpretable anomaly detection approach based on deep learning. Triple leverages the traces' dependency relationships to construct a unified graph representation, where the nodes' features are the processed logs and metrics. Then, Triple trains an STGCN based Deep SVDD (Support Vector Data Description) model, which extracts key information from the unified graph representation and constructs a data-enclosing hypersphere in the output space [30], [35]. Triple judges whether the time-series data is anomalous based on its score, which is the distance between the data and the hypersphere's center. Triple does not rely on data labels and only requires that most of the training data is collected under the normal situation of the system. After finishing the anomaly detection procedure, our interpreter helps engineers establish trust in the system by providing credible evidence for why the anomaly has been identified from the three data sources. We minimize the distance between the prediction results of a possible subgraph representation and of the original graph representation to dig out the key features.
We design a series of experiments to evaluate our algorithm on open-source real-system datasets [4]. The results show that Triple outperforms other existing state-of-the-art (SOTA) anomaly detection algorithms by 11%-65% in precision.
Overall, the contributions of this paper are four-fold:
• We design a new anomaly detection method, an STGCN based Deep SVDD model.
• We are the first to simultaneously apply three data sources to anomaly detection in this domain. The ablation experiment shows the three data sources help our model improve detection precision by 11%-20% on dataset A and by 3%-7% on dataset B.
• To tackle the black-box problem, we design an interpreter for our anomaly detection algorithm. The interpreter helps engineers to understand the algorithm's decision and confirm the anomaly's impact from the three data sources.
• We evaluate Triple using open-source datasets with 214 faults in total. Triple improves performance over prior state-of-the-art methods and outperforms other approaches by 11%-65%. Besides, we release our source code on GitHub [9].

[Figure 1 depicts the basic concepts: a Microservice System consists of Microservices; (Caller, Callee) microservice pairs are recorded as invocations (spans) in a Trace; each invocation generates log messages; a Physical Node hosts Instances (Containers); instance metrics and node metrics are collected from them.]
Fig. 1: The Overview of Basic Concepts

II. BACKGROUND AND MOTIVATION

In Fig. 1, we describe the relationships among some basic concepts. A Microservice System is an architecture composed of loosely-coupled and lightweight Microservices. Microservices are initiated as instances, which are the running versions of a microservice on a Physical Node. Monitoring tools [7] generate metrics to describe the physical nodes' and instances' status. A trace describes the execution process of a request through service instances [2]; it is supported by open-source solutions [3], [5], [8]. Logs are produced during each service invocation and are generated by the logging statements. They record the internal status and behaviors of the invocation.
Various statistical and deep learning methods have been leveraged to detect anomalies in microservice systems [27], [38]. Logs, traces and metrics are the most important features for observing anomalies.

A. Anomaly Detection Prior Work Based on Single-source Data

Some algorithms aim to detect anomalies from log messages and patterns. DeepLog [15] parses log messages and encodes them as time-series data. It leverages an LSTM to learn and predict the time-series log pattern at the next time step. If the prediction deviates from the online log result, it triggers the anomaly detection procedure. TraceAnomaly [23] treats a trace as a sequence of service invocations. It constructs a service trace vector (STV) to record the detailed invocation sequences and leverages a VAE to find the deviation between historical normal STVs and online STVs. Metrics help engineers to monitor the physical nodes' and service instances' status. Donut [22] captures the normal seasonal metrics with a VAE and reconstructs the metrics, which deviate under abnormal data.

B. Anomaly Detection Prior Work Based on Multi-source Data

However, the above mentioned works still cannot fully detect all types of anomalies due to their single-source data. Therefore, recent work further explores anomaly detection on multi-source data.
Seer [16] encodes traces and metrics as sequences. It leverages a Convolutional Neural Network (CNN) to capture the spatial relationships and a Long Short-Term Memory (LSTM) to learn time-series features. The CNN+LSTM architecture helps to analyze anomalies from the services' interaction relationships and instance status together, which helps Seer to outperform the existing single-source anomaly detection methods.
SCWarn [41] leverages a multi-modal LSTM to capture information from heterogeneous multi-source data. Its core ideas are capturing the temporal dependency in each time series via an LSTM model and encoding the inter-correlations among the multi-source data (metrics and logs).
DeepTraLog [37] uses a unified graph representation to depict the complex structure of a trace together with log events embedded in the structure. It leverages the graph representation to detect anomalies with a deep learning method.

C. Interpretable Research of Deep Learning

Interpretable artificial intelligence is an emerging area of research that aims to understand how deep-learning models make decisions. We consider a model to be "interpretable" when the model can provide a humanly understandable result. Two major categories of interpretable research are introduced in the following subsections.
1) Gradient-based Methods: Employing gradients to explain deep models is the most straightforward solution, which is widely used in image and text tasks [10], [36].
CAM [28] maps the node features in the final layer to the input space to identify important nodes. It requires the GNN model to employ a global average pooling (GAP) and a fully-connected (FC) layer as the final classifier. CAM leverages the weights between the GAP and the final FC layer to calculate importance scores.
Grad-CAM [32] extends CAM to break the constraint of the GAP layer. It employs gradients as the weights to explore different feature maps. Specifically, it averages such gradients to obtain the weight for each feature map and calculates the importance score.
2) Perturbation-based Methods: Intuitively, perturbation-based methods aim to explore the output variation with respect to different input perturbations. When the important input information is kept, the prediction should be similar to the original prediction result [31].
GNNExplainer [34] learns soft masks for the edges and node features of the original graph via element-wise perturbation. The soft masks are randomly initialized and treated as trainable parameters. GNNExplainer combines the soft masks with the graph and node features to perturb the original features. Then it trains the perturbed features to maximize the mutual information with the original features.
PGExplainer [24] aims to approximate discrete masks for edges to explain the predictions. Its interpreter leverages the edge embeddings to predict the selected probability of each edge, which can be treated as the importance score of the model.

[Figure 2 illustrates the log parsing procedure: offline, historical normal logs are parsed into a base of standard templates; online logs are matched against these templates, and new (unmatched) log templates are analyzed.]
Fig. 2: The Overview of Log Parsing

D. Motivation

DeepTraLog is a deep-learning model (GGNN+SVDD) [20] that learns the latent representation and describes the hypersphere boundary that distinguishes the outliers. It ignores the service instances' performance status and only focuses on the services' interaction behavior, so it can only detect anomalies that are reflected in traces and log messages. Besides, the outliers, which are decided by a high-dimensional hypersphere, are difficult for engineers to understand and trust.
On the other hand, SCWarn can evaluate the services' performance status and detect whether there are performance anomalies. However, it does not consider the services' dependency relationships, which describe the behaviors of the individual service instances involved in a specific trace. Therefore, this approach cannot well capture microservice anomalies that are reflected by traces. Due to the model structure of LSTM, this method also cannot break away from the simple binary black-box result.
Though existing works try to explore multi-source anomaly detection, they are still unable to take advantage of the three data sources at the same time. Besides, many of them that leverage deep learning only focus on precision and ignore the result's interpretability, which makes it hard for operations engineers to understand the simple binary prediction result and confirm the anomaly as soon as possible.
Motivated by the limitations of the above mentioned algorithms, we design a new fusion method for the three data sources and generate a unified graph representation that describes the structure of traces together with logs and metrics to improve anomaly detection precision. We design a new anomaly detection method based on deep learning that matches our unified graph representation and helps to maximize the precision of anomaly detection. To enjoy the satisfying precision of deep learning and tackle the black-box problem, we design the interpreter to establish operations engineers' trust in the model decision with humanly understandable explanations.

[Figure 3 summarizes Triple's pipeline. Offline training: raw logs, metrics and traces collected from the microservice system are processed by log parsing and analyzing, combined per sample interval into the unified graph representation, and used to train the STGCN (temporal gated convolution and spatial graph convolution blocks followed by an MLP) jointly with SVDD. Online: data collected from the cloud infrastructure (web applications, services and agents, database) is turned into the same graph representation, the trained STGCN + SVDD performs anomaly detection, and the interpreter translates the original result into important nodes and important features, which are fed back to the operations engineers, who solve the problem and update the model.]
Fig. 3: An overview of Triple's pipeline

III. OUR APPROACH: Triple

A. Overview

The overview of Triple is presented in Figure 3, which includes five main steps: Log Parsing and Analyzing, Generating the unified graph representation, Model Training, Anomaly Detection and the Interpreter. The key idea of Triple is to support unsupervised anomaly detection by dealing with the three heterogeneous data sources and to provide humanly understandable results. Specifically, the first step of Triple transforms heterogeneous logs into unified time-series data. Then we generate the unified graph representation from traces, processed logs and metrics. The STGCN based Deep SVDD model maps the graph representation into a hypersphere that can distinguish the outliers. Finally, the classification result is translated into a humanly understandable result for operations engineers.

B. Log Parsing and Analyzing

Drain is one of the most popular log parsing approaches for log anomaly detection. It is an efficient and accurate online log parsing method that can parse logs in a streaming and timely manner. Drain uses a fixed-depth parse tree, which encodes specially designed rules for parsing, and generates log templates. Figure 2 shows the procedure of parsing the log templates and analyzing the variation of the templates. First, we parse the normal history log messages offline to construct a standard template base with Drain [18]. These normal log templates are the standard used to check whether the online log messages contain new templates. Then, we parse the online log messages into a stream of time-series log templates. The online log stream is matched against the template base. We record the quantity and kinds of unmatched log templates and other features, and then aggregate these features by time interval. To help generate the unified graph representation together with the metrics, the log aggregation interval is the same as the metrics' sample interval.
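For illustration, the per-interval log features can be computed with a simple matching pass once the offline template base exists. The sketch below is only a rough illustration, not Triple's released implementation [9]: it assumes templates use `<*>` wildcards (as produced by Drain-style parsers), and the helper names (`template_to_regex`, `aggregate_log_features`) are hypothetical.

```python
import re
from collections import Counter, defaultdict

def template_to_regex(template: str) -> re.Pattern:
    # Turn a Drain-style template ("severity: info, message: <*> <*>") into a regex,
    # where each <*> wildcard matches one whitespace-delimited token.
    escaped = re.escape(template)
    return re.compile("^" + escaped.replace(re.escape("<*>"), r"\S+") + "$")

def aggregate_log_features(logs, template_base, interval_s=30):
    """logs: iterable of (timestamp, message); template_base: list of template strings.
    Returns {interval_start: {"unmatched_count", "unmatched_kinds", "total"}}."""
    patterns = [template_to_regex(t) for t in template_base]
    buckets = defaultdict(Counter)
    unmatched_msgs = defaultdict(set)
    for ts, msg in logs:
        start = int(ts) - int(ts) % interval_s          # align to the metric sample interval
        buckets[start]["total"] += 1
        if not any(p.match(msg) for p in patterns):
            buckets[start]["unmatched_count"] += 1
            unmatched_msgs[start].add(msg)              # proxy for the "kinds" of new templates
    return {s: {"unmatched_count": c["unmatched_count"],
                "unmatched_kinds": len(unmatched_msgs[s]),
                "total": c["total"]} for s, c in buckets.items()}
```

Aggregating on the same interval as the metrics is what later allows the log-derived counts to sit next to the metric values in a single node feature vector.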
C. Generating Unified Graph Representation

In the previous subsection, we transferred the log messages into a data format that is similar to metrics. To combine traces with this data format, we need to convert the spans of traces into time-series data [21]. Figure 4 shows the trace parsing procedure. We aggregate all spans by time interval, which is the same as the metrics' sample interval. Then, the aggregated traces construct the dependency graph for this time interval. Every service instance is a node of the graph, and the node's hidden features are this instance's metric and log features. Besides, we also extract the average response time of every service instance from the traces and embed it into the node's hidden features.
In our approach, the traces are merged into a directed attributed graph G = (V, E). To better capture the temporal relationship, the node set generated from the metric and log features, V = {v_t^i | t = 1, ..., τ, i = 1, ..., N}, contains all N service instances for all τ continuous time intervals before and after the current time. We construct the directed graph by merging all τ time intervals' spans.
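A minimal sketch of assembling one such graph sample is shown below, assuming the metric and log features have already been aggregated per interval as above. The function name `build_unified_graph` and the exact input layout are hypothetical, not taken from the released code [9].

```python
import numpy as np

def build_unified_graph(spans, node_metrics, log_features, instances, tau):
    """spans: list of (t, caller, callee, duration) with t in [0, tau) interval indices;
    node_metrics[t][inst] / log_features[t][inst]: per-interval feature vectors.
    Returns (A, X): adjacency A in {0,1}^{N x N}, node features X of shape [N, tau, M]."""
    idx = {inst: i for i, inst in enumerate(instances)}
    N = len(instances)
    A = np.zeros((N, N), dtype=np.float32)
    resp = np.zeros((N, tau), dtype=np.float32)
    cnt = np.zeros((N, tau), dtype=np.float32)
    for t, caller, callee, duration in spans:
        A[idx[caller], idx[callee]] = 1.0        # merge spans of all tau intervals into one graph
        resp[idx[callee], t] += duration         # accumulate the callee instance's response time
        cnt[idx[callee], t] += 1.0
    avg_resp = resp / np.maximum(cnt, 1.0)       # average response time per instance and interval
    feats = []
    for i, inst in enumerate(instances):
        per_t = [np.concatenate([node_metrics[t][inst],
                                 log_features[t][inst],
                                 [avg_resp[i, t]]]) for t in range(tau)]
        feats.append(np.stack(per_t))            # [tau, M]
    return A, np.stack(feats)                    # X: [N, tau, M]
```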
D. Model Training

In the scenario of microservice anomaly detection, the majority of the training dataset consists of normal data. Therefore, our model aims to describe the training data, which is considered to belong to the same class. In this task, we need to learn a boundary that can separate most of the normal data from the outliers.

[Figure 4 illustrates how the graph representation is generated: for each sample interval, the metrics and log features of every service instance are combined into node feature vectors, and the trace spans of the interval determine the edges among the instances.]
Fig. 4: The procedure of generating graph representation

Triple leverages the STGCN to construct the latent representation of the input features and Deep SVDD [30] to learn the latent representations of the data together with the one-class classification objective. The STGCN leverages a graph convolutional network (GCN) and a convolutional neural network (CNN) to capture the spatial and temporal features. Deep SVDD employs a deep learning model (the STGCN) that is jointly trained to map the data into a hypersphere. The hypersphere's boundary helps us to separate the normal data from the outliers. Existing anomaly detection methods based on deep learning usually leverage the reconstruction error [23] or the deviation of an anomalous value [15]. Different from them, Deep SVDD learns the latent features of the data and describes the boundary used for anomaly detection. The original Deep SVDD leverages a CNN to learn the latent features and map the data into a hypersphere. However, this is not enough to explore the three data sources in our scenario. Therefore, we provide Triple, which leverages the STGCN to capture the three data sources from the spatial and temporal dimensions and jointly trains it with Deep SVDD.
In our approach, we choose the STGCN to extract the latent features. An STGCN consists of a Temporal Block, a Spatial Block and an Output Block. Fig. 5 shows the full architecture of Triple's STGCN module. It leverages three temporal blocks to extract time-series features. Every temporal block is followed by one batch-normalization (BN) layer to maintain the consistency of the data distribution. Then, the input is processed by three spatial blocks that are composed of 1-D graph convolutions. Finally, we leverage an MLP composed of two fully-connected layers to output the hidden representation of the input features. We introduce each block in detail in the following parts.

[Figure 5 shows the architecture of the STGCN: the input passes through three temporal blocks (1-D CNN convolution with GLU, each followed by BN), three spatial blocks (GCN convolution), and an MLP that outputs the hidden features.]
Fig. 5: The Architecture of STGCN

1) Temporal Block: Although LSTM-based models are effective in time-series analysis for anomaly detection [15], they require time-consuming iterations, which are slow to respond to dynamic changes. On the contrary, a CNN has the advantage of faster training and requires no dependency constraints on previous steps. Therefore, the temporal block leverages a 1-D convolution with a gated linear unit (GLU) to capture the time-series relationships of the node features. Figure 6 shows how the 1-D convolution works and captures the relationship between different time intervals for every metric. The temporal convolution layer contains a 1-D convolution with a width-K_t kernel followed by a gated linear unit (GLU) for non-linearity. For each node in the graph G, the 1-D temporal convolution explores K_t time intervals of the input elements without padding. The 1-D convolution captures the hidden relationships between different time intervals. Thus, the node features of the input can be regarded as a length-τ sequence with M features and C_i channels, X ∈ R^{M×τ×C_i}. The convolution kernel is Γ ∈ R^{K_t×C_i×C_o}. The temporal gated convolution is defined as Eq. 1:

Γ(X) = P ⊙ σ(Q) ∈ R^{(τ−K_t+1)×C_o}    (1)

where P and Q are the inputs of the gates in the GLU, ⊙ denotes the element-wise Hadamard product, and σ is the sigmoid function.
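A minimal PyTorch sketch of the gated temporal convolution of Eq. 1 is shown below. The [batch, channels, nodes, time] tensor layout is an assumption chosen for illustration, and the class name is hypothetical rather than taken from the released code [9].

```python
import torch
import torch.nn as nn

class TemporalGatedConv(nn.Module):
    """Width-K_t 1-D temporal convolution with a GLU gate: Γ(X) = P ⊙ σ(Q)."""
    def __init__(self, c_in: int, c_out: int, kt: int):
        super().__init__()
        # One conv produces both gate inputs P and Q (2 * c_out channels),
        # sliding only along the time axis with a width-kt kernel and no padding.
        self.conv = nn.Conv2d(c_in, 2 * c_out, kernel_size=(1, kt))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, c_in, N_nodes, tau]
        p, q = self.conv(x).chunk(2, dim=1)   # each [batch, c_out, N, tau - kt + 1]
        return p * torch.sigmoid(q)           # GLU: P ⊙ σ(Q)

# Example: 2 input channels, 16 output channels, kernel width 3,
# a batch of 8 graphs with 120 nodes and tau = 12 intervals.
x = torch.randn(8, 2, 120, 12)
out = TemporalGatedConv(2, 16, 3)(x)          # shape [8, 16, 120, 10]
```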
2) Spatial Block: In order to detect anomalies in the spatial domain, we apply a graph convolutional network to learn the features. Given a directed graph G = (V, E), the set V contains N nodes and E represents the set of edges. We calculate the graph Laplacian L = D − W and the eigendecomposition L = U Λ U^T, where D_{ii} = Σ_j E_{ij} is a diagonal matrix whose i-th diagonal element is the degree of node i, and U is an orthogonal matrix.
A signal on the graph G of N nodes can be described as a matrix X ∈ R^{N×C_in} consisting of C_in vectors of size N. The graph convolution kernel is expressed as:

Θ(L)X = Σ_{i∈[1,c_in], k∈[1,K]} θ_{ijk} L^k x_i,   j = 1, 2, ..., c_out    (2)

where Θ ∈ R^{K×C_in×C_out} and K is the kernel size of the graph convolution, which determines the maximum radius of the convolution from the central nodes.
previous steps. Therefore, the temporal block leverages 1-D 3) Output Block: The Output Block aims to output the
convolution with the gated linear unit(GLU) to capture time- hidden features vector of the graph representation. We design
series relationship from node features. Figure.6 shows the the output Block that feeds the features H to a Multi-Layer
1-D convolution how to work and capture the relationship Perceptron(MLP) as Eq.3.
between different time interval for every metrics. The temporal
convolution layer contains a 1-D convolution with a width-Kt ho = M LP (H) (3)

[Figure 6 sketches the 1-D temporal convolution: for each of the M metrics, a kernel slides over the τ time frames.]
Fig. 6: The 1-D temporal convolution

4) Deep SVDD Module: In the previous sections, we introduced how our STGCN model learns hidden features from the spatial and temporal dimensions. The Deep SVDD module is leveraged to optimize our STGCN with the following loss function, which expresses the objective of learning a minimized hypersphere that encloses the hidden feature vectors of the graph representations:

Loss = R^2 + (1/µ) max{0, ||h_o − c||^2 − R^2} + (λ/2) Σ_{l=1}^{L} ||W^l||_F^2    (4)

where R is the radius of the hypersphere; c is the center of the hypersphere; ||h_o − c||^2 is the distance from the latent representation of a unified graph sample to c; L is the number of network layers in the STGCN; the hyperparameter µ controls the trade-off between the hypersphere volume and violations of the boundary, allowing some data to be mapped outside the hypersphere; and (λ/2) Σ_{l=1}^{L} ||W^l||_F^2 is a regularizer on the STGCN's parameters W with hyperparameter λ.
Optimizing objective 4 lets the network learn parameters W such that data points are closely mapped to the center c of the hypersphere. We leverage Adam [19] to optimize the STGCN's parameters W, the radius R and the center c. We train the STGCN parameters W for some k epochs while the radius R is fixed. Then, after every k-th epoch, we recalculate the radius R from the data representations produced by the STGCN with the latest parameters W. Every time we calculate R, we choose the value that contains the (1 − µ) percentile of all training data according to the distance. The hypersphere center c is set to the mean of the vector representations of all training data after one initial forward pass.
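The pieces of this training procedure can be sketched in a few lines of PyTorch. This is a sketch of the soft-boundary Deep SVDD formulation described above, not the paper's released code; the helper names and default hyperparameters are chosen for illustration only.

```python
import torch

def svdd_loss(h, c, R, mu=0.05, lam=1e-4, weights=()):
    """Eq. (4) for a batch of latent vectors h produced by the STGCN."""
    dist = torch.sum((h - c) ** 2, dim=1)                  # ||h_o - c||^2
    penalty = torch.clamp(dist - R ** 2, min=0).mean()     # boundary violations
    reg = sum((W ** 2).sum() for W in weights)             # Frobenius regularizer
    return R ** 2 + (1.0 / mu) * penalty + (lam / 2.0) * reg

@torch.no_grad()
def init_center(h):
    # Center = mean latent representation after one initial forward pass.
    return h.mean(dim=0)

@torch.no_grad()
def update_radius(h, c, mu=0.05):
    # Radius chosen so that a (1 - mu) fraction of the training data lies inside the sphere.
    dist = torch.sqrt(torch.sum((h - c) ** 2, dim=1))
    return torch.quantile(dist, 1.0 - mu)

@torch.no_grad()
def anomaly_score(h, c, R):
    # Positive score => the sample falls outside the hypersphere and is flagged as anomalous.
    return torch.sum((h - c) ** 2, dim=1) - R ** 2
```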
E. Anomaly Detection

Given a new time interval's traces, logs and metrics, Triple follows the same data processing procedure as in the training phase and generates the unified graph representation. Then, Triple feeds the graph to the STGCN to generate the latent representation, and Deep SVDD calculates the distance between the latent representation and the hypersphere center. The calculation is shown in Eq. 5:

Distance(h_o) = ||h_o − c||^2 − R_f^2    (5)

where h_o is the latent representation vector, c is the center of the hypersphere, and R_f is the final radius of the hypersphere. In fact, when the data's latent representation vector falls outside the hypersphere's boundary, the distance is greater than 0 and we treat the data as anomalous.

F. Interpreter

Given a trained deep-learning model and a specific anomaly instance, our Interpreter generates an explanation by identifying the most important features. The interpreter aims to confirm the important features by maximizing the similarity between the perturbed result and the original result. Intuitively, the interpreter first designs soft mask matrices and combines them with the original features. Then, the interpreter calculates the loss between the masked result and the original prediction result. The loss updates the mask matrices by gradient descent to make the masked features' prediction result more similar to the original result. When the loss converges, the soft masks record every feature's importance.
Given the original graph dependency relationships G_C and the graph's node features X_C = {m_{tij} | t = 1, 2, ..., τ, i = 1, 2, ..., M, v_j ∈ G_C}, we aim to find the most important subgraph G_S ⊆ G_C and identify the associated features X_S = {m_{tik} | t = 1, 2, ..., τ, i = 1, 2, ..., M, v_k ∈ G_S}.
For example, consider the situation where a CPU anomaly occurs at service v_k. If the interpreter removes an edge between v_k and v_j and this decreases the probability of the anomaly prediction, the edge is a good counterfactual explanation for the current prediction. Similarly, if weakening the feature m_{t'i'j'} ∈ X also decreases the probability of the current prediction, then m_{t'i'j'} is a good counterfactual explanation.
We leverage mutual information (MI) to formulate the optimization framework as Eq. 6:

max_{G_S} MI(Y, (G_S, X_S)) = H(Y) − H(Y | G = G_S, X = X_S)    (6)

The MI quantifies the change of the prediction when the graph dependency relationships and the features are changed to the sub-dependency and sub-features.
1) The Mask of Graph Dependency: The graph mask M_G ∈ R^{N×N} is combined with the computation graph's adjacency matrix as A ⊙ σ(M_G), where ⊙ denotes element-wise multiplication. We optimize the graph mask matrix by gradient descent as follows:

min_{M_G} −1[y = 1] log P_Φ(Y = y | G = A ⊙ σ(M_G), X = X_S)    (7)

We evaluate the dependency importance with the graph mask: every dependency's weight in the graph mask is ranked.
2) The Mask of Node Features: To identify which node features impact the prediction result, we leverage the feature mask M_F to retain the important features. Intuitively, the mask of node features is combined with the node features and updated by gradient descent as follows:

min_{M_F} −1[y = 1] log P_Φ(Y = y | G = G_S, X = X ⊙ σ(M_F))    (8)

We initialize the feature mask using the Normal distribution. After finishing the update procedure of the masks, we rank every feature's importance by the mask matrix.
Different from traditional classification tasks, our model performs unsupervised anomaly detection, which needs to describe the hypersphere and detect the outliers. So we modify the loss function to calculate the distance between the original prediction coordinates (the latent feature representation) and the perturbed coordinates. Therefore, we calculate the objective as follows:

min_{M_total} −1[y = 1] log P_Φ(Distance(h, ĥ) | G = A ⊙ σ(M_G), X = X ⊙ σ(M_F))    (9)

where Distance(h, ĥ) = ||h − ĥ||^2, h and ĥ are the latent feature representations with respect to the hypersphere, and y = 1 represents the anomalous label.
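A compact sketch of this mask-learning procedure is given below. It assumes an already-trained callable `model(A, X) -> h` wrapping the STGCN, which is a hypothetical interface; the small sparsity term that keeps the masks from staying trivially at one is a common choice in mask-based explainers and an assumption on our part, not something stated in the paper.

```python
import torch

def explain_instance(model, A, X, steps=300, lr=0.05):
    """Learn soft masks M_G (edges) and M_F (node features) by gradient descent so
    that the masked input keeps the latent representation close to the original one."""
    with torch.no_grad():
        h_orig = model(A, X)                               # original latent representation
    m_g = torch.zeros_like(A, requires_grad=True)          # edge mask logits
    m_f = torch.randn(X.shape[-1], requires_grad=True)     # feature mask logits (Normal init)
    opt = torch.optim.Adam([m_g, m_f], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        a_masked = A * torch.sigmoid(m_g)                  # A ⊙ σ(M_G)
        x_masked = X * torch.sigmoid(m_f)                  # X ⊙ σ(M_F), broadcast over features
        h_masked = model(a_masked, x_masked)
        loss = torch.sum((h_masked - h_orig) ** 2)         # keep the masked result close
        # Assumed sparsity term so the masks highlight only a few edges / features.
        loss = loss + 1e-3 * (torch.sigmoid(m_g).sum() + torch.sigmoid(m_f).sum())
        loss.backward()
        opt.step()
    # Higher mask weights indicate more important edges / features.
    return torch.sigmoid(m_g).detach(), torch.sigmoid(m_f).detach()
```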
3) Interpretable Analysis: When an anomaly is predicted by Triple, an alert is sent to engineers so that they can adopt some proactive actions in advance. To facilitate engineers in handling the warning, Triple provides an interpretable report about the alert, derived from the original mask scores. Due to the space limit, we omit the detailed interpretable report cases, which can be found in our appendix. Different from traditional interpretable results [29], [39], we design the Translation Logs Module for our interpreter to improve the readability of the log features. As shown in Figure 7, when the interpreter marks the log-related features of a node, the Translation Logs Module is triggered and picks up the unmatched templates from the current online log templates. The unmatched templates are written into the related-logs part of the interpretable report.

[Figure 7 shows the Translation Logs Module: the node-feature ranking (e.g., CPU, network, process, fs_read, unmatched_template) triggers the module, which matches the current-time logs against the offline normal log template base and writes the top unmatched templates (e.g., "socket.gaierror: [Errno <*>] Temporary failure in name resolution") into the related-logs part of the report.]
Fig. 7: The Translation Logs Module

Therefore, with an interpretable report, engineers not only benefit from early alerts of incidents, but also gain some inspiration for incident diagnosis from the features of the three data sources.

IV. EVALUATION

A. Experiment Dataset

We rely on two datasets in our evaluation: (i) the Hipstershop system constructed by China Construction Bank [4]; (ii) an open-source microservice benchmark system for the Train-Ticket service [44].
Dataset A is constructed by China Construction Bank based on the Hipstershop system [1]. The dataset comes from the AIOPS 2022 [4] international challenge championship. They deploy three sub-systems on Kubernetes [6], and faults are randomly injected into the three sub-systems, which contain 120 instances. The faults include 9 different types, and we show the fault distribution in Table I.
Dataset B is taken from an open-source microservice benchmark system for a train ticket service. The system is widely used in existing works [21], [44], [45] and contains 41 microservices. We build the system and inject 175 faults related to microservices.

TABLE I: Overview of fault types of the datasets

Dataset    Fault type                  Cases  Percent
Dataset A  File system Read            26     12%
           File system Write           11     5%
           Network Delay               29     14%
           Network Packet Corruption   31     14%
           Network Packet Duplicate    22     10%
           Network Drop Packet         28     13%
           CPU load                    27     13%
           Memory load                 25     12%
           Container Pause             15     7%
Dataset B  Network-delay               56     34%
           CPU                         51     31%
           Container Pause             58     35%

B. Experiment Settings

We implement Triple using Python 3.7 and PyTorch 1.13.0. All the experiments are conducted on a server with a 3070 Ti GPU, 32 GB RAM and a 5800X processor with 6 cores. The detailed settings of Triple are the following.
Following existing work [37], [40], [42], [43], we use the three following metrics to evaluate the model's effectiveness:
• Precision measures how many of the returned abnormal results are correct.
• Recall measures how many of the relevant results are returned.
• F1-Score combines the precision and recall of a model into a single metric by taking their harmonic mean.
Besides, we design a new evaluation method to confirm that our interpreter gives the best interpretation result compared with other interpreters:
• Label flipping ratio (LFR): replace the picked important features with historical normal values and test whether the anomaly instance recovers to the normal status (see the sketch below).
• Overhead: the time consumed by the interpreter.
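The LFR computation can be sketched as follows. Everything here is an assumption made for illustration: `model.score` stands for the anomaly score of Eq. 5, the inputs are assumed to be torch tensors, and none of the names come from the released code [9].

```python
def label_flipping_ratio(model, anomalies, normal_baseline, top_k_features):
    """For each detected anomaly, replace the interpreter's top-ranked features with
    historical normal values and check whether the model stops flagging it."""
    flipped = 0
    for A, X, important in anomalies:                 # important: feature indices picked by the interpreter
        X_repaired = X.clone()
        for feat in important[:top_k_features]:
            X_repaired[..., feat] = normal_baseline[..., feat]   # substitute a historical normal value
        if model.score(A, X_repaired) <= 0:           # score <= 0: inside the hypersphere, i.e. normal
            flipped += 1
    return flipped / max(len(anomalies), 1)
```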
1) The Model Effectiveness Comparison: Table II shows the detailed effectiveness of all anomaly detection algorithms. Across the 214 faults in the open-source dataset A, Triple outperforms all other algorithms in anomaly detection. Most approaches perform badly on this dataset, since it has nine kinds of fault types and 120 service instances. These fault types are hard to fully detect from a single source, which is also why the other algorithms have poor performance on this dataset.

TABLE II: Overview comparison of anomaly detection approaches

Dataset  Algorithm                   Precision  Recall  F1-Score
A        Triple(Trace+Log+Metric)    0.89       0.87    0.88
         Triple(Trace+Log)           0.74       0.68    0.66
         Triple(Trace+Metric)        0.80       0.77    0.78
         DeepLog                     0.54       0.49    0.51
         SCWarn                      0.80       0.81    0.80
         TraceAnomaly                0.65       0.63    0.64
         DeepTraLog                  0.76       0.73    0.74
B        Triple(Trace+Log+Metric)    0.97       0.98    0.97
         Triple(Trace+Log)           0.91       0.95    0.93
         Triple(Trace+Metric)        0.94       0.95    0.94
         DeepLog                     0.78       0.76    0.77
         SCWarn                      0.93       0.94    0.93
         TraceAnomaly                0.85       0.84    0.84
         DeepTraLog                  0.89       0.88    0.84

Different from dataset A, due to the simpler fault types and fewer service instances in dataset B, all anomaly detection algorithms have more similar performance results; Triple outperforms the other algorithms by only 2%-20%. This also proves that our model is more valuable for complicated and large real-system scenarios.
TraceAnomaly leverages traces to perform the detection, which achieves low precision and recall on dataset A. As it has a special focus on the response time in trace data, it can only detect anomalies that have a significant impact on the distribution of the services' response time.
DeepTraLog and DeepLog are approaches based on logs. These two approaches consider log events as time series and can thus detect anomalies in log events. However, they ignore the services' response time and performance status.
SCWarn cannot capture the fluctuation of the services' response time in the traces. In many anomalies, the fluctuating response time is the key common information for confirming the anomaly.
To further prove the advantage of the three data sources in our model, we design two ablation models (Triple(Trace+Metric) and Triple(Trace+Log)) that only leverage traces and metrics, or traces and logs, for the unified graph representation. The results show that our unified graph representation, which leverages three data sources, improves our model's precision by 11%-20%.
2) Anomaly Detection Model Overhead Comparison: Compared with dataset B (5 GB), dataset A is bigger (60 GB), including more than 4,000,000 traces, 918,022 (20%) of which are affected by the faults. Thus, we focus on all algorithms' overhead on dataset A, which is closer to the complex and variant real scenario. Figure 8 shows the offline training time of every algorithm on dataset A. These algorithms take 15-89 minutes to train the model offline.

[Figure 8 is a bar chart of the offline training time (in seconds) of Ours, DeepTraLog, SCWarn, TraceAnomaly and DeepLog, with values ranging from 894 s to 2763 s.]
Fig. 8: The training time for every algorithm

Compared with the other algorithms, Triple consumes much less time, because the other algorithms need to detect anomalies trace by trace or log by log, whereas our approach aggregates these data by time interval, which reduces the time consumption.
DeepTraLog is slower than Triple in both training and testing. The reason is that GGNNs need to treat every event in the graph as a network unit and conduct message passing, which means every trace and its related log events need to be processed by the model individually. On the contrary, Triple aggregates the trace features by time intervals and reduces the time consumption.
The LSTM-based SCWarn and DeepLog use far fewer features than Triple but are slower than Triple, because they leverage an LSTM to capture time-series features, which is slow in iterating the cell state for each time step. We replace this procedure with a CNN+LSTM structure, which helps to reduce the time consumption by 23%.
TraceAnomaly only considers traces, which helps it to be faster than most algorithms in our experiment, but it sacrifices precision in the prediction.
Going a step further, though the training time does not affect the efficiency of online detection, we analyze it to demonstrate scalability. We analyze the training time per epoch with respect to the number of fault instances and of node features on dataset A. Figure 9 shows that the training time of Triple can decrease by 70% with GPU acceleration, and that the extended features do not cause an unbearable impact on the training time overhead. The experiment proves that our model has good scalability.

[Figure 9 plots the per-epoch training time with and without GPU, against the number of fault instances (left) and the number of node features (right).]
Fig. 9: The training time for every epoch

The testing time of online prediction is important for achieving real-time anomaly detection. Different from Triple, which extracts data based on the sample interval, DeepTraLog aims to detect whether every single trace is anomalous. So, to further analyze the advantages and disadvantages of these algorithms, we design an experiment to describe the testing time in terms of the response time and the total testing time. The response time expresses how long the model takes to finish one prediction (one trace or one unified graph representation) for the different algorithms.

The total testing time is the total time consumed to predict all the data. We replay 15 minutes of data, including 5 minutes of abnormal data and 10 minutes of normal data. Triple and SCWarn generate 30 time intervals of data and detect whether these data contain anomalies. TraceAnomaly, DeepTraLog and DeepLog detect the anomaly of every trace and log. The testing time of DeepTraLog highly depends on the size of the trace, i.e., the number of span/log events in it. We follow the detailed evaluation procedure in the DeepTraLog paper [37], which calculates the average prediction time.
Figure 10 shows the detailed results of the response time and the total testing time for every algorithm. Though the response time is similar among all the algorithms, Triple has a great advantage in total time consumption, because the numbers of traces and logs seriously impact the total time consumption of the other algorithms. This means Triple is better suited for tremendous systems that generate gigabytes (GB) of traces and logs per minute or even per second.

[Figure 10 is a log-scale bar chart comparing Ours, DeepTraLog, DeepLog, SCWarn and TraceAnomaly on the total testing time and the once-response time, in seconds.]
Fig. 10: The testing time for every algorithm

3) The Interpreter Comparison: We compare our interpreter with the GAT and GRAD methods. The GAT method originates from the graph attention network (GAT) [33] and ASTGCN [17], which learn attention weights for the edges in the dependency graph; we use these weights as a proxy measure of edge importance. GRAD is a gradient-based method: we compute the gradient of our model's loss function with respect to the adjacency matrix and the associated node features, which is similar to the saliency mapping approach. Because our method needs to be trained until convergence, Figure 11 reports the average convergence epochs of the 120 faults in dataset A. When the number of training epochs reaches 300, more than 99% of the faults have converged, so we use the time consumption of 300 epochs of our model as its Overhead.

[Figure 11 plots the convergence ratio of the interpreter against the number of training epochs (0-400).]
Fig. 11: The training epochs for convergence

Table III shows the comparison results of LFR and Overhead in detail. Different from our interpreter, GAT and GRAD generate nearly no overhead and can give the interpretable result immediately, because they leverage the gradient or the attention weights to rank the features. However, GRAD is based on the heuristic assumption that the gradient reflects the feature importance, which may not be true. GAT does not consider node features and can only explain edges, which means engineers cannot understand the anomaly from specific metric and log variations. The lower LFR results of these methods also prove their problem. In addition, compared with Triple, the GRAD and GAT interpretable reports also lose the information of the node features (related metrics and related logs) or of the dependency graph (related dependencies).

TABLE III: Overview comparison of interpreters

Dataset    Algorithm  LFR   Overhead (seconds)
Dataset A  Triple     0.89  5.95
           GAT        0.78  0.34
           GRAD       0.67  0.42
Dataset B  Triple     0.85  4.95
           GAT        0.74  0.28
           GRAD       0.84  0.40

V. DISCUSSION

A. The Log Template Change of a Service

Though Triple is the first method to perform anomaly detection with the three data sources and has better performance than the other SOTA methods, it still cannot leverage log messages well enough. When a service's logging statements are updated and generate new log templates, our method may judge the new templates as anomalies and cause false positives. In future work, we hope to improve the adaptivity of our algorithm and avoid such false positives.

B. The Static Graph Pattern

Though Triple has brilliant precision for anomaly detection, it is constrained by static graph patterns. When the microservice system adds new services, the STGCN must be retrained on the new graph pattern, which increases the overhead. In future work, we will try to modify our model to fit dynamic graph patterns.

VI. CONCLUSION

In this paper, we propose Triple, an anomaly detection approach for microservice systems based on a deep learning model. We first leverage three data sources to generate the unified graph representation. Then, we design an STGCN based Deep SVDD model which extracts and learns the latent features from traces, logs and metrics. The Deep SVDD constructs a data-enclosing hypersphere and confirms the anomalous data according to the hypersphere's boundary. We evaluate Triple on an open-source dataset constructed by China Construction Bank based on the Hipstershop system. The results show that our approach outperforms SOTA anomaly detection algorithms. In the future, we will go a step further to improve the precision and support more fault types.
VII. ACKNOWLEDGMENT

This work was supported in part by the National Key R&D Program of China: 2022YFB3103000, the Natural Science Foundation of China (U20A20180, 62072437), and the CAS-Austria Joint Project (171111KYSB20200001).

REFERENCES

[1] Hipster-Shop, October 2019. https://github.com/GoogleCloudPlatform/microservices-demo.
[2] Elasticsearch, October 2022. https://www.elastic.co/.
[3] High-quality, ubiquitous, and portable telemetry to enable effective observability, October 2022. https://opentelemetry.io/.
[4] International AIOps Challenge, March 2022. http://iops.ai/.
[5] Jaeger: open source, end-to-end distributed tracing, October 2022. https://www.jaegertracing.io/.
[6] Kubernetes Production-Grade Container Orchestration, October 2022. https://kubernetes.io/.
[7] Prometheus, October 2022. https://prometheus.io/.
[8] Apache SkyWalking, October 2022. http://skywalking.apache.org/.
[9] Triple's source code, October 2022. https://github.com/TripleResearcher/Triple.
[10] Federico Baldassarre. Explainability techniques for graph convolutional networks. arXiv preprint arXiv:1905.13686, 2019.
[11] Betsy Beyer and Chris Jones. Site reliability engineering: How Google runs production systems. O'Reilly Media, Inc., 2016.
[12] Junjie Chen and Xiaoting He. An empirical investigation of incident triage for online service systems. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE-SEIP), pages 111–120. IEEE, 2019.
[13] Junjie Chen, Xiaoting He, Qingwei Lin, et al. Continuous incident triage for large-scale online service systems. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 364–375. IEEE, 2019.
[14] Yujun Chen, Yang, et al. Identifying linked incidents in large-scale online service systems. In Proceedings of the 28th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 304–314, 2020.
[15] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017.
[16] Yu Gan. Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 19–33, 2019.
[17] Shengnan Guo, Youfang Lin, et al. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929, 2019.
[18] Pinjia He, Jieming Zhu, and Zibin Zheng. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE International Conference on Web Services (ICWS), pages 33–40. IEEE, 2017.
[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] Yujia Li. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[21] Zeyan Li, Junjie Chen, et al. Practical root cause localization for microservice systems via trace analysis. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQoS), pages 1–10. IEEE, 2021.
[22] Zeyan Li, Wenxiao Chen, and Dan Pei. Robust and unsupervised KPI anomaly detection based on conditional variational autoencoder. In 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), pages 1–9. IEEE, 2018.
[23] Ping Liu, Haowen Xu, and Qianyu Ouyang. Unsupervised detection of microservice trace anomalies through service-level deep Bayesian networks. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), pages 48–58. IEEE, 2020.
[24] Dongsheng Luo, Wei Cheng, et al. Parameterized explainer for graph neural network. Advances in Neural Information Processing Systems, 33:19620–19631, 2020.
[25] Ajay Mahimkar. Rapid detection of maintenance induced changes in service performance. In Proceedings of the Seventh Conference on Emerging Networking Experiments and Technologies, pages 1–12, 2011.
[26] Sonu Mehta and Ranjita Bhagwan. Rex: Preventing bugs and misconfiguration in large services using correlated change analysis. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 435–448, 2020.
[27] Animesh Nandi and Atri Mandal. Anomaly detection using program control flow graph mining from execution logs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 215–224, 2016.
[28] Phillip E Pope and Soheil Kolouri. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10772–10781, 2019.
[29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[30] Lukas Ruff, Robert Vandermeulen, and Nico Goernitz. Deep one-class classification. In International Conference on Machine Learning, pages 4393–4402. PMLR, 2018.
[31] Michael Sejr Schlichtkrull, Nicola De Cao, and Ivan Titov. Interpreting graph neural networks for NLP with differentiable edge masking. arXiv preprint arXiv:2010.00577, 2020.
[32] Ramprasaath R Selvaraju, Michael Cogswell, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[33] Petar Veličković et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[34] Zhitao Ying, Dylan Bourgeois, et al. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[35] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[36] Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji. Explainability in graph neural networks: A taxonomic survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[37] Chenxi Zhang, Xin Peng, and Chaofeng Sha. DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning. 2022.
[38] Mengshi Zhang and Yuqun Zhang. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 132–142. IEEE, 2018.
[39] Nengwen Zhao, Junjie Chen, et al. Real-time incident prediction for online service systems. In Proceedings of the 28th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 315–326, 2020.
[40] Nengwen Zhao, Junjie Chen, and Zhou Wang. Real-time incident prediction for online service systems. In Proceedings of the 28th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 315–326, 2020.
[41] Nengwen Zhao, Junjie Chen, Zhaoyang Yu, et al. Identifying bad software changes via multimodal anomaly detection for online service systems. In Proceedings of the 29th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 527–539, 2021.
[42] Nengwen Zhao, Panshi Jin, Lixin Wang, et al. Automatically and adaptively identifying severe alerts for online service systems. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 2420–2429. IEEE, 2020.
[43] Nengwen Zhao, Jing Zhu, and Yao Wang. Automatic and generic periodicity adaptation for KPI anomaly detection. IEEE Transactions on Network and Service Management, 16(3):1170–1183, 2019.
[44] Xiang Zhou, Xin Peng, et al. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study. IEEE Transactions on Software Engineering, 47(2):243–260, 2018.
[45] Xiang Zhou, Xin Peng, et al. Latent error prediction and fault localization for microservice applications by learning from system trace logs. In Proceedings of the 2019 27th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 683–694, 2019.