Triple: The Interpretable Deep Learning Anomaly Detection Framework Based on Trace-Metric-Log of Microservice
focus on the inter-services' relationship and cannot reflect the service instance's status well enough. On the contrary, the methods that only leverage metrics to detect anomalies can capture the service instance's status but lose the inter-services' relationship.

[Figure: Microservice system structure — a (caller, callee) pair of microservices is recorded as an invocation (span) of a trace; each microservice instance (container) is hosted on a physical node and generates log messages.]

Recent researches [16], [37] try to combine traces and metrics or traces and logs to detect runtime anomalies of microservice systems.
B. Anomaly Detection Prior Work Based on Multi-source Data

[Figure: Online log examples — Log 1: Time:1647794342, severity: info, message: order confirmation email sent to "[email protected]"; ...; Log M: Time:1647809826, ERROR: Opentelemetry.sdk.trace.export: Exception while exporting Span batch.]

However, the above mentioned works still cannot fully detect all types of anomalies due to the single-source data. Therefore,
[Fig. 3: The overview of Triple. Traces, logs, metrics and request/response data are collected from the microservice system (cloud web applications with service agents). Offline training covers Log Parsing and Analyzing, Generating unified graph representation over each trace sample interval, and Model Training with an STGCN (temporal Gated-Conv blocks and spatial Graph-Conv blocks followed by an MLP) jointly optimized with SVDD. Online testing detects anomalies on newly generated graphs, and the interpreter reports important nodes and features to operations engineers.]
III. OUR APPROACH: Triple

A. Overview

The overview of Triple is presented in Figure 3, which includes five main steps: Log Parsing and Analyzing, Generating unified graph representation, Model Training, Anomaly Detection and Interpreter. The key idea of Triple is to support unsupervised anomaly detection by dealing with three heterogeneous data sources and to provide humanly understandable results. Specifically, the first step of Triple is transforming heterogeneous logs into unified time-series data. Then we generate the unified graph representation from traces, processed logs and metrics. The STGCN-based Deep SVDD model maps the graph representation into a hypersphere that can distinguish the outliers. Finally, the classification result is translated into a humanly understandable report for operation engineers.

B. Log Parsing and Analyzing

Drain is one of the most popular log parsing approaches for log anomaly detection. It is an efficient and accurate online log parsing method that can parse logs in a streaming and timely manner. Drain uses a fixed-depth parse tree, which encodes specially designed rules for parsing and generates log templates. Figure 2 shows the procedure of parsing log templates and analyzing the variation of templates. First, we parse the normal historical log messages offline with Drain [18] to construct a standard template base. These normal log templates serve as the standard for checking whether the online log messages contain new templates. Then, we parse the online log messages to construct time-series log templates in a streaming fashion. The online log stream is matched against the template base. We record the quantity and kinds of unmatched log templates together with other features, and then aggregate these features by time interval. To help generate the unified graph representation together with the metrics, the log aggregation interval is the same as the metrics' sample interval.
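To make this step concrete, the following sketch shows how an online log stream could be matched against the offline template base and aggregated into per-interval features. It is a minimal example that assumes the drain3 library as the Drain implementation; the file name, the 60-second interval and the exact feature-vector layout are illustrative choices rather than Triple's released configuration.

from collections import defaultdict
from drain3 import TemplateMiner  # assumed Drain [18] implementation

# Offline: build the standard template base from normal history logs.
miner = TemplateMiner()
with open("normal_history.log") as f:            # hypothetical file name
    for line in f:
        miner.add_log_message(line.strip())

INTERVAL = 60  # seconds; chosen to equal the metrics' sample interval

def aggregate_log_features(online_logs):
    """online_logs: iterable of (timestamp, message) pairs from one service instance."""
    buckets = defaultdict(lambda: {"unmatched": 0, "kinds": set(), "total": 0})
    for ts, msg in online_logs:
        b = int(ts) // INTERVAL
        buckets[b]["total"] += 1
        if miner.match(msg) is None:             # no existing template matches => new template
            buckets[b]["unmatched"] += 1
            buckets[b]["kinds"].add(msg.split(":", 1)[0])   # crude proxy for the template "kind"
    # One numeric feature vector per time interval: unmatched-log count,
    # number of distinct unmatched kinds, and total log volume.
    return {b: [v["unmatched"], len(v["kinds"]), v["total"]] for b, v in buckets.items()}

The resulting per-interval vectors are the log features that the next step attaches to each service instance's node.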
C. Generating unified graph representation

In the previous subsection, we transferred the log messages into a data format that is similar to metrics. To combine traces with this data format, we need to convert the spans of traces into time-series data [21]. Figure 4 shows the trace parsing procedure. We aggregate all spans by time interval, which is the same as the metrics' sample interval. Then, the aggregated traces construct the dependency graph for this time interval. Every service instance is a node of the graph, and the node's hidden features are this instance's metrics and log features. Besides, we also extract the average response time of every service instance from the traces and embed it into the node's hidden features.

[Fig. 4: The procedure of generating graph representation — the spans of each time interval Time_i are grouped into a service dependency graph whose nodes carry that instance's metrics (m_1, m_2, ..., m_n) and log features.]

In our approach, the traces are merged into a directed attributed graph G = (V, E). To better capture the temporal relationship, the node set generated from the metric and log features, V = {v_i^t | t = 1, ..., τ, i = 1, ..., N}, contains all N service instances for all τ continuous time intervals before and after the current time. We construct the directed graph by merging all τ time intervals' spans.
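As an illustration of this construction, the sketch below merges the spans of one time interval with the instances' metrics and log features into a directed attributed graph. The span and metric field names are assumptions about the data schema, not Triple's exact format; in Triple, τ such interval graphs are further merged to form the final input.

import numpy as np

def build_interval_graph(spans, metrics, log_feats, instances):
    """Build the directed attributed graph of one time interval.

    spans:      list of {"caller": str, "callee": str, "duration_ms": float}
    metrics:    dict instance -> list of metric values [m_1, ..., m_n]
    log_feats:  dict instance -> list of log features from Section III-B
    instances:  ordered list of service instance names (the graph nodes)
    """
    idx = {name: i for i, name in enumerate(instances)}
    n = len(instances)

    adj = np.zeros((n, n), dtype=np.float32)     # one directed edge caller -> callee per span
    resp_sum = np.zeros(n)
    resp_cnt = np.zeros(n)
    for s in spans:
        if s["caller"] in idx and s["callee"] in idx:
            adj[idx[s["caller"]], idx[s["callee"]]] = 1.0
        if s["callee"] in idx:
            resp_sum[idx[s["callee"]]] += s["duration_ms"]
            resp_cnt[idx[s["callee"]]] += 1

    # Node features: metrics + log features + average response time of the instance.
    avg_resp = resp_sum / np.maximum(resp_cnt, 1)
    feats = np.stack([
        np.asarray(metrics[name] + log_feats[name] + [avg_resp[idx[name]]], dtype=np.float32)
        for name in instances
    ])
    return adj, feats

Stacking the node-feature matrices of τ consecutive intervals yields the length-τ sequence consumed by the temporal blocks described next.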
D. Model Training

In the scenario of microservice anomaly detection, the majority of the training dataset consists of normal data. Therefore, our model aims to describe the training data, which is considered to belong to the same class. In this task, we need to learn the boundary that can separate most of the normal data from other outliers.

Triple leverages STGCN to construct the latent representation of the input features and Deep SVDD [30] to learn the latent representations of the data together with the one-class classification objective. STGCN leverages a graph convolutional network (GCN) and a convolutional neural network (CNN) to capture the spatial and temporal features. Deep SVDD aims to employ a deep learning model (STGCN) that is jointly trained to map the data into a hypersphere. The hypersphere's boundary helps us classify the normal data and other outliers. Existing anomaly detection methods based on deep learning usually leverage the reconstruction error [23] or the anomalous value deviation [15]. Different from them, Deep SVDD learns the latent features of the data and describes the boundary to perform anomaly detection. The original Deep SVDD leverages a CNN to learn the latent features and map the data into a hypersphere. However, this is not enough to explore the three data sources in our scenario. Therefore, we provide Triple, which leverages the STGCN to capture the three data sources from the spatial and temporal dimensions and jointly trains with Deep SVDD.

In our approach, we choose the STGCN to extract the latent features. An STGCN consists of a Temporal Block, a Spatial Block and an Output Block. Fig. 5 shows the full architecture of Triple's STGCN module. It leverages three temporal blocks to extract time-series features. Every temporal block is followed by one batch-normalization (BN) layer to maintain the consistency of the data distribution. Then, the input is processed by three spatial blocks that are composed of 1-D graph convolutions. Finally, we leverage an MLP composed of two fully-connected layers to output the hidden representation of the input features. We introduce each block in detail in the following part.

[Fig. 5: The Architecture of STGCN — the input graph passes through temporal Gated-Conv blocks with BN, spatial Graph-Conv blocks, and an MLP that outputs the hidden features.]

1) Temporal Block: Although LSTM-based models are effective in time-series analysis for anomaly detection [15], they require time-consuming iterations, which are slow to respond to dynamic changes. On the contrary, CNN has the advantage of faster training and requires no dependency constraints on previous steps. Therefore, the temporal block leverages a 1-D convolution with the gated linear unit (GLU) to capture the time-series relationship from node features. Figure 6 shows how the 1-D convolution works and captures the relationship between different time intervals for every metric. The temporal convolution layer contains a 1-D convolution with a width-K_t kernel followed by gated linear units (GLU) for non-linearity. For each node in graph G, the 1-D temporal convolution explores K_t time intervals of the input elements without padding. The 1-D convolution captures the hidden relationship between different time intervals. Thus, the node features of the input can be regarded as a length-τ sequence with M features and C_i channels, X ∈ R^(M × τ × C_i), and the convolution kernel is Γ ∈ R^(K_t × C_i × C_o). The temporal gated convolution is defined in Eq. 1:

Γ(X) = P ⊙ σ(Q) ∈ R^((τ − K_t + 1) × C_o)    (1)

where P, Q are the inputs of the gates in GLU; ⊙ denotes the element-wise Hadamard product; σ is the sigmoid function.

2) Spatial Block: In order to detect anomalies from the spatial domain, we apply the graph convolutional network to learn features. Given a directed graph G = (V, E), the set V contains N nodes and E represents the set of edges. We calculate the graph Laplacian L = D − W and the eigendecomposition L = U Λ U^T, where D_ii = Σ_j E_ij is a diagonal matrix whose i-th diagonal element is the degree of node i, and U is an orthogonal matrix.

A signal on graph G of N nodes can be described as a matrix X ∈ R^(N × C_in) consisting of C_in vectors of size N. The graph convolution kernel is expressed as:

Θ(L)X = Σ_{i ∈ [1, C_in], k ∈ [1, K]} θ_{ijk} L^k x_i ,   j = 1, 2, ..., C_out    (2)

where Θ ∈ R^(K × C_in × C_out) and K is the kernel size of the graph convolution, which determines the maximum radius of the convolution from central nodes.

3) Output Block: The Output Block aims to output the hidden feature vector of the graph representation. We design the Output Block to feed the features H to a Multi-Layer Perceptron (MLP) as in Eq. 3:

h_o = MLP(H)    (3)
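The PyTorch sketch below illustrates one temporal gated-convolution block (Eq. 1), a simplified spatial propagation step in the spirit of Eq. 2, and the MLP output of Eq. 3. The layer sizes, the single block per stage, and the row-normalized adjacency used instead of a full Chebyshev polynomial expansion are simplifying assumptions, not Triple's exact configuration.

import torch
import torch.nn as nn

class TemporalGatedConv(nn.Module):
    """1-D temporal convolution with GLU gating over a length-τ sequence (Eq. 1)."""
    def __init__(self, c_in, c_out, kt):
        super().__init__()
        # Produce P and Q in one convolution; no padding, so τ shrinks to τ - kt + 1.
        self.conv = nn.Conv2d(c_in, 2 * c_out, kernel_size=(kt, 1))

    def forward(self, x):                      # x: (batch, c_in, τ, num_nodes)
        p, q = self.conv(x).chunk(2, dim=1)    # each: (batch, c_out, τ - kt + 1, nodes)
        return p * torch.sigmoid(q)            # Γ(X) = P ⊙ σ(Q)

class GraphConv(nn.Module):
    """Simplified spatial step: propagate node features along a normalized adjacency."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.lin = nn.Linear(c_in, c_out)

    def forward(self, x, adj):                 # x: (batch, c, τ', nodes), adj: (nodes, nodes)
        deg = adj.sum(-1).clamp(min=1.0)
        a_hat = adj / deg.unsqueeze(-1)        # row-normalized adjacency as a cheap Laplacian-like operator
        x = torch.einsum("bctn,nm->bctm", x, a_hat)
        return torch.relu(self.lin(x.transpose(1, 3)).transpose(1, 3))

class STGCNSketch(nn.Module):
    def __init__(self, c_in, c_hid, c_out, kt):
        super().__init__()
        self.temporal = TemporalGatedConv(c_in, c_hid, kt)
        self.bn = nn.BatchNorm2d(c_hid)
        self.spatial = GraphConv(c_hid, c_hid)
        self.mlp = nn.Sequential(nn.Flatten(), nn.LazyLinear(128), nn.ReLU(), nn.Linear(128, c_out))

    def forward(self, x, adj):
        h = self.bn(self.temporal(x))
        h = self.spatial(h, adj)
        return self.mlp(h)                     # h_o = MLP(H), the graph's hidden vector (Eq. 3)

A batch of τ-interval graphs with their shared adjacency is thus mapped to the hidden vectors h_o that the Deep SVDD module consumes.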
[Fig. 6: The 1-D temporal convolution — a width-K_t kernel slides over the τ time frames of each of the M metrics.]

4) Deep SVDD Module: In the previous sections, we introduced how our STGCN model learns hidden features from the spatial and temporal dimensions. The Deep SVDD module is leveraged to optimize our STGCN with the following loss function, which expresses the objective of learning a minimal hypersphere that encloses the hidden feature vectors of the graph representation:

Loss = R^2 + (1/µ) max{0, ||h_o − c||^2 − R^2} + (λ/2) Σ_{l=1}^{N_l} ||W^l||_F^2    (4)

where R is the radius of the hypersphere; c is the center of the hypersphere; ||h_o − c||^2 is the distance from the latent representation of a unified graph to c; N_l is the number of network layers in the STGCN; the hyperparameter µ controls the trade-off between the hypersphere volume and violations of the boundary, allowing some data to be mapped outside the hypersphere; (λ/2) Σ_{l=1}^{N_l} ||W^l||_F^2 is a regularizer for the STGCN's parameters W with hyperparameter λ.

Optimizing objective 4 lets the network learn parameters W such that data points are closely mapped to the center c of the hypersphere. We leverage Adam [19] to optimize the STGCN's parameters W, the radius R and the center c. We train the STGCN parameters W for k epochs while the radius R is fixed. Then, after every k-th epoch, we recalculate the radius R from the data representations produced by the STGCN with the latest parameters W. Each time, R is chosen so that it contains the (1 − µ) percentile of all training data according to their distances. The hypersphere center c is set to the mean of the vector representations of all training data after one initial forward pass.

E. Anomaly Detection

Given a new time interval's traces, logs and metrics, Triple follows the same data processing procedure as in the training phase and generates the unified graph representation. Then, Triple feeds the graph to the STGCN to generate the latent representation, and Deep SVDD calculates the distance between the latent representation and the hypersphere center. The calculation procedure is shown in Eq. 5:

Distance(h_o) = ||h_o − c||^2 − R_f^2    (5)

where h_o is the latent representation vector; c is the center of the hypersphere; R_f is the final radius of the hypersphere. In fact, when the data's latent representation vector lies outside the hypersphere's boundary, i.e., the distance is greater than 0, we treat the data as anomalous.
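Below is a compact sketch of the soft-boundary objective of Eq. 4 and the scoring rule of Eq. 5, written against the STGCN sketch above. The alternating schedule (k epochs of weight updates, then a radius update at the (1 − µ) distance quantile) follows the description in Section III-D, while the optimizer settings, the batch-wise center initialization and the quantile implementation are illustrative assumptions.

import torch

def svdd_loss(h, center, radius, model, mu=0.05, lam=1e-4):
    """Soft-boundary Deep SVDD loss (Eq. 4) over a batch of hidden vectors h."""
    dist = torch.sum((h - center) ** 2, dim=1)
    penalty = torch.clamp(dist - radius ** 2, min=0).mean() / mu
    reg = sum(torch.sum(w ** 2) for w in model.parameters())
    return radius ** 2 + penalty + 0.5 * lam * reg

def update_radius(all_sq_dists, mu=0.05):
    """Radius enclosing the (1 - mu) fraction of training points (distance quantile)."""
    return torch.quantile(torch.sqrt(all_sq_dists), 1 - mu)

def anomaly_score(h, center, radius_f):
    """Eq. 5: a positive score means the point lies outside the hypersphere => anomalous."""
    return torch.sum((h - center) ** 2, dim=1) - radius_f ** 2

def train(model, loader, adj, epochs=100, k=5, mu=0.05, lr=1e-3, device="cpu"):
    """Alternating schedule: update weights for k epochs, then refresh the radius."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    center, radius = None, torch.tensor(0.0, device=device)
    for epoch in range(epochs):
        sq_dists = []
        for x in loader:                                   # x: (batch, c_in, τ, nodes)
            h = model(x.to(device), adj.to(device))
            if center is None:                             # center ≈ mean of an initial pass (full-data mean in the paper)
                center = h.detach().mean(dim=0)
            loss = svdd_loss(h, center, radius, model, mu)
            opt.zero_grad(); loss.backward(); opt.step()
            sq_dists.append(torch.sum((h.detach() - center) ** 2, dim=1))
        if (epoch + 1) % k == 0:                           # radius refresh every k epochs
            radius = update_radius(torch.cat(sq_dists), mu)
    return center, radius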
F. Interpreter

Given a trained deep-learning model and a specific anomaly instance, our Interpreter generates an explanation by identifying the most important features. The Interpreter aims to confirm the important features by maximizing the similarity between the perturbed result and the original result. Intuitively, the Interpreter first designs a soft mask matrix and combines it with the original features. Then, the Interpreter calculates the loss between the masked result and the original prediction result. The loss updates the mask matrix by gradient descent to make the masked features' prediction result more similar to the original result. When the loss converges, the soft masks record every feature's importance.

Given the original graph dependency relationships G_C and the graph's node features X_C = {m_tij | t = 1, 2, ..., τ, i = 1, 2, ..., M, v_j ∈ G_C}, we aim to find the most important subgraph G_S ⊆ G_C and identify the associated features X_S = {m_tik | t = 1, 2, ..., τ, i = 1, 2, ..., M, v_k ∈ G_S}. For example, consider the situation where a CPU anomaly occurs at service v_k. If the Interpreter removes an edge between v_k and v_j and this decreases the probability of the anomaly prediction, the edge is a good counterfactual explanation for the current prediction. Similarly, if weakening the feature m_t'i'j' ∈ X also decreases the probability of the current prediction, then m_t'i'j' is a good counterfactual explanation. We leverage mutual information (MI) to formulate the optimization framework in Eq. 6:

max_{G_S} MI(Y, (G_S, X_S)) = H(Y) − H(Y | G = G_S, X = X_S)    (6)

The MI quantifies the change of the prediction when the graph dependency relationships and features are changed to the sub-dependency and sub-features.

1) The Mask of Graph Dependency: The graph mask M_G ∈ R^(N × N) is combined with the computation graph's adjacency matrix as A ⊙ σ(M_G), where ⊙ denotes element-wise multiplication. We optimize the graph mask matrix by gradient descent as follows:

min_{M_G} −1[y = 1] log P_Φ(Y = y | G = A ⊙ σ(M_G), X = X_S)    (7)

We evaluate the dependency importance by the graph mask. Every dependency's weight in the graph mask is ranked.

2) The Mask of Node Features: To identify which node features impact the prediction result, we leverage the feature mask M_F to retain the important features. Intuitively, the mask of node features is combined with the node features and updated by gradient descent as follows:

min_{M_F} −1[y = 1] log P_Φ(Y = y | G = G_S, X = X ⊙ σ(M_F))    (8)

We initialize the feature mask using the normal distribution. After finishing the update procedure of the masks, we rank every feature's importance by the mask matrix.

Different from traditional classification tasks, our model performs unsupervised anomaly detection, which needs to describe the hypersphere and detect the outliers. So we modify the loss function to calculate the distance between the original prediction coordinate (the latent feature representation) and the perturbed coordinate. Therefore, we calculate the objective as follows:

min_{M_total} −1[y = 1] log P_Φ(Distance(h, ĥ) | G = A ⊙ σ(M_G), X = X ⊙ σ(M_F))    (9)

where Distance(h, ĥ) = ||h − ĥ||^2; h and ĥ are the latent feature representations in the hypersphere; y = 1 represents the anomalous label.
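The following sketch shows how the two soft masks could be optimized jointly in the spirit of Eq. 9, using the squared distance between the perturbed and original latent representations as the loss and reusing the model sketched above. The number of steps, the learning rate and the random mask initialization are illustrative assumptions about details the paper does not fix.

import torch

def explain_instance(model, x, adj, steps=200, lr=0.05):
    """Learn soft masks over edges and node features for one anomalous graph instance.

    x:   node feature tensor of shape (1, c_in, τ, nodes)
    adj: adjacency matrix of shape (nodes, nodes)
    Returns edge-importance and feature-importance scores in [0, 1].
    """
    h_orig = model(x, adj).detach()                             # original latent representation

    mask_g = torch.randn_like(adj, requires_grad=True)          # graph-dependency mask M_G
    mask_f = torch.randn_like(x, requires_grad=True)            # node-feature mask M_F
    opt = torch.optim.Adam([mask_g, mask_f], lr=lr)

    for _ in range(steps):
        masked_adj = adj * torch.sigmoid(mask_g)                # A ⊙ σ(M_G)
        masked_x = x * torch.sigmoid(mask_f)                    # X ⊙ σ(M_F)
        h_pert = model(masked_x, masked_adj)
        # Keep the perturbed latent representation close to the original one (Eq. 9 in spirit).
        loss = torch.sum((h_pert - h_orig) ** 2)
        opt.zero_grad(); loss.backward(); opt.step()

    edge_importance = torch.sigmoid(mask_g).detach()
    feature_importance = torch.sigmoid(mask_f).detach()
    return edge_importance, feature_importance

Ranking the entries of edge_importance and feature_importance then gives the important dependencies, metrics and log features that the interpretable report presents.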
3) Interpretable Analysis: When an anomaly is predicted, an alert is sent to engineers so that they can adopt some proactive actions in advance. To facilitate engineers in handling the warning, Triple provides an interpretable report about the alert derived from the original mask matrix scores. Due to the space limit, we omit the detailed interpretable report cases, which can be found in our appendix. Different from traditional interpretable results [29], [39], we design the Translation Logs Module for our interpreter to improve the readability of log features. As shown in Figure 7, when the interpreter marks the log-related features of a node, the Translation Logs Module is triggered and picks up the unmatched templates from the current online log templates. The unmatched templates are written into the related-logs part of the interpretable report.

[Fig. 7: The Translation Logs Module — the top-ranked node features (e.g., CPU, network, fs read) trigger a lookup of the current time's logs against the offline normal log template base; unmatched templates such as "socket.gaierror: [Errno <*>] Temporary failure in name resolution" and "socket.gaierror: [Errno <*>] Name or service not known" are attached to the related-logs part of the report.]

Therefore, with an interpretable report, engineers not only benefit from early alerts of incidents, but also gain some inspiration for incident diagnosis from the three data sources' features.

IV. EVALUATION

A. Experiment Dataset

We rely on two datasets in our evaluation: (i) the Hipstershop system that is constructed by China Construction Bank [4]; (ii) an open-source microservice benchmark system for the Train-Ticket service [44].

Dataset A is constructed by China Construction Bank based on the Hipstershop system [1]. The dataset comes from the AIOPS 2022 [4] international challenge championship. They deploy three sub-systems on Kubernetes [6], and the faults are randomly injected into the three sub-systems, which contain 120 instances. The faults include 9 different types, and we show the fault distribution in Table I.

Dataset B is taken from an open-source microservice benchmark system for the train ticket service. The system is widely used in existing works [21], [44], [45] and contains 41 microservices. We build the system and inject 175 faults related to microservices.

TABLE I: Overview of fault types of the datasets

Dataset    Fault type                   Cases   Percent
Dataset A  File system Read              26      12%
           File system Write             11       5%
           Network Delay                 29      14%
           Network Packet Corruption     31      14%
           Network Packet Duplicate      22      10%
           Network Drop Packet           28      13%
           CPU load                      27      13%
           Memory load                   25      12%
           Container Pause               15       7%
Dataset B  Network delay                 56      34%
           CPU                           51      31%
           Container Pause               58      35%

B. Experiment Settings

We implement Triple using Python 3.7 and PyTorch 1.13.0. All the experiments are conducted on a server with a 3070Ti GPU, 32GB RAM and a 5800X processor with 6 cores. The detailed settings of Triple are as follows.

Following existing work [37], [40], [42], [43], we use the three following metrics to evaluate the model's effectiveness:
• Precision measures how many returned results of abnormal data are correct.
• Recall measures how many relevant results are returned.
• F1-Score combines the precision and recall of a model into a single metric by taking their harmonic mean.

Besides, we design a new evaluation method to confirm that our interpreter gives the best interpretation result compared with other interpreters:
• Label flipping ratio (LFR): replace the picked important features with historical normal values and test whether the anomaly instance recovers to the normal status.
• Overhead: the time consumed by the interpreter.

1) The Model Effectiveness Comparison: Table II shows the detailed effectiveness of all anomaly detection algorithms. Across the 214 faults in the open-source dataset A, Triple outperforms all other algorithms in anomaly detection. Most approaches perform badly on this dataset, since it has nine kinds of fault types and 120 service instances. These fault types are hard to fully detect from a single source. That is also why other algorithms have poor performance on the dataset.
TABLE II: Overview comparison of anomaly detection approaches

Dataset  Algorithm                  Precision  Recall  F1-Score
A        Triple(Trace+Log+Metric)     0.89      0.87     0.88
         Triple(Trace+Log)            0.74      0.68     0.66
         Triple(Trace+Metric)         0.80      0.77     0.78
         DeepLog                      0.54      0.49     0.51
         SCWarn                       0.80      0.81     0.80
         TraceAnomaly                 0.65      0.63     0.64
         DeepTraLog                   0.76      0.73     0.74
B        Triple(Trace+Log+Metric)     0.97      0.98     0.97
         Triple(Trace+Log)            0.91      0.95     0.93
         Triple(Trace+Metric)         0.94      0.95     0.94
         DeepLog                      0.78      0.76     0.77
         SCWarn                       0.93      0.94     0.93
         TraceAnomaly                 0.85      0.84     0.84
         DeepTraLog                   0.89      0.88     0.84

[Fig. 8: The training time for every algorithm — bar chart over Ours, DeepTraLog, SCWarn, TraceAnomaly and DeepLog; Training Time(s) values ranging from about 894 s to 2763 s.]

[Figure: per-epoch training time with and without GPU — Epoch Time(s), w/ GPU vs. w/o GPU.]

Different from dataset A, due to the simple fault types and fewer service instances in dataset B, all anomaly detection
[Fig. 10: The testing time for every algorithm — log-scale Time(s) for total time and once-response time across Ours, DeepTraLog, DeepLog, SCWarn and TraceAnomaly; total times range from about 1.6 s to 97.5 s and single-response times from about 0.02 s to 0.062 s.]

[Fig. 11: The training epochs for convergence — convergence ratio versus epoch (0–400); the ratio approaches 1.0 around 300 epochs.]

consuming that all data are predicted. We replay 15 minutes of data, including 5 minutes of abnormal data and 10 minutes of normal data. Triple and SCWarn generate 30 time intervals of data and detect whether these intervals contain anomalies. TraceAnomaly, DeepTraLog and DeepLog detect every trace's and log's anomaly. The testing time of DeepTraLog highly depends on the size of the trace, i.e., the number of span/log events in it. We follow the detailed evaluation procedure in the DeepTraLog paper [37], which calculates the average time of prediction.

Figure 10 shows the detailed results of response time and total testing time consumption for every algorithm. Though the response time is similar among all algorithms, Triple has a great advantage in total time consumption, because the trace and log volumes seriously impact the total time consumption of these algorithms. That means Triple is better suited for tremendous systems that can generate gigabytes (GB) of traces and logs per minute or even per second.

3) The Interpreter Comparison: We compare our interpreter with the GAT and GRAD methods. The GAT baseline originates from the graph attention network [33] and ASTGCN [17], which learn attention weights for edges in the dependency graph; we use these weights as a proxy measure of edge importance. GRAD is a gradient-based method: we compute the gradient of our model's loss function with respect to the adjacency matrix and the associated node features, similar to the saliency mapping approach. Because our method needs to be trained until convergence, Figure 11 reports the average convergence epochs over 120 faults in dataset A. When the training reaches 300 epochs, more than 99% of the faults have converged. So we use the time consumption of 300 epochs of our model as its Overhead. Table III shows the comparison results of LFR and Overhead in detail. Different from our interpreter, GAT and GRAD generate nearly no overhead and can give the interpretable result immediately, because they leverage the gradient or attention weight to rank the features. However, GRAD is based on the heuristic assumption that the gradient reflects the feature importance, which may not be true. GAT does not consider node features and can only explain edges, which means engineers cannot understand the anomaly from specific metric and log variations. The lowest LFR results among these methods also prove their problem. In addition, compared with Triple, the GRAD and GAT interpretable reports also lose the information of node features (related metrics and related logs) or the dependency graph (related dependencies).

TABLE III: Overview comparison of interpreters

Dataset    Algorithm  LFR    Overhead (seconds)
Dataset A  Triple     0.89   5.95
           GAT        0.78   0.34
           GRAD       0.67   0.42
Dataset B  Triple     0.85   4.95
           GAT        0.74   0.28
           GRAD       0.84   0.40

V. DISCUSSION

A. The Log Template Change of Service

Though Triple is the first method to perform anomaly detection with three data sources and achieves better performance than other SOTA methods, it still cannot leverage log messages well enough. When a service's log statement is updated and generates a new log template, our method may judge the new template as an anomaly and cause a false positive. In future work, we hope to improve the adaptivity of our algorithm and avoid such false positives.

B. The Static Graph Pattern

Though Triple has brilliant precision for anomaly detection, it is constrained by static graph patterns. When the microservice system adds new services, the STGCN must retrain on the new graph pattern, which increases the overhead. In future work, we will try to modify our model to fit dynamic graph patterns.

VI. CONCLUSION

In this paper, we propose Triple, an anomaly detection approach for microservice systems based on a deep learning model. We first leverage three data sources to generate the unified graph representation. Then, we design an STGCN-based Deep SVDD model which can extract and learn the latent features from traces, logs and metrics. The Deep SVDD constructs a data-enclosing hypersphere and confirms the anomalous data according to the hypersphere's boundary. We evaluate Triple on an open-source dataset that is constructed by China Construction Bank based on the Hipstershop system. The results show that our approach outperforms SOTA anomaly detection algorithms. In the future, we will go a step further to promote the precision and support more fault types.
VII. ACKNOWLEDGMENT

This work was supported in part by the National Key R&D Program of China (2022YFB3103000), the Natural Science Foundation of China (U20A20180, 62072437), and the CAS-Austria Joint Project (171111KYSB20200001).

REFERENCES

[1] Hipster-Shop, October 2019. https://github.com/GoogleCloudPlatform/microservices-demo.
[2] Elasticsearch, October 2022. https://www.elastic.co/.
[3] High-quality, ubiquitous, and portable telemetry to enable effective observability, October 2022. https://opentelemetry.io/.
[4] International AIOps Challenge, March 2022. http://iops.ai/.
[5] Jaeger: open source, end-to-end distributed tracing, October 2022. https://www.jaegertracing.io/.
[6] Kubernetes: Production-Grade Container Orchestration, October 2022. https://kubernetes.io/.
[7] Prometheus, October 2022. https://prometheus.io/.
[8] Apache SkyWalking, October 2022. http://skywalking.apache.org/.
[9] Triple's source code, October 2022. https://github.com/TripleResearcher/Triple.
[10] Federico Baldassarre. Explainability techniques for graph convolutional networks. arXiv preprint arXiv:1905.13686, 2019.
[11] Betsy Beyer and Chris Jones. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc., 2016.
[12] Junjie Chen and Xiaoting He. An empirical investigation of incident triage for online service systems. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE-SEIP), pages 111–120. IEEE, 2019.
[13] Junjie Chen, Xiaoting He, Qingwei Lin, et al. Continuous incident triage for large-scale online service systems. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 364–375. IEEE, 2019.
[14] Yujun Chen, Yang, et al. Identifying linked incidents in large-scale online service systems. In Proceedings of the 28th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 304–314, 2020.
[15] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017.
[16] Yu Gan. Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 19–33, 2019.
[17] Shengnan Guo, Youfang Lin, et al. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929, 2019.
[18] Pinjia He, Jieming Zhu, and Zibin Zheng. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE International Conference on Web Services (ICWS), pages 33–40. IEEE, 2017.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] Yujia Li. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[21] Zeyan Li, Junjie Chen, et al. Practical root cause localization for microservice systems via trace analysis. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), pages 1–10. IEEE, 2021.
[22] Zeyan Li, Wenxiao Chen, and Dan Pei. Robust and unsupervised KPI anomaly detection based on conditional variational autoencoder. In 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), pages 1–9. IEEE, 2018.
[23] Ping Liu, Haowen Xu, and Qianyu Ouyang. Unsupervised detection of microservice trace anomalies through service-level deep Bayesian networks. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), pages 48–58. IEEE, 2020.
[24] Dongsheng Luo, Wei Cheng, et al. Parameterized explainer for graph neural network. Advances in Neural Information Processing Systems, 33:19620–19631, 2020.
[25] Ajay Mahimkar. Rapid detection of maintenance induced changes in service performance. In Proceedings of the Seventh Conference on Emerging Networking Experiments and Technologies, pages 1–12, 2011.
[26] Sonu Mehta and Ranjita Bhagwan. Rex: Preventing bugs and misconfiguration in large services using correlated change analysis. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 435–448, 2020.
[27] Animesh Nandi and Atri Mandal. Anomaly detection using program control flow graph mining from execution logs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 215–224, 2016.
[28] Phillip E. Pope and Soheil Kolouri. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10772–10781, 2019.
[29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[30] Lukas Ruff, Robert Vandermeulen, and Nico Goernitz. Deep one-class classification. In International Conference on Machine Learning, pages 4393–4402. PMLR, 2018.
[31] Michael Sejr Schlichtkrull, Nicola De Cao, and Ivan Titov. Interpreting graph neural networks for NLP with differentiable edge masking. arXiv preprint arXiv:2010.00577, 2020.
[32] Ramprasaath R. Selvaraju, Michael Cogswell, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[33] Petar Veličković et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[34] Zhitao Ying, Dylan Bourgeois, et al. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[35] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[36] Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji. Explainability in graph neural networks: A taxonomic survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[37] Chenxi Zhang, Xin Peng, and Chaofeng Sha. DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning. 2022.
[38] Mengshi Zhang and Yuqun Zhang. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 132–142. IEEE, 2018.
[39] Nengwen Zhao, Junjie Chen, et al. Real-time incident prediction for online service systems. In Proceedings of the 28th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 315–326, 2020.
[40] Nengwen Zhao, Junjie Chen, and Zhou Wang. Real-time incident prediction for online service systems. In Proceedings of the 28th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 315–326, 2020.
[41] Nengwen Zhao, Junjie Chen, Zhaoyang Yu, et al. Identifying bad software changes via multimodal anomaly detection for online service systems. In Proceedings of the 29th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 527–539, 2021.
[42] Nengwen Zhao, Panshi Jin, Lixin Wang, et al. Automatically and adaptively identifying severe alerts for online service systems. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 2420–2429. IEEE, 2020.
[43] Nengwen Zhao, Jing Zhu, and Yao Wang. Automatic and generic periodicity adaptation for KPI anomaly detection. IEEE Transactions on Network and Service Management, 16(3):1170–1183, 2019.
[44] Xiang Zhou, Xin Peng, et al. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study. IEEE Transactions on Software Engineering, 47(2):243–260, 2018.
[45] Xiang Zhou, Xin Peng, et al. Latent error prediction and fault localization for microservice applications by learning from system trace logs. In Proceedings of the 2019 27th ACM Joint Meeting on the Foundations of Software Engineering (FSE), pages 683–694, 2019.