ADA Adaptive Deep Log Anomaly Detector
Yali Yuan∗† , Sripriya Srikant Adhatarao∗† , Mingkai Lin‡ , Yachao Yuan† , Zheli Liu§ , and Xiaoming Fu†
† University of Göttingen, Germany. Email:{yali.yuan, adhatarao, fu}@cs.uni-goettingen.de, [email protected]
‡ Nanjing University, Nanjing, China. Email: [email protected]
§ Nankai University, Tianjin, China. Email: [email protected]
∗ Both authors have equal contribution.

Abstract: Large private and government networks are often subjected to attacks like data extrusion and service disruption. Existing anomaly detection systems use offline supervised learning and employ experts for labeling. Hence they cannot detect anomalies in real time. Even though unsupervised algorithms are increasingly used nowadays, they cannot readily adapt to newer threats. Moreover, many such systems also suffer from high storage cost and require extensive computational resources. In this paper, we propose ADA: Adaptive Deep Log Anomaly Detector, an unsupervised online deep neural network framework that leverages LSTM networks and regularly adapts to newer log patterns to ensure accurate anomaly detection. In ADA, an adaptive model selection strategy is designed to choose pareto-optimal configurations and thereby utilize resources efficiently. Further, a dynamic threshold algorithm is proposed to dictate the optimal threshold based on recently detected events to improve the detection accuracy. We also use the predictions to guide storage of abnormal data and effectively reduce the overall storage cost. We compare ADA with state-of-the-art approaches using the Los Alamos National Laboratory cyber security dataset and show that ADA accurately detects anomalies with a high F1-score of ~95%, is 97 times faster than existing approaches, and incurs very low storage cost.

Index Terms: Anomaly detection, deep neural networks, logs, online training, unsupervised, log-normal, threshold

I. INTRODUCTION

To ensure the security of any online application, it is essential for organizations to detect and mitigate malicious activities on their networks, such as unauthorized access, malware, port scanning, etc. These attacks may allow unauthorized access to the network and inflict further damage like compromising credentials, violating intellectual property rights, etc. Such attacks may even expose business-sensitive information including confidential documents of government agencies, resulting in serious security breaches [1]. The reported losses [2], [3] incurred due to the security breaches at RSA in 2011 and Target Corporation in 2013 were $66 million and $248 million, respectively. Therefore, system logs are generally used to periodically record states of the systems and any significant events at various significant points. Nearly all computer systems today collect and maintain such system-wide log data. These logs are used by experts to diagnose any suspicious behaviours like system failures and unauthorized authentication events. Analyzing system logs is essential to discover the root cause of any problem and potential security breaches. Hence system logs are a valuable source of information and they are crucial for monitoring online applications and detecting anomalies.

Log data analysis and storage are both essential but expensive operations, especially in large-scale networks. Due to the rapidly increasing number of internet applications and users, the size of log data collected by a system running on even a medium-sized network can grow beyond terabytes per day. Until recently, about 500 MB of logs per day was considered a normal volume in small businesses; today, 5 GB per day is not unusual for such environments. Large organizations can easily produce logs that are orders of magnitude larger than this. With this much information being generated, there is a need for an efficient strategy to store, analyze and manage the system logs. Studies in [4] show that an application that runs on 51,000 Amazon EC2 instances and publishes five custom metrics will incur a charge of $31,646.40 per month [5]. Therefore, it is challenging and nearly impossible to do manual analysis in real time, and traditional approaches such as mining have proven to be ineffective. Besides, since log data mainly follows a time-series distribution, it is subject to rapid updates. Therefore, obtaining labeled log data for any application area of interest is often difficult, and such data is mostly unbalanced or system specific and hence needs to be pre-processed before analysis.

In addition, obtaining a large-scale log anomaly dataset with high-quality ground truth has been an ongoing challenge [6]–[8]. Labelling log anomalies in a dataset requires expert assistance and therefore it is labor intensive and often expensive. Hence, supervised machine learning strategies like [6]–[8] that depend on prior patterns of normal and abnormal behaviours are not suitable for real-time anomaly detection systems.

Many recent works like [1], [9] proposed unsupervised machine learning algorithms for detecting anomalies. However, these approaches train the models offline and hence cannot effectively adapt to the rapidly changing network behaviours over time and learn new threats and vulnerabilities. Hence, models trained offline can soon become outdated and put the system at risk. Recently, Du et al. [4] proposed online log anomaly detection to generate sequences leveraging Long Short-Term Memory (LSTM) [10] or clustering algorithms for detecting Denial of Service attacks. Some others [11]–[13] leveraged LSTM networks to pre-process the sequence of API calls as components in order to detect malware in systems. However, these approaches also suffer from increased latency and pre-processing overhead and thus are not suitable for detecting anomalies rapidly.

Therefore, in this paper we propose ADA: Adaptive Deep Log Anomaly Detector, an efficient unsupervised online learning approach for detecting anomalies in system logs. Unlike recent works, we exploit online deep learning [14] to build deep neural networks on the fly with LSTM using unlabelled log data collected from Los Alamos National Laboratory
Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on June 12,2023 at 07:04:44 UTC from IEEE Xplore. Restrictions apply.
(LANL) Cyber Security Dataset [15]. Further, we also propose an adaptive prediction strategy where the pareto-optimal neural network configurations are selected for the current model from the set of all models obtained by online deep learning. In addition, since the patterns in log data are unstable, we develop a dynamic threshold algorithm which uses the recent predictions from the models to dynamically adapt the threshold for detecting abnormal events using the log-normal distribution. Utilizing the adaptive strategy and the dynamic threshold, ADA always selects the pareto-optimal neural network model with minimal configurations and the current threshold based on the system environment to predict every event in the system logs. The pareto-optimal policy first learns from the predictions obtained for earlier events. Then, it selects a model that optimizes the needed computational resources. The predictions also drive the decision for selectively storing the log data.

Since we utilize online deep learning, any system built using the ADA framework can effectively synchronize with any newly discovered log patterns. Moreover, the system will use models with minimal configurations and predict anomalies with the highest possible accuracy using optimal thresholds. In this work, we build an anomaly detection system based on the ADA framework and perform extensive evaluations using the LANL dataset. We show that online deep learning produces highly accurate models, where we obtain F1-scores in the range of 0.91 - 0.95 with a prediction latency in the range of 14 ms - 37 ms. We further observed that abnormal events are far less frequent than normal events in the system logs, so we propose to store only abnormal events after prediction in order to optimally utilize the storage and reduce the overall storage cost. To the best of our knowledge, we are the first to propose online deep learning for detecting anomalies in log data with pareto-optimal models and dynamic thresholds.

The contributions in this work include:
• A novel unsupervised anomaly detection framework, ADA: Adaptive Deep Log Anomaly Detector, which utilizes online deep learning to build highly accurate models on the fly. ADA also incorporates new log patterns instantly in order to improve anomaly detection.
• An adaptive prediction strategy for selecting the pareto-optimal ADA Event Model (ADA-EM) to optimize the computational resources and improve the latency of anomaly detection.
• A dynamic threshold technique for reexamining and improving the thresholds of neural network models for detecting abnormal events during anomaly detection.
• A proposal to leverage the insight from event predictions to redirect storage decisions in order to reduce the overall cost of storage.
• Extensive evaluations to show the accuracy of the models built with online deep learning and the effectiveness of the adaptive technique and dynamic threshold in reducing the computational resources and latency compared to state-of-the-art approaches.

II. SYSTEM DESIGN

In this section, we introduce the design goals and rationale for design choices in ADA followed by the framework.

A. Design Goals

The design of ADA is driven by the following main goals:
1) Heterogeneous data: System logs from various applications and systems are heterogeneous in nature and vary significantly in terms of their format and collected information. A generalized anomaly detection system should be able to process and analyze any log format.
2) Supervision and Data drift: System behavior and potential threats evolve over time. As a consequence, the system logs and attack models also change. An anomaly detection system should be able to learn newer event patterns automatically even in the absence of expert input and labelled datasets.
3) Accuracy and Threshold: Anomaly detection systems should correctly predict the abnormal events with high accuracy. In addition, they should adapt system thresholds based on recently observed events and system behaviour to clearly distinguish between normal and abnormal events and reduce false negatives.
4) Resource utilization: Since many anomaly detection systems are based on deep neural networks, they demand high computational resources to produce higher accuracy and use parallelization to reduce latency. An efficient anomaly detection system should be able to use modern and established algorithms with minimal resources.
5) Cost: System logs are generated every day and hence the storage cost for these logs also increases with time. An efficient strategy for storage can greatly reduce the overall system cost. Further, many offline algorithms also use increased computational resources to improve accuracy. An anomaly detection system should be able to produce results with a good resource-accuracy trade-off.

B. Architectural Components

Fig. 1(a) shows the ADA framework in detail. The components of this framework are described as follows:
– System Logs: This module stores and feeds the logs to the Feature Vector Generator module. The logs are collected from multiple sources in the system and contain information about different system states and events of interest defined by the system administrators.
– Feature Vector Generator: This module uses the Language Model Processing (LMP) [16] to process incoming log streams and generates feature vectors. The resulting features are stored in the feature vector database.
– Models: This module generates online deep neural networks using ADA-EM shown in Fig. 1(b) for detecting anomalies in system logs.
– Predictor: This module initially uses the benchmark model $m_i$ for predicting the incoming log event and sends the prediction results to the decision module. Based on the prediction and adaptive principle shown in
Algorithm 1 (see §II-E), the corresponding next model will be loaded and used for predicting the next event.
– Decision: This module uses the prediction from the predictor module and decides whether the predicted event is a normal or abnormal event using Algorithm 1.
– Threshold Generator: This module stores the recent normal event losses $\{l_N^1, l_N^2, \cdots, l_N^H\}$ and abnormal event losses $\{l_A^1, l_A^2, \cdots, l_A^K\}$ for each model $m_i$. Using the dynamic threshold computation (see §III), we obtain the current threshold for every model.

C. Tokenization

In order to readily use arbitrary log formats, we treat every line in the log file as a sequence of tokens. In this work, we consider word-level tokenization granularity in log files. For tokenizing the words, we assume that the tokens in every line in the log are delimited by some known characters (e.g., a space, a comma or a period). We split every line in the log file based on the delimiter and define a shared vocabulary of “words” over all fields in the log file. Essentially, the vocabulary is composed of the most frequently appearing tokens in the system logs. Further, we treat any missing data as a single feature and use the character “?” to represent it.

D. ADA Event Model (ADA-EM)

In order to efficiently implement the ADA framework, we also propose ADA-EM, shown in Fig. 1(b), leveraging LSTM networks. The LSTM architecture was first introduced for machine translation in [17] and since then it has been extensively applied in language processing applications. Therefore, we train the LSTM-based ADA-EM to process instances of normal time-series in the system logs. Specifically, ADA-EM takes the embedded sequences of tokens as input and outputs the distribution for the next token.

We begin with one-hot embedding, which produces unique embedding vectors for every token in the vocabulary set. More specifically, suppose we have a line in the log file with $T$ tokens, and we can describe it as $W = \{w_0, w_1, \cdots, w_T\}$, where $w_0$ represents the starting flag $\langle b \rangle$ followed by $T$ tokens of the log line. The layer of one-hot embedding processes the time-series log-line input with $T + 1$ tokens $W$ into one-hot vectors $X = \{x_0, x_1, \cdots, x_{T+1}\}$.

The one-hot embedding is followed by the sequence encoding. Among such a large amount of normal events in logs, there exists potential sequential information for language models. To capture this sequential information, an improved LSTM is used in ADA-EM, which is well suited for this task. When applying LSTM recursively to a line in the event log from left to right, the sequence representation will be enhanced progressively with the information from subsequent tokens in this sequence.

Based on the input one-hot vectors, LSTM produces the summary of the past input sequences through the cell state vector $c_t$. Given $X$, $y_t$ is the hidden state of the LSTM cell at time $t$, which can help to achieve the desired log event prediction. After $T$ times of recursive updates from Eqs. (1) to (5), the hidden representation at token $t$, namely $y_t$, gives a better representation of each token.

$$y_t = o_t \circ \tanh(c_t), \tag{1}$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} x_t + W_{yc} y_{t-1} + b_c), \tag{2}$$
$$o_t = \sigma(W_{xo} x_t + W_{yo} y_{t-1} + b_o), \tag{3}$$
$$i_t = \sigma(W_{xi} x_t + W_{yi} y_{t-1} + b_i), \tag{4}$$
$$f_t = \sigma(W_{xf} x_t + W_{yf} y_{t-1} + b_f). \tag{5}$$

After that, a fully connected layer with a final output unit of vocabulary size produces the output of the model. Once the hidden representations $Y = \{y_1, y_2, \cdots, y_{T+1}\}$ are obtained, we input these vectors into a Multi-Layer Perceptron (MLP) [18] and get the final probability distribution of the next token. Please note that we use two layers of MLP to achieve better representation capacity in ADA-EM. We use the ReLU activation function for the first layer in order to avoid over-fitting, while the softmax function is used for the second layer to normalize the output of the MLP layer and get the target output. Given the weight matrices $W_r$, $W_s$ and bias vectors $b_r$, $b_s$ for these two layers, the target output $p_t$ can be formulated as follows:

$$p_t = \mathrm{softmax}(b_s + W_s\,\mathrm{ReLU}(b_r + W_r y_t)).$$

We use $\frac{1}{T}\sum_{t=1}^{T} H(x_{t+1}, p_t)$ as the cross-entropy loss function along with resources in ADA-EM for two important purposes: for obtaining an anomaly score for every log line and as the training objective to update the model weights.
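The scoring path described in Sections II-C and II-D can be sketched in a few lines of NumPy. This is only an illustrative, untrained toy under stated assumptions, not the authors' implementation: the names `tokenize`, `build_vocab` and `TinyAdaEm`, the hidden size, the random weight initialization, the choice of delimiters, and the mapping of out-of-vocabulary tokens to "?" are all hypothetical choices made for the example.

```python
import re
from collections import Counter

import numpy as np


def tokenize(line, delimiters=" ,"):
    """Split a log line into word-level tokens on known delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, line) if tok]


def build_vocab(lines, max_size=1000):
    """Shared vocabulary: start flag, missing-data marker, most frequent tokens."""
    counts = Counter(tok for line in lines for tok in tokenize(line))
    frequent = [t for t, _ in counts.most_common(max_size) if t not in ("<b>", "?")]
    return ["<b>", "?"] + frequent


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class TinyAdaEm:
    """Untrained toy model: one LSTM cell (Eqs. 1-5) plus a two-layer MLP."""

    def __init__(self, vocab, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.index = {tok: i for i, tok in enumerate(vocab)}
        v = len(vocab)
        # One (W_x, W_y, b) triple per gate: input i, forget f, output o, cell c.
        self.gates = {g: (rng.normal(0, 0.1, (hidden, v)),
                          rng.normal(0, 0.1, (hidden, hidden)),
                          np.zeros(hidden)) for g in "ifoc"}
        self.W_r, self.b_r = rng.normal(0, 0.1, (hidden, hidden)), np.zeros(hidden)
        self.W_s, self.b_s = rng.normal(0, 0.1, (v, hidden)), np.zeros(v)

    def one_hot(self, tok):
        x = np.zeros(len(self.index))
        # Assumption: tokens outside the vocabulary are treated like missing data.
        x[self.index.get(tok, self.index["?"])] = 1.0
        return x

    def anomaly_score(self, tokens):
        """Mean cross-entropy between each next token and the predicted p_t."""
        xs = [self.one_hot(t) for t in ["<b>"] + tokens]
        y = np.zeros(self.W_r.shape[0])
        c = np.zeros_like(y)
        losses = []
        for x_t, x_next in zip(xs[:-1], xs[1:]):
            pre = {g: W_x @ x_t + W_y @ y + b
                   for g, (W_x, W_y, b) in self.gates.items()}
            i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
            c = f_t * c + i_t * np.tanh(pre["c"])            # Eq. (2)
            y = o_t * np.tanh(c)                              # Eq. (1)
            hidden = np.maximum(0.0, self.b_r + self.W_r @ y)  # ReLU layer
            p_t = softmax(self.b_s + self.W_s @ hidden)        # softmax layer
            losses.append(-np.log(p_t[np.argmax(x_next)]))     # H(x_{t+1}, p_t)
        return float(np.mean(losses))
```

As a usage example, `TinyAdaEm(build_vocab(lines)).anomaly_score(tokenize(lines[0]))` returns one loss value per log line. Because the weights here are random, the score stays close to the log of the vocabulary size; in a trained model, lines that follow learned normal patterns would be expected to score lower than unusual ones.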
Algorithm 1: Adaptive decision making algorithm
    Output: Decision, loss
    Decision = unknown
    for each log event d ∈ D do
        loss = Predict(m_i, d)
        if loss ≤ τ_i then
            select m_{i−1}
            Decision = Normal
        else
            select m_n
            Decision = Abnormal
        end
    end
    return Decision, loss

E. ADA Design

ADA is basically an event-driven anomaly detection framework that detects abnormal events in system logs. With the assumption that the initial training logs from the first hour represent normal behavior, we begin the online training to build deep neural network models similar to Sahoo et al. [14]. We start with building a shallow LSTM model with this normal log data. As time progresses, the number of models also increases, with each model having deeper layers than the previous model in time. The goal is to build a deep neural network model from the captured logs in real time and at the same time use these models to predict any anomalies in the logs. We acknowledge that at the beginning, when the model is not trained well, high false positives are observed; however, as we build deeper models, the accuracy gets better over time. Further, models can be added or updated according to the variations observed during predictions. For instance, during the first four hours, we generated a model $m_1$ with 128 LSTM layers and this model produced an F1-score of ~91% (see §IV). With the increasing number of log samples, the number of layers was also increased and we generated a second model $m_2$ with 256 layers which produced an F1-score of ~93%. This process is repeated until a deep enough model that produces the desired highest accuracy is obtained. We sort all the generated models based on their depth and represent them as a set $\{m_1, m_2, \cdots, m_n\}$ hereafter.

In ADA, every model in the model set is assigned a corresponding threshold value (see §III) for distinguishing the normal events from abnormal events. During decision making we use Algorithm 1, and every prediction is measured against the threshold for normal events established for the model. When an abnormal event is detected, the system immediately raises an alarm. The resulting decision drives the next model that should be used for predicting the next event in the log. By virtue of the adaptive principle as described in Algorithm 1, whenever a normal event is found we select a model that is shallower than the current model and a deeper model otherwise. The resulting prediction loss is forwarded to the threshold generator for any required updates.

[Fig. 2: Loss Distribution Fit. P-P plots of the empirical cumulative distribution against the theoretical cumulative distribution for Normal, Log-normal, Gamma, Exponential, Uniform and Logistic fits. (a) Normal Distribution. (b) Abnormal Distribution.]

In this work, we define an event as the operational unit of work for any system process which has a finite set of action sequences. Suppose that we have a set of deep neural network models where each model $m_i$ has $n_x$ layers. Each model $m_i$ has one configuration $c_i$, $i \in \{1, \cdots, n\}$. $C$ is the set of all possible configurations of models, $C = \{c_1, c_2, \cdots, c_n\}$. For each configuration $c_i$, we have two mappings of interest: a mapping from $c_i$ to its computation resource $R(c_i)$ and to its latency measure $T(c_i)$. ADA searches for the Pareto-optimal set $P$, such that there is no alternative model $m_i'$ with a configuration $c_i'$ that requires less computation resource ($R$) and offers lower latency ($T$). Formally, $P$ is defined as follows:

$$P = \{c_i \in C : \{c_i' \in C : R(c_i') < R(c_i),\ T(c_i') < T(c_i)\} = \emptyset\}$$

As mentioned before, we treat log lines as sequences of tokens, and ADA learns normal behavior for a set of users who produced a stream of system logs as follows:
1) When a log stream arrives, the language processing model (LMP) is first utilized to abstract the tokens and build the log feature vector.
2) The pareto-optimal policy discussed above is used to select the optimal model $m_i$ for the current system state.
3) The next model is selected by the feedback of prediction results from the current model following Algorithm 1. If the prediction result is normal, we select $m_{i-1}$, which is the pareto-optimal model as it is shallower than the current model and hence uses less computation and incurs less latency. Otherwise, the model $m_n$, which is the deepest available model, is chosen, as it provides higher accuracy and is desired when an abnormal event is detected, at the cost of increased computation.
4) The threshold $\tau$ is selected for each model based on the loss distribution of normal and abnormal data processed by the model and is discussed in the following section.

III. THRESHOLD COMPUTATION

The behavioural patterns in system logs change over time as newer threats and attacks are introduced to the system. Thus, predetermined thresholds for normal and abnormal behaviours become inadequate and need to be updated regularly. Therefore, in this work we introduce a new dynamic threshold computation technique that automatically learns from the ongoing behaviours in the system and adjusts the corresponding threshold for detecting anomalies.

Essentially, we inspect the loss distribution of logged events to automatically determine the threshold for every ADA-EM model. During the initial phase, for instance, the first hour, we only have the normal log loss distribution, and hence in order to compute the initial threshold we use a small number of abnormal events with ground truth from the LANL dataset. Subsequently, once the models are introduced to more incoming events, the initial threshold and the loss distribution of normal and abnormal events are updated dynamically to reflect the observed behaviour in the log stream.

We studied several classical loss distributions and employed the Probability-Probability (P-P) plot to assess the appropriate fit for the LANL dataset and find the optimal distribution for determining the threshold. In Fig. 2 we plot the empirical distribution of losses obtained from the ADA-EM models against the best fitting theoretical distributions. From this figure, we can clearly see that for both normal and abnormal loss distributions, log-normal is the best fit. Therefore, in ADA, we consider that log loss data follows the log-normal distribution. In Fig. 3(a) we show the losses with log-normal distribution for 10000 normal and 200 abnormal log events randomly chosen from the dataset. It is clear that the point of intersection of both fits yields the best threshold choice in Fig. 3(a), as the sum of the False Negative Rate (FNR) and the False Positive Rate (FPR) is minimum (see Proposition 1).

[Fig. 3: ADA Dynamic Threshold. (a) Log normal distribution of log loss data: density over loss for the original distribution, the normal fit and the abnormal fit, with the threshold and the FNR and FPR regions marked. (b) Threshold update over time steps: threshold of Model1, Model2 and Model3 over 50 time steps.]

Definition 1: For both normal and abnormal losses obtained by the ADA-EM models, the loss distributions are fitted by the log-normal distribution. Hence, the probability density function of real log data $p_{loss}(x)$ can be estimated by the log-normal density function $f^{\mu,\sigma}(x)$:

$$p_{loss}(x) \approx f^{\mu,\sigma}(x),$$

where,

$$f^{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left(-\frac{(\log(x) - \mu)^2}{2\sigma^2}\right), \quad x > 0.$$

The parameters $\mu$ and $\sigma$ are estimated via maximum likelihood estimation based on the loss dataset. The challenge now is to find an optimal threshold based on the fits of losses. The loss threshold finding problem can be cast as the optimization problem formulated below,

$$\min_x L(x) = (1 - F_X^N(x)) + F_X^A(x), \quad x > 0, \tag{6}$$

where $F_X^N(x)$ and $F_X^A(x)$ are the Cumulative Distribution Functions (CDF) of the log-normal distribution [19] fitted to the normal and abnormal losses, respectively. $X$ is the variable and $x$ is a loss value of the variable $X$.

Proposition 1: Two different log-normal densities always have an overlapping section with the condition that $\beta^2 - 4\alpha\gamma \geq 0$, where $\alpha = \frac{1}{2\sigma_2^2} - \frac{1}{2\sigma_1^2}$, $\beta = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}$ and $\gamma = \frac{\mu_2^2}{2\sigma_2^2} - \frac{\mu_1^2}{2\sigma_1^2} - \log\left(\frac{\sigma_1}{\sigma_2}\right)$. The loss threshold $x \in \{x_1, x_2\}$ that satisfies Eq. (6) is the optimal loss threshold $x^*$, where $x_1$ and $x_2$ are the intersections of both log-normal fits $f_N^{\mu_1,\sigma_1}(x)$ and $f_A^{\mu_2,\sigma_2}(x)$.

Proof 1: Let $\mu_1, \sigma_1$ and $\mu_2, \sigma_2$, where $\sigma_1, \sigma_2 > 0$, be the corresponding parameters of the density functions. Then

$$\frac{1}{\sqrt{2\pi}\,\sigma_1 x} \exp\left(-\frac{(\log(x) - \mu_1)^2}{2\sigma_1^2}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_2 x} \exp\left(-\frac{(\log(x) - \mu_2)^2}{2\sigma_2^2}\right). \tag{7}$$

Applying the logarithm to Eq. (7) and collecting all terms on one side, we have,

$$\frac{(\log(x) - \mu_2)^2}{2\sigma_2^2} - \frac{(\log(x) - \mu_1)^2}{2\sigma_1^2} - \log\left(\frac{\sigma_1}{\sigma_2}\right) = 0. \tag{8}$$

Eq. (8) can be written in terms of a quadratic function in $\log(x)$,

$$\alpha(\log(x))^2 + \beta\log(x) + \gamma = 0, \tag{9}$$

where $\alpha = \frac{1}{2\sigma_2^2} - \frac{1}{2\sigma_1^2}$, $\beta = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}$ and $\gamma = \frac{\mu_2^2}{2\sigma_2^2} - \frac{\mu_1^2}{2\sigma_1^2} - \log\left(\frac{\sigma_1}{\sigma_2}\right)$.

Now, according to Definition 1, $L(x)$ can be written as,

$$L(x) = (1 - F_X^N(x)) + F_X^A(x) = P_N(X \geq x) + P_A(X \leq x) = \int_x^{\infty} f_N^{\mu_1,\sigma_1}(t)\,dt + \int_0^x f_A^{\mu_2,\sigma_2}(t)\,dt. \tag{10}$$

Since $f^{\mu,\sigma}(x)$ is a continuous function in the interval $(0, +\infty)$, the first order partial derivative of $L(x)$ with respect to $x$, for $x > 0$, is given as,

$$\frac{\partial L(x)}{\partial x} = -f_N^{\mu_1,\sigma_1}(x) + f_A^{\mu_2,\sigma_2}(x) = -\frac{1}{\sqrt{2\pi}\,\sigma_1 x} \exp\left(-\frac{(\log(x) - \mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sqrt{2\pi}\,\sigma_2 x} \exp\left(-\frac{(\log(x) - \mu_2)^2}{2\sigma_2^2}\right).$$

Letting $\frac{\partial L(x)}{\partial x} = 0$, we get $x_1$ and $x_2$ as given by Eq. (9), which are the candidate loss points. We observe that Eq. (9) can have at most 2 solutions. Furthermore, it can be shown that two different $x_1$ and $x_2$ fulfill,

$$x_1 = e^{\frac{-\sqrt{\beta^2 - 4\alpha\gamma} - \beta}{2\alpha}}, \quad x_2 = e^{\frac{\sqrt{\beta^2 - 4\alpha\gamma} - \beta}{2\alpha}},$$

where $\beta^2 - 4\alpha\gamma \geq 0$. The optimal threshold is computed as the intersection at $x = x^*$ that fulfills Eq. (6), which concludes the proof of Proposition 1.
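The threshold computation of Proposition 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the coefficients of the quadratic in $\log(x)$ are obtained by equating the two fitted densities of Eq. (7), and the handling of equal standard deviations (where the quadratic degenerates to a linear equation) is an added assumption that the text does not discuss.

```python
import math

import numpy as np


def fit_lognormal(losses):
    """Maximum likelihood estimates (mu, sigma) of a log-normal fit."""
    logs = np.log(np.asarray(losses, dtype=float))
    return float(logs.mean()), float(logs.std())  # ddof=0 is the MLE


def lognormal_cdf(x, mu, sigma):
    """CDF of the log-normal distribution at x > 0."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def objective(x, normal, abnormal):
    """L(x) of Eq. (6): normal mass above x plus abnormal mass below x."""
    return (1.0 - lognormal_cdf(x, *normal)) + lognormal_cdf(x, *abnormal)


def optimal_threshold(normal, abnormal):
    """Pick the density intersection that minimizes L(x).

    `normal` and `abnormal` are (mu, sigma) pairs of the two fits.
    """
    (mu1, s1), (mu2, s2) = normal, abnormal
    alpha = 1.0 / (2 * s2**2) - 1.0 / (2 * s1**2)
    beta = mu1 / s1**2 - mu2 / s2**2
    gamma = mu2**2 / (2 * s2**2) - mu1**2 / (2 * s1**2) - math.log(s1 / s2)

    if abs(alpha) < 1e-12:
        # Assumption: with equal sigmas the equation is linear in log(x).
        if abs(beta) < 1e-12:
            raise ValueError("the two fits are identical; no unique threshold")
        roots = [-gamma / beta]
    else:
        disc = beta**2 - 4 * alpha * gamma
        if disc < 0:
            raise ValueError("the fitted densities do not intersect")
        r = math.sqrt(disc)
        roots = [(-beta - r) / (2 * alpha), (-beta + r) / (2 * alpha)]

    candidates = [math.exp(root) for root in roots]
    return min(candidates, key=lambda x: objective(x, normal, abnormal))
```

For example, with hypothetical fits `normal = (-1.5, 0.4)` and `abnormal = (0.0, 0.5)`, `optimal_threshold(normal, abnormal)` returns the intersection lying between the two modes, roughly 0.45; the second intersection sits far in the lower tail and gives a larger value of L(x). In practice the two parameter pairs would come from `fit_lognormal` applied to the stored normal and abnormal losses of a model.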
The variations for the proposed dynamic threshold depend on the recently observed log patterns. In this work, during experiments, we record the most recent 10000 normal event losses and 200 abnormal event losses and update the threshold whenever more than 20% of the recorded losses have changed. As such, the loss for model $m_1$ ranges from 0.1 - 3.1 and the threshold ranges from 0.41 - 0.46. Since normal log data makes up the majority of the whole dataset, more than 80% of log event losses are clustered between 0.1 - 0.4. Correspondingly, the loss for model $m_2$ ranges from 0.1 - 3.0 and the threshold ranges from 0.34 - 0.35, whereas the loss for model $m_3$ ranges from 0.1 - 2.0 and the threshold ranges from 0.23 - 0.24. In ADA, as time progresses, the threshold for each model is updated to adapt to newly observed log patterns.

In Fig. 3(b) we provide a detailed illustration of the thresholds updated for three different models over a span of 50 time steps, where each time step is composed of 200 log events. Since we employ the proposed adaptive strategy from Algorithm 1 during decision making, the pareto-optimal policy selects the less computationally intensive shallower model $m_1$ to predict more than 80% of the whole test data. Essentially, we use shallow models for predicting normal data and switch to deeper models only when we detect abnormal data. Therefore $m_1$'s threshold is subjected to more changes than those of the other models. Subsequently, since few abnormal events are observed, the deeper model $m_3$ is rarely used and hence its threshold remains almost stable during the testing.

IV. EVALUATION

In this section we evaluate the ADA framework and compare its performance with the state-of-the-art approaches.

A. Dataset

In this work, we use the Los Alamos National Laboratory (LANL) Cyber Security Dataset [15] to evaluate the ADA framework. The LANL dataset consists of Windows-based system authentication event logs from LANL’s internal computer network. These logs were collected for a period of 58 consecutive days. The dataset contains over one billion log entries comprising authentication, network flow, DNS lookup events and processes. Privacy-related fields such as users, computers, and processes are anonymized in the dataset. The network activities recorded in the system logs include both normal operational network activities as well as a series of abnormal activities termed the red team activities. The red team activities mainly represent compromised account credentials over a period of 30 days. Further, we used authentication events from the dataset, and the corresponding statistics are summarized in Table I.

TABLE I: LANL dataset statistics.
Dataset          Events          Source Computers   Destination Computers
Authentication   1,051,430,459   16,230             15,895
Red team         749             4                  301

B. Experimental Setup

In ADA, using online deep learning similar to the algorithm described by Sahoo et al. [14], we build multiple ADA-EM neural network models using LSTM and generate features with LMP [16].

For the evaluation, we build four models in ADA with 128, 256, 512 and 1024 LSTM layers and compare them to the state-of-the-art approaches. In order to demonstrate the ability of the models, we limit the scope of the LANL dataset to events recorded on day 8, as it contains the largest number of abnormal events (261) in over seven million system log events. All experiments are performed on a MacBook Pro laptop with an Intel Core i5 CPU at 2.9 GHz and 8 GB (LPDDR3 2133 MHz) of RAM.

C. Baselines

In this work, the proposed ADA framework with the strategies for adaptive model selection and dynamic threshold is employed to build an online unsupervised anomaly detection system. Further, the model $m_3$ performs optimally in terms of F1-score and latency and hence we select $m_3$ as the benchmark model in ADA when applying the adaptive algorithm. We compare the performance of this ADA system to the following baselines.

State-of-the-art unsupervised deep learning approaches:
1) NO-ADA: The NO-ADA system is a variant of the ADA framework with online training and unsupervised learning, however, without the adaptive algorithm and dynamic thresholds. The performance of this baseline is obtained with the benchmark model $m_3$.
2) Fixed-ADA: The Fixed-ADA system is also a variant of the ADA framework with online training and unsupervised learning. In addition, it uses the adaptive strategy shown in Algorithm 1 but the threshold of each model is kept constant like NO-ADA.
3) RNNA [20]: The RNNA system uses the state-of-the-art online and unsupervised learning approach where an anomaly detection system is constructed using LSTM networks with attention algorithms.
4) Kitsune [21]: The Kitsune system uses the state-of-the-art online and unsupervised learning approach that uses the Autoencoder algorithms [22] to differentiate between normal and abnormal events.
5) DAGMM [23]: The DAGMM system uses the state-of-the-art offline unsupervised learning approach called the Deep Autoencoding Gaussian Mixture Model.

State-of-the-art unsupervised classical learning methods:
1) Isolation Forest (IF) [24]: It is an ensemble-based method for detecting outliers in the dataset.
2) One-class SVM (OCSVM) [25]: It is a max-margin based [26] incremental unsupervised outlier detection model.
3) Self Organizing Maps (SOM) [27]: The SOM networks aim at dividing the p-dimensional input space into a finite number of partitions. The whole process is unsupervised
Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on June 12,2023 at 07:04:44 UTC from IEEE Xplore. Restrictions apply.
2454
100 1.00 1.2
Latency τ1 = 0.45924 M1
80 Exponential
0.95 0.9437 0.9525 1.0 τ2 = 0.34472
0.9333 τ3 = 0.23834 τ3 = 0.23832 M2
0.8 M3
Latency(ms)
60 0.9130
F1-score
Models
0.90 0.6
Loss
40 τ3 = 0.23834 M3
0.4 τ1 = 0.45924 M1
0.85
20 τ2 = 0.34470
0.2 τ2 = 0.34472 τ3 = 0.23832 τ = 0.23829 M2
0 0.80 3 M3
M1 M2 M3 M4 M1 M2 M3 M4 0.0
Models Models 1(N) 2(N) 3(A) 4(A) 5(A) 6(N) 7(A) 8(A) 9(N) 10(N)
Log Event
(a) Latency. (b) F1-Score. Fig. 5: Adaptive strategy with dynamic Threshold.
Fig. 4: ADA Online Model Performance. 40
M3
and it is carried out by presenting the vectors to all
30 ADA
Latency(ms)
encompassing neurons. NO-ADA
Models
4) Angle-Based Outlier Detection (ABOD) [28]: ABOD M2
20
detects the anomalies by the variance of angles between
M1
pairs of data samples. 10
1(N) 2(N) 3(A) 4(A) 5(A) 6(N) 7(A) 8(A) 9(N) 10(N)
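To make the angle-variance idea behind the ABOD baseline concrete, the following is a minimal sketch in plain Python. It is a simplified, unweighted variant of the angle-based outlier factor and is not the implementation used in the experiments; the function names and the toy data are assumptions made purely for illustration. For each point, it collects the distance-weighted angle term for every pair of other points and takes the variance of those terms. A point far from the rest sees all other points in nearly the same direction, so its variance is small.

```python
from itertools import combinations
from statistics import pvariance


def angle_variance_score(point, others):
    """Variance of distance-weighted angle terms seen from `point` (hypothetical helper)."""
    terms = []
    for b, c in combinations(others, 2):
        ab = [bi - pi for bi, pi in zip(b, point)]
        ac = [ci - pi for ci, pi in zip(c, point)]
        norm_ab_sq = sum(x * x for x in ab)
        norm_ac_sq = sum(x * x for x in ac)
        if norm_ab_sq == 0 or norm_ac_sq == 0:
            continue  # skip exact duplicates of `point`
        dot = sum(x * y for x, y in zip(ab, ac))
        # Cosine of the angle, additionally down-weighted by both distances.
        terms.append(dot / (norm_ab_sq * norm_ac_sq))
    return pvariance(terms)


def abod_scores(data):
    """Score every sample; a LOW score suggests an outlier."""
    return [
        angle_variance_score(p, data[:i] + data[i + 1:])
        for i, p in enumerate(data)
    ]


# Toy data (assumed): five clustered points and one distant point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
scores = abod_scores(data)
outlier_index = min(range(len(data)), key=scores.__getitem__)
print("most outlying sample:", data[outlier_index])
```

The sketch enumerates all pairs, so its cost grows cubically with the number of samples; it is meant to show the principle, not to be used at log scale.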
D. Metrics

The following metrics are used to evaluate the performance of ADA and the baseline state-of-the-art approaches.

1) F-measure (F1-score): It is the harmonic mean of precision and recall [29] and is given as follows:

\[ \text{F1-score} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}, \]

where the precision is defined as

\[ \text{Precision} = \frac{TP}{TP + FP}, \]

where $TP$ is True Positives and $FP$ is False Positives.

2) Recall: It is a measure of how many of the actual positive instances were identified correctly and is given as follows:

\[ \text{Recall} = \frac{TP}{TP + FN}, \]

where $FN$ is False Negatives.

3) Accuracy: It measures how correctly an anomaly detection system operates, as the proportion of TP and True Negatives (TN) among all predictions, including the errors FP and FN that the system produces [30]. It is given as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}. \]

4) Latency: The latency is measured from the moment a log event is sent as input to the model until the output prediction is obtained.

E. Online Model Performance

To evaluate the performance of ADA, we built four ADA-EM models for the following experiments using an online deep learning algorithm similar to that of Sahoo et al. [14]. We began by training the first model m1 with 128 LSTM layers, followed by the model m2 with 256 LSTM layers, model m3 with 512 LSTM layers and the model m4 with 1024 LSTM layers.

To ensure that the models built using online deep learning operated efficiently before applying the adaptive strategy from Algorithm 1 and the dynamic threshold discussed in §III, we measured the latency and F1-score of the models. For latency, we measured the execution time incurred by the models for predicting an event; the results are shown in Fig. 4(a). Noticeably, model m1 had the lowest latency of 14 ms as it had only 128 LSTM layers, while model m4 with 1024 LSTM layers had the highest latency of 87 ms. We noticed that as the number of layers increased, latency increased exponentially. Further, we also measured the F1-score of the models; the results are shown in Fig. 4(b). We observed that, as the number of layers increased, the depth of the model also increased and hence the deeper models produced a better F1-score than their shallower counterparts. However, the deeper models have higher latency and demand more computational resources than shallower models. Therefore we only consider models m1-m3 for the evaluation, as they had lower latency and produced acceptable F1-scores. Before proceeding further, the proposed adaptive strategy shown in Algorithm 1 needed a benchmark model to begin with. Therefore, we selected the model m2 as our benchmark model by considering the trade-off between accuracy and latency of the three models.

Fig. 6: Latency with ADA vs NO-ADA. (Latency in ms and selected model, M1-M3, per log event, events 1-10, each marked N or A.)

F. ADA Performance

We evaluated the performance of ADA using these models by applying the adaptive strategy from Algorithm 1 and the dynamic threshold from §III. For these experiments, we randomly selected 10 log events and simulated an attack scenario: events 1-2 are normal authentications, events 3-5 are unauthorized authentications, event 6 is a normal authentication attempt, events 7-8 are again unauthorized login attempts, and events 9-10 are normal authentications. Following the adaptive principle and dynamic threshold, the model used for predicting each event and the corresponding threshold used for distinguishing normal and abnormal events are shown in Fig. 5, where normal and abnormal events are indicated on the x-axis with the letters N and A next to events 1-10.

As shown in Fig. 5, during the initial stage, the benchmark model m2 was selected for predicting the first event. As this log event was normal, according to Algorithm 1 the next pareto-optimal model m1 was selected for predicting the second log event. However, the computationally intensive model m3 was selected to predict the fourth event to produce high accuracy, as the third event was found to be abnormal. The threshold of the model used for prediction is updated and recalculated according to §III. Since the number of log events
was very small in this experiment, the threshold values did not change that frequently. The results clearly show the benefit of applying the adaptive strategy: the pareto-optimal policy selected shallower models every time a normal event was observed, without compromising the identification of abnormal events, which were effectively predicted with deeper models to produce higher accuracy.

We also compared the performance of ADA with the NO-ADA variant; the results are shown in Fig. 6. In NO-ADA there was no adaptive strategy, and hence the pareto-optimal policy selected the deepest neural network model m3, as it produces the highest accuracy at the cost of increased computational resources and latency. Therefore, with NO-ADA, we achieved an accuracy of 94% at the expense of increased computational resources and a latency of 37 ms per log event. In contrast, ADA achieved an accuracy of 91%, which is close to that of m3, but with less computational resources and 57.6% lower latency.

G. Storage

Further, we observed that in system logs the percentage of abnormal events is very low in comparison to normal events. However, due to the necessity to train the anomaly detection models, organizations usually store the data for prolonged periods. Therefore, in this work we propose to store only the abnormal data and the most recently observed loss values for the normal data and abnormal data. This is sufficient for ADA to accurately predict the behaviours in the system logs and dynamically compute thresholds based on current behaviours. The resulting storage required with ADA, in comparison to existing methods (which for simplicity we refer to as NO-ADA), is shown in Fig. 7. It is evident that with ADA we incur far less storage cost, as we consume just 48 KB of storage for the LANL dataset in comparison to storing the entire 74 GB of the dataset.

Fig. 7: Storage with ADA vs NO-ADA. (Storage in GB per method; bar labels: NO-ADA 74GB, ADA 49KB.)

For the above-mentioned experiments, in Table II we list the F1-score, accuracy, recall and latency of the NO-ADA, Fixed-ADA and ADA variants. We observed that NO-ADA performs well in terms of F1-score, accuracy and recall, but at the cost of increased latency due to increased model complexity. Noticeably, ADA achieved the highest F1-score of 95% with the lowest computational latency of 15.7 ms among the three variants. Further, the performance of ADA is very close to NO-ADA in terms of accuracy and recall and much better in terms of latency. This is because normal log events made up a higher percentage of the system log events. In ADA, due to the adaptive strategy shown in Algorithm 1, during normal log events the pareto-optimal policy tends to select the model with less computational cost and latency that still produces acceptable accuracy. The best model, which produces the highest accuracy, is only employed when an abnormal log event appears, as such models demand increased computational resources. The adaptive strategy with a fixed threshold also decreases the latency in the Fixed-ADA variant. In addition, since the threshold in ADA is updated continuously to adapt to newly observed log patterns in the online system, ADA achieves a higher F1-score than NO-ADA and Fixed-ADA. Moreover, in Fixed-ADA, we observed that the number of False Positives increased during testing, and therefore the deeper models were used more often than in ADA.

TABLE II: Fixed threshold and adaptive threshold comparison.

Model       F1-score (%)  Accuracy (%)  Recall (%)  Latency (ms)
NO-ADA      94            94            93          37
Fixed-ADA   91            83            92          18.8
ADA         95            91            92          15.7

H. Performance Comparison

Studies by Mirsky et al. [21] show that offline machine learning algorithms perform considerably better than online machine learning algorithms. Since they have access to the entire dataset during training, offline algorithms perform multiple passes over the data to build a good model that has learned well. However, constructing neural network models using online algorithms is useful when resources such as training data, computation and/or memory are limited. Since we utilize online algorithms [14] and enhance the anomaly detection with adaptive model selection and dynamic thresholds, we compared the performance of ADA to both state-of-the-art online and offline algorithms.

In Table III we summarize the experimental results for the state-of-the-art classical unsupervised machine learning algorithms along with ADA for anomaly detection in system logs. As these classical algorithms use offline learning to train and build the models, we used 80% of the normal events from day 8 in the LANL dataset [15] to train and build models for these algorithms. For testing we used 10000 normal events from the remaining 20% of the logs and 200 abnormal events.

TABLE III: Machine learning results comparison.

Model        F1-score (%)  Accuracy (%)  Recall (%)  Latency (ms)
SOM [27]     48            92            33          0.03
IF [24]      47            88.3          84.5        0.17
OCSVM [25]   37            60            95          0.35
ABOD [28]    25            95            27          0.8
ADA          95            91            92          15.7

Overall, we observed that classical unsupervised machine learning methods demanded very low computational resources and produced results with very low latency, at most ~0.8 ms. However, due to the unbalanced and complex nature of the logged events, their F1-scores are very low in comparison to ADA even though they were pre-trained with the ground truth.

Since ADA is based on unsupervised deep learning, we also compared the performance of ADA with other state-of-
the-art deep learning algorithms and the results are given in Table IV. Overall, the results presented in Table IV clearly show that ADA outperforms RNNA, Kitsune and DAGMM in terms of F1-score, accuracy and recall. Specifically, the F1-score achieved by ADA is about 11% higher compared to RNNA, 144% higher compared to Kitsune and 116% higher compared to DAGMM. With regard to accuracy, RNNA, Kitsune and DAGMM are lower than ADA by nearly 16%, 30% and 23%, respectively, whereas the recall of RNNA, Kitsune and DAGMM is nearly 15%, 40% and 49% lower than that of ADA. As for latency, RNNA performed the worst since it needed 530 ms to predict just one event. This is due to the design of RNNA, which takes a user perspective and uses a separate LSTM for each user in the dataset. As the number of users increases, the overall latency of the system also increases. Although Kitsune and DAGMM had lower latency compared to ADA and RNNA, they did not perform well w.r.t. the other metrics. Therefore we conclude that with ADA we efficiently utilize the available resources, produce highly accurate deep neural network models and consume far fewer resources compared to the state-of-the-art approaches.

TABLE IV: Deep learning results comparison.

Model          F1-score (%)  Accuracy (%)  Recall (%)  Latency (ms)
RNNA [20]      86            76            78          530
Kitsune [21]   39            64            55          8.6
DAGMM [23]     44            70            47          0.5
ADA            95            91            92          15.7

V. RELATED WORK

We classify the related works into the following categories.

Supervised learning approaches: Supervised learning techniques like [6], [7], [31], [32] have been widely explored for network anomaly detection. In these methods, both normal and abnormal vectors are required to train a binary classifier to detect future anomalies. The main disadvantage of these approaches is that they heavily depend on an expert’s experience to label the log data. However, even for an expert it is challenging to distinguish and define anomalies, and labeling is often expensive and labor intensive. Another downside of these approaches is that newer anomalies that were not part of the training data may not be detected.

Unsupervised learning approaches: Recently, Du et al. [4] proposed an online log anomaly detector where customized parsing methods are employed on the logs to generate sequences for LSTM or clustering algorithms to detect DoS attacks. However, they require pre-defined and customized parsing methods for detection. In [11], [20] the authors employed RNN and LSTM attention algorithms to improve the performance of anomaly detection in logs. However, they are defined from a user’s perspective, where each user’s behaviours are trained on a separate LSTM, and hence the number of LSTM networks increases with the increase in users over time. Mirsky et al. [21] presented the Kitsune framework, which uses an ensemble of autoencoders for network anomaly detection. However, Kitsune adds a separate feature extraction step which requires a basic understanding of the underlying network protocols. In [9], the authors used adversarial training of VAE to detect anomalies in Key Performance Indicators (KPIs) of the network, but this is not suitable for online systems since the adversarial training is unstable and difficult to converge. Khatuya et al. [33] proposed ADELE, with the aim of selecting features from system logs in order to create the groundwork for a proactive, online failure prediction system. However, this system can perform only short-term failure predictions in the current environment. Moreover, it needs to be trained for at least a month to compute the anomaly score.

The authors of [34] highlight that anomaly detection is time sensitive and decisions have to be made in a streaming fashion. This enables the system administrators to intervene in an ongoing attack or fix a system performance issue. In this regard, offline learning strategies like [6], [9], [12], [31], which require prior knowledge of the patterns of normal and abnormal events, are not suitable for new detection systems or for existing systems trained on outdated events. Unlike the aforementioned research works, our proposed ADA framework builds anomaly detection systems using unsupervised online learning with the adaptive strategy and computes thresholds dynamically to improve the F1-score and accuracy of anomaly detection and to reduce the latency and storage of the overall system.

VI. CONCLUSION AND FUTURE WORK

In this work we studied the importance of system logs and anomaly detection systems. We studied the limitations of existing anomaly detection approaches such as increased computational overhead, latency, failure to detect new threats, etc. We then presented ADA, an adaptive deep log anomaly detection framework for efficiently detecting anomalies in system logs. To overcome the limitations of existing works, we proposed to build online unsupervised models with adaptive model selection and dynamic thresholds to improve latency, reduce computational overhead and support dynamic updates. Using the pareto-optimal policy we always select the optimal model with the best configuration according to the current system environment. With experiments we demonstrate that ADA improves the performance of log anomaly detection significantly, in particular the F1-score and latency compared to state-of-the-art methods, along with decreased storage. As part of future work we will explore and incorporate other deep neural network algorithms, e.g., variational autoencoders, and verify the benefits compared to the current design. In addition, we will collect and integrate log data by increasing the scale and scope to different systems and test the robustness of anomaly detection systems built using the ADA framework.

VII. ACKNOWLEDGEMENT

This work has been funded by the EU H2020 COSAFE project (Grant ID: 824019) and the China Scholarship Council (Grant ID: 201706050095). The authors also thank Ricardo Zimmer, Trevor Khwam Tabougua and Shahroz Shahjahan for their helpful inputs on the topic of adaptive threshold.
REFERENCES

[1] A. Bohara, M. A. Noureddine, A. Fawaz, and W. H. Sanders, “An unsupervised multi-detector approach for identifying malicious lateral movement,” in 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 224–233, IEEE, 2017.
[2] N. Weiss and R. Miller, “The Target and other financial data breaches: Frequently asked questions,” 2015 (last accessed July 20, 2019). https://fas.org/sgp/crs/misc/R43496.pdf.
[3] TrendMicro, “APT myths and challenges,” 2012 (last accessed July 20, 2019). http://blog.trendmicro.com/trendlabs-security-intelligence/infographic-apt-myths-and-challenges/.
[4] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1285–1298, ACM, 2017.
[5] Amazon, “CloudWatch pricing,” 2019 (last accessed July 30, 2019). https://aws.amazon.com/cloudwatch/pricing/?nc1=h_ls.
[6] Y. Li, J. Sun, W. Huang, and X. Tian, “Detecting anomaly in large-scale network using mobile crowdsourcing,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 2179–2187, IEEE, 2019.
[7] B. Le Bars and A. Kalogeratos, “A probabilistic framework to node-level anomaly detection in communication networks,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 2188–2196, IEEE, 2019.
[8] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy, “Sherlog: error diagnosis by connecting clues from run-time logs,” in ACM SIGARCH computer architecture news, vol. 38, pp. 143–154, ACM, 2010.
[9] W. Chen, H. Xu, Z. Li, D. Pei, J. Chen, H. Qiao, Y. Feng, and Z. Wang, “Unsupervised anomaly detection for intricate kpis via adversarial training of vae,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 1891–1899, IEEE, 2019.
[10] Keras, “LSTM,” 2019 (last accessed July 30, 2019). https://keras.io/layers/recurrent/#lstm.
[11] A. R. Tuor, R. Baerwolf, N. Knowles, B. Hutchinson, N. Nichols, and R. Jasper, “Recurrent neural network language models for open vocabulary event-level cyber anomaly detection,” in Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] N. Zhao, J. Zhu, R. Liu, D. Liu, M. Zhang, and D. Pei, “Label-less: A semi-automatic labelling tool for kpi anomalies,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 1882–1890, IEEE, 2019.
[13] A. R. Tuor, R. Baerwolf, N. Knowles, B. Hutchinson, N. Nichols, and R. Jasper, “Recurrent neural network language models for open vocabulary event-level cyber anomaly detection,” in Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[14] D. Sahoo, Q. Pham, J. Lu, and S. C. Hoi, “Online deep learning: learning deep neural networks on the fly,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2660–2666, AAAI Press, 2018.
[15] A. D. Kent, “Cyber security data sources for dynamic network research,” in Dynamic Networks and Cyber-Security, pp. 37–65, World Scientific, 2016.
[16] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, 2018.
[17] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation. 1988.
[19] W. J. Dixon and F. J. Massey Jr, “Introduction to statistical analysis,” 1951.
[20] A. Brown, A. Tuor, B. Hutchinson, and N. Nichols, “Recurrent neural network attention mechanisms for interpretable system log anomaly detection,” in Proceedings of the First Workshop on Machine Learning for Computing Systems, p. 1, ACM, 2018.
[21] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, “Kitsune: an ensemble of autoencoders for online network intrusion detection,” arXiv preprint arXiv:1802.09089, 2018.
[22] P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” in Proceedings of ICML workshop on unsupervised and transfer learning, pp. 37–49, 2012.
[23] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” ICLR, 2018.
[24] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation-based anomaly detection,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 6, no. 1, p. 3, 2012.
[25] Y. Chen, X. S. Zhou, and T. S. Huang, “One-class svm for learning in image retrieval,” in ICIP (1), pp. 34–37, Citeseer, 2001.
[26] Y. Kong and Y. Fu, “Max-margin action prediction machine,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp. 1844–1858, 2015.
[27] L. Aguayo and G. A. Barreto, “Novelty detection in time series using self-organizing neural networks: A comprehensive evaluation,” Neural Processing Letters, vol. 47, no. 2, pp. 717–744, 2018.
[28] H.-P. Kriegel, A. Zimek, et al., “Angle-based outlier detection in high-dimensional data,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 444–452, ACM, 2008.
[29] A. A. Ghorbani, W. Lu, and M. Tavallaee, Network intrusion detection and prevention: concepts and techniques, vol. 47. Springer Science & Business Media, 2009.
[30] S. Axelsson, “The base-rate fallacy and the difficulty of intrusion detection,” ACM Transactions on Information and System Security (TISSEC), vol. 3, no. 3, pp. 186–205, 2000.
[31] Y. Yuan, G. Kaklamanos, and D. Hogrefe, “A novel semi-supervised adaboost technique for network anomaly detection,” in Proceedings of the 19th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 111–114, ACM, 2016.
[32] R. Bitton and A. Shabtai, “A machine learning-based intrusion detection system for securing remote desktop connections to electronic flight bag servers,” IEEE Transactions on Dependable and Secure Computing, 2019.
[33] S. Khatuya, N. Ganguly, J. Basak, M. Bharde, and B. Mitra, “Adele: Anomaly detection from event log empiricism,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pp. 2114–2122, IEEE, 2018.
[34] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407, 2019.