ARCADE Adversarially Regularized Convolutional Autoencoder For Network Anomaly Detection

This paper proposes ARCADE, an unsupervised deep learning approach for network anomaly detection. ARCADE uses a convolutional autoencoder trained exclusively on normal traffic to build a profile. It can detect anomalies and intrusions by examining only the first few packets of network flows. The method is more effective than state-of-the-art techniques while having significantly fewer parameters.


IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 20, NO. 2, JUNE 2023 1305

ARCADE: Adversarially Regularized Convolutional Autoencoder for Network Anomaly Detection

Willian Tessaro Lunardi, Member, IEEE, Martin Andreoni Lopez, Member, IEEE, and Jean-Pierre Giacalone

Abstract—As the number of heterogeneous IP-connected devices and traffic volume increase, so does the potential for security breaches. The undetected exploitation of these breaches can bring severe cybersecurity and privacy risks. Anomaly-based Intrusion Detection Systems (IDSs) play an essential role in network security. In this paper, we present a practical unsupervised anomaly-based deep learning detection system called ARCADE (Adversarially Regularized Convolutional Autoencoder for unsupervised network anomaly DEtection). With a convolutional Autoencoder (AE), ARCADE automatically builds a profile of the normal traffic using a subset of raw bytes of a few initial packets of network flows so that potential network anomalies and intrusions can be efficiently detected before they cause more damage to the network. ARCADE is trained exclusively on normal traffic. An adversarial training strategy is proposed to regularize and decrease the AE’s capabilities to reconstruct network flows that are out-of-the-normal distribution, thereby improving its anomaly detection capabilities. The proposed approach is more effective than state-of-the-art deep learning approaches for network anomaly detection. Even when examining only two initial packets of a network flow, ARCADE can effectively detect malware infection and network attacks. ARCADE presents 20 times fewer parameters than baselines, achieving significantly faster detection speed and reaction time.

Index Terms—Unsupervised anomaly detection, autoencoder, generative adversarial networks, automatic feature extraction, deep learning, cybersecurity.

Manuscript received 30 April 2022; revised 19 November 2022 and 25 November 2022; accepted 13 December 2022. Date of publication 16 December 2022; date of current version 6 July 2023. The associate editor coordinating the review of this article and approving it for publication was K. El-Khatib. (Corresponding author: Willian Tessaro Lunardi.)
The authors are with the Secure System Research Center, Technology and Innovation Institute, Abu Dhabi, UAE (e-mail: [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TNSM.2022.3229706
1932-4537 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Northeastern University. Downloaded on August 11, 2023 at 09:08:50 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

THE PROLIFERATION of IP-connected devices is skyrocketing and is predicted to surpass three times the world’s population by 2023 [1]. As the number of connected devices increases and 5G technologies become more ubiquitous and efficient, network traffic volume will follow suit. This accelerated growth raises overwhelming security concerns due to the exchange of vast amounts of sensitive information through resource-constrained devices and over untrusted heterogeneous technologies and communication protocols. Advanced security controls and analysis must be applied to maintain a sustainable, reliable, and secure cyberspace. IDSs play an essential role in network security, allowing for detecting and responding to potential intrusions and suspicious activities by monitoring network traffic. IDSs can be implemented as signature-based, anomaly-based, or hybrid. Signature-based IDSs detect intrusions by comparing monitored behaviors with pre-defined intrusion patterns, while anomaly-based IDSs focus on knowing normal behavior to identify any deviation [2].

The vast majority of existing network IDSs are based on the assumption that traffic signatures from known attacks can be gathered so that new traffic can be compared to these signatures for detection. Despite high detection capabilities for known attacks, signature-based approaches cannot detect novel attacks since they can only detect attacks for which a signature was previously created. Regular database maintenance cycles must be performed to add novel signatures for threats as they are discovered. Acquiring labeled malicious samples, however, can be extremely difficult or impossible. The definition of signature-based IDSs, or any other supervised approach for the task, becomes even more challenging when the known class imbalance problem of public network traffic datasets is considered. Network traffic datasets are known for being highly imbalanced towards examples of normality (non-anomalous/non-malicious) [3] while lacking in examples of abnormality (anomalous/malicious) and offering only partial coverage of all the possibilities that this latter class can encompass [4].

In contrast, anomaly-based IDSs rely on building a profile of the normal traffic. These systems attempt to estimate the expected behavior of the network to be protected and generate anomaly alerts whenever a divergence between a given observation and the known normality distribution exceeds a pre-defined threshold. Anomaly-based IDSs do not require a recurrent update of databases to detect novel attack variants, and their main drawback usually is the False Alarm Rate (FAR), as it is challenging to find the boundary between the normal and abnormal profiles. These approaches have gained popularity in recent years due to the explosion of attack variants [5], [6], which relates to their ability to detect previously unknown or zero-day threats. Additionally, they do not suffer from the dataset imbalance problem since they only require normal traffic during training.

Deep Learning (DL) has emerged as a game-changer to help automatically build network profiles through feature learning. It can effectively learn structured and complex non-linear traffic feature representations directly from the raw bytes of a large volume of normal data. Based on a well-represented traffic profile, it is expected that the system’s capabilities for isolating anomalies from the normal traffic will increase while

decreasing the FAR. However, the naive adoption of DL may lead to misleading design choices and the introduction of several drawbacks, such as slow detection and reaction time. In addition to carefully defining the model’s architecture, training artifices could be exploited to improve the method’s effectiveness without the loss of efficiency that comes with an increased number of parameters and model size.

In this paper, we propose ARCADE, an unsupervised DL approach for network anomaly detection that automatically builds a profile of the normal traffic (training exclusively on the normal traffic) using a subset of bytes of a few initial packets of network traffic flow as input data. It allows early attack detection, preventing any further damage to the network security while mitigating any unforeseen downtime and interruption. The network traffic can originate from real-time packet sniffing over a network interface card or from a .pcap file. The proposed approach combines two deep neural networks during training: (i) an AE trained to encode and decode (reconstruct) normal traffic; (ii) a critic trained to provide high score values for real normal traffic samples, and low score values for their reconstructions. An adversarial training strategy is established in which the critic’s knowledge regarding the normal traffic distribution is used to regularize the AE, decreasing its potential to reconstruct anomalies and addressing the known generalization problem [7], [8], [9], where (in some scenarios) anomalies are reconstructed as well as normal samples. During detection, the error between the input traffic sample and its reconstruction is used as an anomaly score, i.e., traffic samples with high reconstruction error are considered more likely to be anomalous. The significant contributions of this paper are summarized as follows:
• An unsupervised DL-based approach for early anomaly detection that automatically builds the network traffic profile based on the raw packet bytes of network flows of the normal traffic. It can detect (novel) network anomalies given a few initial packets of network flows, allowing it to prevent network attacks before they could cause further damage.
• A Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) adversarial training to regularize AEs, decrease their generalization capabilities towards out-of-the-normal distribution samples, and improve their anomaly detection capabilities.
• A compact convolutional AE model inspired by Deep Convolutional GAN (DCGAN) [10]. The model presents higher accuracy, faster reaction time, and 20 times fewer parameters than baselines.
• An extensive validation of ARCADE conducted on several network traffic datasets to assess its capabilities in detecting anomalous traffic of several types of malware and attacks.

The remainder of the paper is laid out as follows: Section II provides the necessary background for Generative Adversarial Networks (GANs). Section III reviews and discusses previous relevant works in the field of DL for anomaly detection and network anomaly detection. Section IV describes the proposed network flows preprocessing pipeline, model architecture, loss functions, and adversarial training strategy. Section V presents the ablation studies and experimental analysis and comparison of ARCADE’s effectiveness and complexity with respect to the considered baselines. Finally, Section VI concludes this paper.

II. BACKGROUND

A. Generative Adversarial Networks

The GANs [11] framework establishes a min-max adversarial game between a generative model $G$ and a discriminative model $D$. The discriminator $D(x)$ computes the probability that a point $x$ in data space is a sample from the data distribution rather than a sample from our generative model. The generator $G(z)$ maps samples $z$ from the prior $p(z)$ to the data space. $G(z)$ is trained to maximally confuse the discriminator into believing that the samples it generates come from the data distribution. The process is iterated, leading to the famous minimax game [11] between generator $G$ and discriminator $D$

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}\big[\log D(x)\big] + \mathbb{E}_{\tilde{x} \sim P_g}\big[\log\big(1 - D(\tilde{x})\big)\big], \tag{1}$$

where $P_r$ is the data distribution and $P_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, where $z \sim p(z)$ is the noise drawn from an arbitrary prior distribution.

Suppose the discriminator is trained to optimality before each generator parameter update. In that case, minimizing the value function amounts to minimizing the Jensen-Shannon divergence (JSD) between $P_r$ and $P_g$ [11], but doing so often leads to vanishing gradients as the discriminator saturates [12], [13].

B. Wasserstein Generative Adversarial Networks

To overcome the undesirable JSD behavior, Arjovsky et al. [12] proposed Wasserstein Generative Adversarial Networks (WGAN), which leverage the Wasserstein distance $W(q, p)$ to produce a value function that has better theoretical properties than the original. They modified the discriminator to emit an unconstrained real number (score) rather than a probability. In this context, the discriminator is now called a critic. The min-max WGAN training objective is given by

$$\min_G \max_C \; \mathbb{E}_{x \sim P_r}\big[C(x)\big] - \mathbb{E}_{\tilde{x} \sim P_g}\big[C(\tilde{x})\big]. \tag{2}$$

When the critic $C$ is Lipschitz smooth, this approach approximately minimizes the Wasserstein-1 distance $W(P_r, P_g)$. To enforce Lipschitz smoothness, the weights of $C$ are clipped to lie within a compact space $[-c, c]$. However, as described in [12], weight clipping is a terrible approach to enforcing the Lipschitz constraint.

Gulrajani et al. [13] proposed an alternative approach where a soft version of the constraint is enforced with a penalty on the gradient norm for random samples $\hat{x} \sim P_{\hat{x}}$. When considering the WGAN-GP proposed in [13], the critic’s loss is given by

$$\mathbb{E}_{x \sim P_r}\big[C(x)\big] - \mathbb{E}_{\tilde{x} \sim P_g}\big[C(\tilde{x})\big] + \lambda_C L_{GP}, \tag{3}$$

where $\lambda_C$ is the penalty coefficient, and

$$L_{GP} = \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[\big(\lVert \nabla_{\hat{x}} C(\hat{x}) \rVert_2 - 1\big)^2\big], \tag{4}$$

where $P_{\hat{x}}$ is the distribution defined by the following sampling process: $x \sim P_r$, $\tilde{x} \sim P_g$, $\alpha \sim U(0, 1)$, and $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$.
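As a rough illustration of the gradient penalty in Eq. (4), the Python sketch below estimates it for a toy critic whose input gradient is available in closed form. The toy critic, the helper names (`critic`, `critic_grad`, `gradient_penalty`), and the pairing of one real sample with one generated sample per draw are assumptions made for this example and are not taken from the paper; a practical implementation would presumably obtain the gradient from an automatic differentiation framework instead.

```python
import math
import random


def critic(x, w, v):
    """Toy critic (an assumption): C(x) = sum_j v[j] * tanh(w[j] . x)."""
    return sum(vj * math.tanh(sum(a * b for a, b in zip(wj, x)))
               for wj, vj in zip(w, v))


def critic_grad(x, w, v):
    """Closed-form gradient of the toy critic with respect to its input x."""
    grad = [0.0] * len(x)
    for wj, vj in zip(w, v):
        h = math.tanh(sum(a * b for a, b in zip(wj, x)))
        scale = vj * (1.0 - h * h)  # d tanh(a) / da = 1 - tanh(a)^2
        for i, wji in enumerate(wj):
            grad[i] += scale * wji
    return grad


def gradient_penalty(real, fake, w, v, rng=None):
    """Monte Carlo estimate of L_GP over paired real and generated samples."""
    rng = rng or random.Random()
    total = 0.0
    for x, x_tilde in zip(real, fake):
        alpha = rng.random()  # alpha ~ U(0, 1)
        # x_hat = alpha * x + (1 - alpha) * x_tilde
        x_hat = [alpha * a + (1.0 - alpha) * b for a, b in zip(x, x_tilde)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w, v)))
        total += (norm - 1.0) ** 2  # penalize deviation from unit gradient norm
    return total / len(real)
```

The penalty vanishes only where the critic's gradient norm equals one at the interpolated points, which is how the soft Lipschitz constraint enters the loss. For instance, with `w = [[1.0, 0.0]]`, `v = [1.0]`, and both samples at the origin, the gradient is `[1.0, 0.0]` and the estimate is zero.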

III. RELATED WORK

Herein, we discuss the relevant works employing DL for anomaly detection. We first present DL anomaly detection approaches that have emerged as leading methodologies in the field of image and video. Then, we comprehensively analyze these novel DL methods and their potential application to network anomaly detection. We categorize unsupervised anomaly detection methods into generative models or pretrained networks, introduced in Sections III-A and III-B, respectively. Finally, Section III-C presents the DL-related works for network traffic classification and baselines for unsupervised anomaly detection.

A. Generative-Based Anomaly Detection

Generative models, such as AEs [6], [23] and GANs [11], [12], [13], can generate samples from the manifold of the training data. Anomaly detection approaches using these models are based on the idea that anomalies cannot be generated since they do not exist in the training set.

AEs are neural networks that attempt to learn the identity function while having an intermediate representation of reduced dimension (or some sparsity regularization) serving as a bottleneck to induce the network to extract salient features from some dataset. These approaches aim to learn some low-dimensional feature representation space on which normal data instances can be well reconstructed. The heuristic for using these techniques in anomaly detection is that, since the model is trained only on normal data, normal instances are expected to be better reconstructed from the latent space than anomalies. Thus, the distance between the input data and its reconstruction can be used as an anomaly score. Although AEs have been successfully applied across many anomaly detection tasks, in some cases, they fail due to their strong generalization capabilities [7], i.e., sometimes anomalies can be reconstructed as well as normal samples. Bergmann et al. [23] show that AEs using the Structural Similarity Index Measure (SSIM) [24] can outperform complex architectures that rely on a per-pixel value ℓ2-loss. Gong et al. [8] tackle the generalization problem by employing memory modules, which can be seen as a discretized latent space. Zhai et al. [9] connect regularized AEs with energy-based models to model the data distribution and classify samples with high energy as anomalies.

GAN-based approaches assume that only positive samples can be generated. These approaches generally aim to learn a latent feature space of a generative network so that the latent space well captures the normality underlying the given data [25]. Some residual between the real and generated instances is then defined as an anomaly score. One of the early GAN-based methods for anomaly detection is AnoGAN [26]. The fundamental intuition is that, given any data instance x, it aims to search for an instance z in the learned latent feature space of the generative network G so that the corresponding generated instance G(z) and x are as similar as possible. Since the latent space is enforced to capture the underlying distribution of training data, anomalies are expected to be less likely to have highly similar generated counterparts than normal instances. One main issue with AnoGAN is the computational inefficiency, which can be addressed by adding an extra network that learns the mapping from data instances onto latent space, i.e., an inverse of the generator, resulting in methods like EBGAN [27]. Akcay et al. [28] proposed GANomaly, which further improves the generator over the previous works by changing the generator to an encoder-decoder-encoder network. The AE is trained to minimize a per-pixel value loss, whereas the second encoder is trained to reconstruct the latent codes produced by the first encoder. The latent reconstruction error is used as an anomaly score.

The idea behind AEs is straightforward and can be defined under different Artificial Neural Network (ANN) architectures. Several authors have already investigated the applicability of AEs for network anomaly detection [5], [6]. However, their naive adoption can lead to unsatisfactory performance due to their vulnerability to noise in the training data and their generalization capabilities. We propose an adversarial regularization strategy with a carefully designed and compact AE parameterized by a 1D-Convolutional Neural Network (CNN). The adversarial training is employed to deal with the aforementioned weaknesses of AEs. Similarly to GANomaly, our approach applies an adversarial penalty term to the AE to force it to produce normal-like samples. Therefore, we also consider the GANomaly framework as a baseline and compare it with the proposed ARCADE for network anomaly detection.

B. Pretrained-Based Anomaly Detection

Pretrained-based anomaly detection methods use models trained on large datasets, such as ImageNet, to extract features [29]. These pre-trained models produce separable semantic embeddings and, as a result, enable the detection of anomalies by using simple scoring methods such as k-Nearest Neighbor (k-NN) or Gaussian Mixture Models [30]. Surprisingly, the embeddings produced by these algorithms lead to good results even on datasets that are drastically different from the pretraining ones. Recently, [31] showed that using k-NN as a scoring method on the features extracted by a ResNet model pre-trained on ImageNet produces highly effective and general anomaly detection methods on images. That alone surpassed almost all unsupervised and self-supervised methods. In [32], it is shown that fine-tuning the model using either center loss or contrastive learning leads to even better results.

The application of those methods for network anomaly detection is challenging primarily due to the detection’s complexity related to the additional required scoring step. Even with a compact model, such as the one proposed in Section IV-B with 184k parameters, or the EfficientNet B0 with 5.3M parameters, the requirement for a post-processing scoring procedure makes them unsuitable for online detection. For example, after forwarding the sample through the model for feature extraction, computing the anomaly score for a given sample’s feature vector with k-NN as the scoring method (as proposed in [31]) implies O(nl) time complexity, where n is the number of training samples, and l is the length of the feature vectors. These techniques appear unexplored and may stand out for offline network anomaly detection.
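To make the O(nl) scoring cost concrete, the following sketch scores one feature vector against a bank of training feature vectors by brute force. The function name `knn_anomaly_score`, the Euclidean metric, the averaging of the k smallest distances, and the toy two-dimensional feature bank are assumptions for illustration; the scoring rule used in [31] may differ in these details.

```python
import heapq
import math


def knn_anomaly_score(query, train_features, k=2):
    """Mean Euclidean distance from `query` to its k nearest training vectors.

    Computing the n distances costs O(n * l) for n training vectors of
    length l, which is the per-sample cost discussed in the text.
    """
    if not 1 <= k <= len(train_features):
        raise ValueError("k must be between 1 and the number of training vectors")
    distances = (math.dist(query, feature) for feature in train_features)
    nearest = heapq.nsmallest(k, distances)  # O(n log k) selection on top of the scan
    return sum(nearest) / k


# Hypothetical 2-D feature bank extracted from normal samples.
bank = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
normal_score = knn_anomaly_score([0.0, 0.0], bank)     # query lies inside the bank
anomaly_score = knn_anomaly_score([10.0, 10.0], bank)  # query lies far from the bank
```

Because every query scans the whole feature bank, the per-sample cost grows with the size of the training set, which is the property that makes this scoring step costly for online detection.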


TABLE I: Deep Learning Related Works for Network Intrusion Detection. For Works That Used Raw Network Traffic as Input, When Specified, We Present the Number of Packets (n) and Bytes (l) Used as Input

TABLE II: ISCX-IDS Dataset

TABLE III: USTC-TFC Dataset

C. Deep Learning for Network Traffic Classification

Several works have studied DL network traffic classification under the supervised setting. A few works have also studied adversarial training strategies for network traffic classification based on hand-designed features. Nonetheless, feature learning-based unsupervised network anomaly detection with adversarial training appears currently unexplored. Table I summarizes our related works, which are categorized into: (i) Unsupervised anomaly detection (UD) when only normal traffic is considered at the training stage. (ii) Adversarial Training (AT) when GAN-based strategies are applied during training. (iii) Raw Traffic (RT) when the input is the raw network traffic. When RT is considered, the considered protocol layers, the number of initial bytes l, and the number of packets n are presented.

Most deep learning-based traffic classification and anomaly detection approaches rely on feature engineering. We highlight a few works in which adversarial training or unsupervised anomaly detection was addressed and which rely on hand-designed features. Vu et al. [14] proposed the use of a GAN for dealing with the imbalanced data problem in network traffic classification. The synthetic samples are generated to augment the training dataset. The sample generation, as well as the classification, is done based on 22 statistical features extracted from network flows. Truong-Huu et al. [5] studied the capability of a GAN for unsupervised network anomaly detection where 39 hand-designed features extracted from traffic flows and sessions are used as input. Results show that their proposed approach managed to obtain better results when compared to the autoencoder without any enhanced adversarial training. Doriguzzi-Corin et al. [15] proposed a spatial representation that enables a convolutional neural network to learn the correlation between 11 packet features to detect Distributed Denial of Service (DDoS) traffic.


Fig. 1. Visualization of network flows from four distinct traffic classes of the USTC-TFC dataset. In this instance, 784 initial bytes of nine network flows (of four traffic classes) were reshaped into 28 × 28 grayscale images. (a) FTP. (b) Geodo. (c) Htbot. (d) World of Warcraft.

Network traffic feature learning is predominantly performed through ANN architectures like 1D-CNN, 2D-CNN, and Long Short-Term Memory (LSTM). Extracted bytes from network traffic flows (or packets) are kept sequential for the 1D-CNN and LSTM case, whereas for the 2D-CNNs, extracted bytes are seen as pixels of grayscale images, as illustrated in Figure 1. Wang et al. [16] proposed an approach that relies on the advantages of both 2D-CNNs and LSTMs to extract spatial-temporal features of network traffic. Results show that accuracy is improved when both architectures are combined. Wang et al. [17] proposed a supervised DL approach for malware traffic classification that uses 2D-CNNs to extract spatial features from headers and payloads of network flows and sessions. Two different choices of raw traffic images (named “ALL” and “L7”), dependent on the protocol layers considered to extract the input data, are used to feed the classifier, showing that sessions with “ALL” are the most informative and reach high performance for all the metrics considered. Yu et al. [18] proposed a self-supervised learning 2D-CNN Stacked Autoencoder (SAE) for feature extraction, which is evaluated through different classification tasks with malware traffic data. Wang et al. [19] have shown that 1D-CNN outperforms 2D-CNN for encrypted traffic classification. Aceto et al. [20] performed an in-depth comparison on the application of Multilayer Perceptron (MLP), 1D-CNN, 2D-CNN, and LSTM architectures for encrypted mobile traffic classification. Numerical results indicated that 1D-CNN is a more appropriate choice for network traffic classification since it can better capture spatial dependencies between adjacent bytes in the network packets due to the nature of the input data that is, by definition, one-dimensional. Lotfollahi et al. [21] used 1D-CNN to automatically extract network traffic features and identify encrypted traffic to distinguish Virtual Private Network (VPN) and non-VPN traffic. Ahmad et al. [22] employed a 1D-CNN-based classifier for early detection of network attacks. It is shown that a high degree of accuracy can be achieved by analyzing 1 to 3 packets.

The works mentioned above perform the task of traffic classification or anomaly detection based on labeled datasets. Recently, [6] proposed an “unsupervised” approach for anomaly detection, so-called D-PACK, in which only normal traffic is used during training. The model architecture is composed of a 1D-CNN that performs feature extraction, followed by an MLP softmax classifier given a labeled dataset of normal traffic, i.e., they assume the normal traffic is labeled into multiple classes (that is the reason why its respective UD bullet is partially filled in Table I). The extracted features from an intermediate layer of the MLP are used as the input for an MLP-based AE. The anomaly score is based on an ℓ2-distance between the extracted features and the AE reconstruction. Results indicate that normal and malware traffic, such as the Mirai Botnet, can be effectively separated even when only two packets are used for detection. We implemented and included D-PACK in our experiments as a baseline model.

IV. METHODOLOGY

In this section, we present our proposed approach, ARCADE. The network flow preprocessing procedure is presented in Section IV-A. The model’s architecture is presented in Section IV-B. The AE distance metrics and adversarial training are presented in Section IV-C and Section IV-D, respectively. Finally, the anomaly score calculation is presented in Section IV-E.

A. Network Traffic Flow Preprocessing

Network traffic classification or anomaly detection can be performed at different granularity units, e.g., packet, flow, and session. It is worth noting that most of the works shown in Table I considered either flows or sessions as the relevant classification objects. A network flow is a unidirectional sequence of packets with the same 5-tuple (source IP, source port, destination IP, destination port, and transport-level protocol) exchanged between two endpoints. A session is defined as a bidirectional flow, including both directions of traffic. We extend the aforementioned flow definition by considering that a network flow is to be terminated or inactivated when the flow has not received a new packet within a specific flow timeout (e.g., 120 seconds). When the underlying network protocol is TCP, we consider the network connection closed (and the corresponding flow completed) upon detecting the first flow packet containing a FIN flag. Note that, in the case of TCP sessions, a network connection is considered closed when both sides have sent a FIN packet. Upon the termination of a network flow, unprocessed buffered packets should be discarded.

It is well known that the initial packets of each network flow contain the most information that allows for the discrimination between normal and abnormal activities [6], [20], [22]. This is the fundamental concept behind early detection approaches, which conduct the detection given a small number of initial packets of a flow. The smaller the number of packets required as input for the anomaly detection procedure, the lower the reaction time and overhead imposed by the DL method. Instead of analyzing every packet of a network flow on a time window, we use the n initial packets of a network flow as input. In this sense, n denotes the exact number of initial packets of a network flow required to form the input for ARCADE. For each active flow, n packets are buffered and trimmed into a fixed length of 100 bytes, starting with the header fields, i.e., packets are truncated to 100 bytes if larger, otherwise, padded with zeros. Packets are cleaned such that MAC and IP addresses are anonymized. Finally, bytes are normalized in [0, 1] and packets concatenated into the


decoding process can be summarized as


 
x̃ = D E (x) = G(x), (5)
where x̃ is the reconstruction of the input. The encoder
uses strided convolutions to down-sample the input, fol-
lowed by batch normalization and Leaky Rectified Linear Unit
(Leaky ReLU). Differently from a deterministic pooling oper-
Fig. 2. An illustration of the proposed network traffic preprocessing pipeline
with n = 2. Packets with the same color represent network flows. Traffic can ation, strided convolutions allow the model to learn its own
be originated from a real-time packet sniffing or a .pcap file. Packets are downsampling/upsampling strategy. The decoder uses strided
filtered according to their 5-tuple, and n initial packets are buffered. MAC and transpose convolutions to up-sample the latent space, followed
IP addresses are masked, and according to their length, packets are truncated
(if larger than l) or padded with zeros (if smaller than l). Finally, bytes are by Rectified Linear Unit (ReLU) and batch normalization. The
normalized, and packets are concatenated. Traffic preemption is not required. critic function C : Rw → R, whose objective is to pro-
vide a score to the input x and the reconstruction x̃, has
a similar architecture to the encoder E. It also uses strided
convolutions to down-sample the input and Leaky ReLU; how-
ever, following [13], we use layer normalization instead of
batch normalization. The number of layers and filter size were
defined so that ARCADE could effectively detect anoma-
lies and still provide quick reaction time. Table X precisely
presents the proposed model architecture.

Fig. 3. An illustration of the proposed model architecture. Note that ARCADE is parametrized by 1D CNNs, as described in Section IV-B.

final input form, i.e., a sample x can be denoted as $x \in \mathbb{R}^w$ where w = 100n is the sequence length. Figure 2 illustrates the essential steps of the proposed network traffic flow preprocessing pipeline. Note that traffic preemption is not required. However, it is crucial to consider the reaction time, which relates to the processing power capabilities of the device on which ARCADE will run. The analysis of the complexity and detection speed of ARCADE given different devices is provided in Section V-E.

B. Model Architecture
Several recent papers focus on improving training stability and the resulting quality of GAN samples [10], [12], [13]. Our proposed model is inspired by DCGAN [10], which introduces a convolutional generator network by removing fully connected layers and using convolutional layers and batch normalization throughout the network. Strided convolutions replace pooling layers. This results in a more robust model with higher sample quality while reducing the model size and number of parameters to learn. Our proposed architecture, shown in Figure 3, consists of two main components: (i) the AE (which can be seen as the generator) composed of an encoder E and a decoder D, and (ii) the critic C. Given the findings in [20], our functions E, D and C are parameterized by 1D-CNNs. Note that ARCADE can be easily modified to be used as an anomaly detection method for other anomaly detection tasks, such as image or time series anomaly detection. Moreover, the proposed adversarial regularization strategy can be applied to any AE, independent of its ANN architecture.
The AE consists of an encoder function $E : \mathbb{R}^w \to \mathbb{R}^d$ and a decoder function $D : \mathbb{R}^d \to \mathbb{R}^w$, where d denotes the dimensionality of the latent space. The overall encoding and

C. Autoencoder Distance Metric
The core idea behind ARCADE is that the model must learn the normal traffic distribution to reconstruct it correctly. The hypothesis is that the model is conversely expected to fail to reconstruct attacks and malware traffic as it is never trained on such abnormal situations. A loss function must be defined to train an AE to reconstruct its input. For simplicity, a per-value L2 loss is typically used between the input x and reconstruction x̃, and can be expressed as
$$L_2(x, \tilde{x}) = \sum_{i=1}^{w} (x_i - \tilde{x}_i)^2, \qquad (6)$$
where $x_i$ is the i-th value in the sequence. During the evaluation phase, the per-value ℓ2-distance of x and x̃ is computed to obtain the residual map.
As demonstrated by [23], AEs that make use of L2 loss may fail in some scenarios to detect structural differences between the input and its reconstruction. Adapting the loss and evaluation functions to the SSIM [24], which captures local interdependencies between the input and reconstruction regions, can improve the AE's anomaly detection capabilities. This is also verified in this work, as demonstrated in Section V. The SSIM index defines a distance measure between two K × K patches p and q, given by
$$\mathrm{SSIM}(p, q) = \frac{(2\mu_p \mu_q + c_1)(2\sigma_{pq} + c_2)}{(\mu_p^2 + \mu_q^2 + c_1)(\sigma_p^2 + \sigma_q^2 + c_2)}, \qquad (7)$$
where $\mu_p$ and $\mu_q$ are the patches' mean intensity, $\sigma_p^2$ and $\sigma_q^2$ are the variances, and $\sigma_{pq}$ the covariance. The constants $c_1$ and $c_2$ ensure numerical stability and are typically set to $c_1 = 0.01$ and $c_2 = 0.03$.
The SSIM is commonly used to measure the similarity between images, performed by sliding a K × K window that moves pixel-by-pixel. Since in our case x is a sequence, we split it into n subsequences of length l, i.e., each subsequence
$$x_i = \{ x_j \in [0, 1] : j \in \{1 + (i - 1)l, \ldots, il\} \},$$


where i ∈ {1, 2, . . . , n} and l = 100, can be seen as the subset of 100 bytes of the i-th packet that was originally used to compose the sequence x. Finally, subsequences are reshaped $x_i \in \mathbb{R}^{l} \to x_i \in \mathbb{R}^{K \times K}$, where $K = \sqrt{l}$ and l is a perfect square number. An illustration of this procedure is shown in Figure 4. The mean SSIM gives the overall structural similarity measure of the sequence (MSSIM), defined as
$$\mathrm{MSSIM}(x, \tilde{x}) = \frac{1}{nM} \sum_{i=1}^{n} \sum_{j=1}^{M} \mathrm{SSIM}\big(x_i(j), \tilde{x}_i(j)\big), \qquad (8)$$
where M is the number of local windows, and $x_i(j)$ and $\tilde{x}_i(j)$ are the contents at the j-th local window of the i-th subsequences $x_i$ and $\tilde{x}_i$.

D. Adversarial Training
We address the generalization problem by regularizing the AE through adversarial training. In addition to maximizing MSSIM, we further maximize the reconstruction scores provided by the critic C. By doing so, besides generating contextually similar reconstructions, the AE must reconstruct normal-like samples as faithfully as possible so the scores given by the critic C are maximized. During training, the AE is optimized to maximize
$$L_G = \mathbb{E}_{x \sim P_r}\big[\mathrm{MSSIM}(x, \tilde{x}) + \lambda_G C(\tilde{x})\big], \qquad (9)$$
where $\lambda_G$ is the regularization coefficient that balances the terms of the AE's objective function.
In Equation (9), it is assumed that critic C can provide high scores for real normal traffic samples and low scores for reconstructions. To do so, the critic C must learn the normal and reconstruction data distributions. Therefore, during training, the critic C is optimized to maximize
$$L_C = \mathbb{E}_{x \sim P_r}\big[C(x) - C(\tilde{x})\big] + \lambda_C L_{GP}, \qquad (10)$$
where $L_{GP}$ is given by Equation (4), and $\lambda_C = 10$ as suggested in [13]. Our adversarial training strategy is based on the WGAN-GP framework described in Section II-B. Algorithm 1 summarizes the essential steps of the proposed adversarial training.

Algorithm 1 Proposed Adversarial Training. We Use m = 64, λC = 10, λG = 100, α = 1e-4, β1 = 0, and β2 = 0.9
Require: Batch size m, maximum training iterations max_epoch, penalty coefficients λC and λG, Adam hyperparameters α, β1, β2, critic and autoencoder initial parameters ψ0 and θ0, respectively.
1: while current epoch is smaller than max_epoch do
2:     Sample a batch of normal traffic samples $\{x^{(i)}\}_{i=1}^{m} \sim P_r$
3:     $\tilde{x} \leftarrow G_\theta(x)$
4:     for $i \leftarrow 1$ to $m$ do
5:         Sample a random number $\epsilon \sim U(0, 1)$
6:         $\hat{x} \leftarrow \epsilon x^{(i)} + (1 - \epsilon)\tilde{x}^{(i)}$
7:         $L_C^{(i)} \leftarrow C_\psi(x^{(i)}) - C_\psi(\tilde{x}^{(i)}) + \lambda_C (\lVert \nabla_{\hat{x}} C_\psi(\hat{x}) \rVert_2 - 1)^2$
8:     $\psi \leftarrow \mathrm{Adam}(\nabla_\psi \frac{1}{m} \sum_{i=1}^{m} L_C^{(i)}, \psi, \alpha, \beta_1, \beta_2)$
9:     $L_G \leftarrow \mathrm{MSSIM}(x, \tilde{x}) + \lambda_G C_\psi(\tilde{x})$
10:    $\theta \leftarrow \mathrm{Adam}(\nabla_\theta \frac{1}{m} \sum_{i=1}^{m} -L_G^{(i)}, \theta, \alpha, \beta_1, \beta_2)$

E. Anomaly Score
An anomaly score A(x) is a function that provides a score to a sample x in the test set concerning samples in the training set. Samples with more significant anomaly scores are considered more likely to be anomalous. Traditionally, AE strategies for anomaly detection rely on the reconstruction error between the input and its reconstruction, which is used as an anomaly score. Another widely adopted anomaly score is the feature matching error based on an intermediate layer of the discriminator [5], [26].
Experiments with the feature matching error as an anomaly score did not significantly improve ARCADE's performance. At the same time, it considerably increased the inference time since it is required to feed x and x̃ through C for feature extraction. Similarly, we found that using MSSIM as an anomaly score leads to a more discriminative anomaly function when compared to L2. However, the gains in effectiveness are not meaningful enough to justify the loss in efficiency due to the SSIM's complexity. Therefore, for a given sample x in the test set, its anomaly score computed using ARCADE is denoted as $A(x) = L_2(x, \tilde{x})$.

V. EXPERIMENTAL EVALUATION
The present section investigates and compares the performance of ARCADE with baselines on three network traffic datasets. The considered datasets and baselines are described in Section V-A and Section V-B, respectively. Implementation, training details, and hyper-parameter tuning are described in Section V-C. In Section V-D, we assess the effectiveness of ARCADE and baselines on the three considered datasets. Section V-E presents the analysis of the complexity and the detection speed of ARCADE and the D-PACK baseline.

A. Datasets Description
We used three datasets to evaluate the performance of the proposed approach with real-world normal and malicious network traffic: ISCX-IDS [33], USTC-TFC [17], and MIRAI-RGU [34]. The choice of datasets is based on the requirement for raw network traffic. The selected datasets are among the most well-known datasets for intrusion detection, which provide raw network traffic (.pcap) in addition to hand-designed features (.csv). For example, the KDD'99 and NSL-KDD datasets provide only hand-designed extracted features, which limits their use in this work. It is worth noting that the number of flows presented in Tables II, III, and IV described

Fig. 4. An illustration of the advantages of the SSIM over L2 for the segmentation of the discrepancies between a subset of bytes of a packet and their respective reconstructions.
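Equations (6) to (8) and the anomaly score A(x) of Section IV-E can be traced with a small pure-Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the window is uniform rather than the Gaussian kernel of the PIQ implementation used in the paper, variances are population variances, and the 3 × 3 window size is taken from the grid-search result reported in Section V-B. Each 100-byte subsequence is reshaped row by row into a 10 × 10 grid before the window slides over it.

```python
from typing import List, Sequence

C1, C2 = 0.01, 0.03  # stabilizing constants quoted in the text


def ssim(p: Sequence[float], q: Sequence[float]) -> float:
    """SSIM of two flattened patches, following Equation (7)."""
    m = len(p)
    mu_p, mu_q = sum(p) / m, sum(q) / m
    var_p = sum((a - mu_p) ** 2 for a in p) / m
    var_q = sum((b - mu_q) ** 2 for b in q) / m
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / m
    num = (2 * mu_p * mu_q + C1) * (2 * cov + C2)
    den = (mu_p ** 2 + mu_q ** 2 + C1) * (var_p + var_q + C2)
    return num / den


def windows(sub: Sequence[float], side: int, k: int) -> List[List[float]]:
    """All k x k windows (stride 1) of a row-major side x side grid."""
    return [[sub[(r + dr) * side + (c + dc)] for dr in range(k) for dc in range(k)]
            for r in range(side - k + 1) for c in range(side - k + 1)]


def mssim(x: Sequence[float], x_rec: Sequence[float], l: int = 100, k: int = 3) -> float:
    """Mean SSIM over all windows of all subsequences, following Equation (8)."""
    side = int(round(l ** 0.5))
    if side * side != l or len(x) % l != 0 or len(x) != len(x_rec):
        raise ValueError("l must be a perfect square and divide the sequence length")
    total, count = 0.0, 0
    for start in range(0, len(x), l):  # one subsequence per packet
        for p, q in zip(windows(x[start:start + l], side, k),
                        windows(x_rec[start:start + l], side, k)):
            total += ssim(p, q)
            count += 1
    return total / count


def l2_score(x: Sequence[float], x_rec: Sequence[float]) -> float:
    """Anomaly score A(x) = L2(x, x_rec), following Equation (6)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))
```

In this sketch, a perfect reconstruction gives an MSSIM of 1 and an L2 score of 0, and any deviation lowers the former and raises the latter. The nested window loop also shows why MSSIM costs more than L2 at inference time, which is the trade-off behind scoring with L2.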


below are the number of flows achieved after the preprocessing procedure proposed in Section IV-A.
The ISCX-IDS dataset [33] is a realistic-like dataset originally proposed for the development of enhanced intrusion detection and anomaly-based approaches. The network traffic was collected for seven days. Packets collected on the first and sixth days are normal traffic. Normal and attack packets are collected on the second and third days. In the fourth, fifth, and seventh days, besides the normal traffic, HTTP DoS, DDoS using an IRC Botnet, and Brute Force (BF) SSH packets are collected, respectively. Table II provides an overview of the ISCX-IDS dataset. The USTC-TFC dataset [17] includes ten classes of normal traffic and ten classes of malware traffic from public websites, which were collected from a real network environment from 2011 to 2015. Table III provides an overview of the USTC-TFC dataset. The MIRAI-RGU dataset includes normal traffic from Internet of Things (IoT) Internet Protocol (IP) cameras and ten classes of malicious traffic from the Mirai botnet malware, such as HTTP flood, UDP flood, DNS flood, Mirai infection traffic, VSE flood, GREIP flood, GREETH flood, TCP ACK flood, TCP SYN flood, and UDPPLAIN flood. Table IV provides an overview of the MIRAI-RGU dataset.

TABLE IV
MIRAI-RGU DATASET

We split each dataset into training, validation, and test sets. The training set is composed only of normal samples. Normal and anomaly samples are used only for testing and validation. We balance the test set such that each subset of classes in the test set presents the same number of samples. Note that the normal traffic in the test set is not a subset of the training set. The validation set is composed of 5% of the samples of each class from the test set, randomly selected and removed for validation purposes.

B. Competing Methods
We compare ARCADE to three shallow and four deep learning methods for anomaly detection. The chosen shallow baselines are well-known methods typically applied to anomaly detection problems and commonly used as a benchmark for novel anomaly detection methods. D-PACK is the state-of-the-art unsupervised network anomaly detection method using raw network traffic bytes as input. We implemented and used D-PACK's effectiveness and efficiency as a baseline for ARCADE. GANomaly was chosen due to the similarities in the adversarial regularization strategies. Comparing ARCADE with GANomaly allows us to assess the effectiveness of our proposed adversarial strategy. Moreover, comparing ARCADE with AE-ℓ2 and AE-SSIM allows us to confirm the findings presented by [23] and also assess the effectiveness of the proposed adversarial regularization since ARCADE with λG = 0 is equivalent to AE-SSIM. We also implemented and performed experiments with probabilistic models such as Variational Autoencoder (VAE) and Adversarial Autoencoder (AAE); however, they did not produce satisfactory results when compared to deterministic AEs. Therefore, their results are not reported. Below we describe each competing method and its respective parameters.
1) Shallow Baselines: (i) One-Class SVM (OC-SVM) [35] with Gaussian kernel. We optimize the hyperparameters γ and ν via grid search using the validation set with $\gamma \in \{2^{-10}, 2^{-9}, \ldots, 2^{0}\}$, and ν ∈ {0.01, 0.02, . . . , 0.1}. (ii) Kernel density estimation (KDE). We optimize the bandwidth h of the Gaussian kernel via grid search, given ten values spaced evenly between −1 and 1 on a logarithmic scale. (iii) Isolation Forest (IF) [36]. As recommended in the original work, we set the number of trees to 100 and the subsampling size to 256. For all three shallow baselines, we reduce the dimensionality of the data via Principal Component Analysis (PCA), where we choose the minimum number of eigenvectors such that at least 95% of the variance is retained.
2) Deep Baselines: (i) D-PACK [6], recently proposed for unsupervised network anomaly detection, can be considered the state-of-the-art DL method for the task. D-PACK's performance serves as a point of comparison for ARCADE's effectiveness and efficiency. The original D-PACK formulation assumes that normal traffic is split into multiple classes, as in the USTC-TFC dataset. However, this is not the case for most public datasets, such as the other two datasets considered here. We empirically assessed that removing the softmax classifier degrades the method's efficiency. Therefore, we keep the original D-PACK formulation even for datasets without labeled normal training data. The network architecture, training strategy, and hyperparameters were kept as recommended in the original work. (ii) GANomaly [28] was originally proposed for image anomaly detection. Here, we do not employ it as an out-of-the-box anomaly detection approach. Instead, we use its adversarial training framework with the proposed 1D-CNN model architecture presented in Section IV-B. The idea is to compare GANomaly's training strategy with our proposed adversarial training strategy. Note that GANomaly defines the generator G as an encoder-decoder-encoder. Therefore, a second encoder E′ with the same architecture as E (without sharing parameters) is added to the proposed AE, where the input of E′ is the outcome of the decoder D, i.e., the input for encoder E′ is the reconstruction of the input. Finally, we modify the critic C to align with their proposed discriminator D. We modify C so that batch normalization is used instead of layer normalization, and a Sigmoid activation function is added after the last layer. The anomaly score is given by the ℓ2-distance between the latent space of E and the latent space of E′. We performed grid search to


optimize $w_{rec}$ ∈ {50, 75, 100, 125, 150} and results suggest that $w_{rec}$ = 75 leads to the best results. All the other parameters were kept as suggested in the original work. (iii) AE-ℓ2 is an AE with the same proposed network architecture in Section IV-B, where L2 loss is used as distance metric during training, and L2 is also used for the anomaly score computation. (iv) AE-SSIM is an AE with the same proposed network architecture in Section IV-B, where MSSIM loss is used for training, and L2 is used for computing the anomaly scores. In this work, we used the PyTorch Image Quality (PIQ) [37] implementation of the SSIM loss with Gaussian kernel and kernel size K = 3, obtained through a grid search optimization with K ∈ {3, 5, 7, 9}.

C. Training Recipe and Ablation Study
The training objective (described in Section IV-D) is optimized via Adam optimizer [38] with α = 1e−4, β1 = 0, and β2 = 0.9. It is worth noting again that Algorithm 1 describes the main steps of the proposed adversarial training procedure. Additionally, we employ for all approaches a two-phase (“searching” and “fine-tuning”) learning rate schedule. In the searching phase, we train with the learning rate 1e−4 for 100 epochs. In the fine-tuning phase, we train with the learning rate 1e−5 for another 50 epochs. The latent size d is computed with PCA, equivalent to the minimum number of eigenvectors such that the sum of their explained variance is at least 95%, i.e., d ≈ 50 with n = 2 for all three datasets.
Given the hyperparameters and training recipe above, we performed ablation experiments to assess the performance of ARCADE with and without the proposed adversarial regularization and varying input sizes. The validation set was used for the ablation experiments. To assess the effectiveness of the proposed adversarial regularization, we performed experiments with the ISCX-IDS dataset with n = 5, and λG ∈ {0, 0.001, 0.01, 0.02, 0.03}. Figure 5 illustrates the mean AUROC convergence (lines) and standard deviation (error bars amplified 50 times for visualization purposes). We can verify that the proposed adversarial-based regularization improves the capabilities of the AE for network anomaly detection concerning the same AE without the adversarial regularization, i.e., with λG = 0 ARCADE is equivalent to AE-SSIM. The proposed adversarial training strategy can be exploited to improve the network anomaly detection capabilities of similar DL approaches, especially for scenarios where increasing the model size is not an option due to hardware constraints. Based on these results, we fix the adversarial regularization coefficient to λG = 0.01 for all the following experiments. We also analyze the ARCADE performance given different input sizes. Table V presents the mean AUROC and standard deviations on the three datasets with n ∈ {2, 3, 4, 5, 6}. ARCADE achieves nearly 100 AUROC with n = 2 on the USTC-TFC and MIRAI-RGU datasets. For the ISCX-IDS dataset, the method achieves 86.7 and 99.1 AUROC with 2 and 4 packets, respectively. The following experiments further investigate the considerable difference in performance given varying values of n. For the MIRAI-RGU dataset, the AUROC decreases with n > 5. Scaling the model depth and width given the input size could help since, for larger input sizes, more layers and channels would lead to an increased receptive field and more fine-grained patterns. Note that additional ablation experiments concerning the ℓ2 and SSIM loss functions are provided in the following section.

Fig. 5. ARCADE's mean AUROC (%) convergence given varying values for the adversarial regularization coefficient λG. In the case where λG = 0, results are equivalent to the AE-SSIM.

TABLE V
ARCADE'S MEAN AUROC (%) ON THE THREE CONSIDERED DATASETS GIVEN VARYING INPUT SIZES. RESULTS ARE IN THE FORMAT mean ± std. OBTAINED OVER 10-FOLDS

TABLE VI
AUROC (%) OF ARCADE AND SHALLOW BASELINES. RESULTS ARE IN THE FORMAT mean ± std. OBTAINED OVER 10-FOLDS. WE PRESENT RESULTS FOR THE ISCX-IDS WITH n ∈ {2, 5}, DENOTED AS ISCX-IDSn

D. Network Traffic Anomaly Detection Results
We now systematically compare the proposed ARCADE's effectiveness with the baselines. Table VI presents the results of the considered shallow baselines on the three network traffic datasets. ARCADE outperforms all of its shallow competitors. Table VII presents the results of ARCADE and considered deep baselines. Here, we expand the evaluations to include a one-class anomaly detection setting, where each anomaly class is evaluated separately. Therefore, the table also includes the AUROC and F1-score concerning the evaluation performed exclusively on each anomaly class presented in each dataset. Note that the anomaly samples used for this evaluation are not necessarily a subset of the test set and were fixed for all methods. This allows each method to be evaluated separately against each attack or malware in each dataset.
The results for the deep baselines, considering normal and all anomalies, show that ARCADE outperforms all other methods on the three considered datasets. The methods rank ARCADE, AE-SSIM, AE-ℓ2, GANomaly, and D-PACK for


results on the ISCX-IDS with n = 2, USTC-TFC with n = 2, and MIRAI-RGU with n = 2. In experiments with the ISCX-IDS with n = 5, the methods rank ARCADE, GANomaly, AE-SSIM, AE-ℓ2, and D-PACK. Despite having approximately 20 times more parameters than the proposed model, D-PACK achieved the worst results among the deep baselines. Results for the AE-SSIM and AE-ℓ2, similarly to the results provided in [23], show that using SSIM as a distance metric during training can improve the AE's capabilities in detecting anomalies. ARCADE, which also uses SSIM as a distance metric during training and employs the proposed adversarial regularization strategy, achieved better results than AE-SSIM, emphasizing the advantages of the proposed adversarial training strategy. The GANomaly framework, comprised of its distinct model architecture, adversarial training strategy, and anomaly score, did not achieve better results than ARCADE. It is worth noting that GANomaly used the same AE architecture as ARCADE with the requirement of an additional encoder, as described in Section V-B2.

TABLE VII
AUROC AND F1-SCORE (%) OF ARCADE AND DEEP BASELINES. EACH METHOD WAS TRAINED EXCLUSIVELY ON NORMAL NETWORK TRAFFIC, AND THE RESULTS ARE IN THE FORMAT MEAN (± STD.) OBTAINED OVER 10-FOLDS. FOR THE ISCX-IDS, WE RUN TWO EXPERIMENTS WITH n ∈ {2, 5}

The isolated validations for the ISCX-IDS with n = 2 show that ARCADE achieved the best F1-score values for all classes and best AUROC values for Infiltration, DDoS, and BF SSH, where D-PACK achieved the best AUROC for HTTP DoS. With n = 5, ARCADE achieved the best results for Infiltration and HTTP DoS, where D-PACK achieved the best results for DDoS, and GANomaly achieved the best results for BF SSH. The low performance of the considered methods on DoS and DDoS with n = 2 indicates that analyzing the spatial relation between bytes among multiple subsequent packets is essential to detect such attacks. A single packet of a flood attack does not characterize an anomaly; e.g., an SYN packet can be found within the normal traffic. However, multiple SYN


packets in sequence can be characterized as an SYN flood attack. This indicates that the spatial relation between bytes among subsequent packets is essential to detect such attacks. In isolated experiments with anomaly classes from the USTC-TFC dataset, ARCADE achieved maximum results with 100 AUROC and 100 F1-score in all malware classes. Results from the isolated experiments with anomaly classes from the MIRAI-RGU show that, if we consider D-PACK, AE-ℓ2, AE-SSIM, and GANomaly, there is no clear winner. ARCADE achieved the best AUROC and F1-score values on 8 and 6 classes, respectively. GANomaly ranked second with four best AUROC and three best F1-score values.
In practice, a threshold value must be set to distinguish between normal and anomalous traffic based on the anomaly score distribution of the normal traffic. In a supervised scenario where the anomaly score distributions of the normal traffic and the known anomalies do not overlap, the maximum anomaly score of the normal traffic can lead to 100% Detection Rate (DR) and 0% FAR. This is commonly adopted since it leads to small FAR. To avoid the impact of extreme maximum anomaly scores, the 99th percentile of the anomaly score distribution of the normal traffic can be used as an alternative. The downside of this approach is that approximately 1% FAR is expected. Regardless, the definition of the threshold is problem-dependent and is strongly related to IDS architecture altogether, e.g., in a hybrid IDS (anomaly-based and signature-based), where the anomaly-based method is used as a filter to avoid unnecessary signature verification, a high threshold could lead to low detection rates. In this case, a lower threshold, such as the 99th percentile (or even smaller), would be preferable since the signature-based approach would further validate false positives.
We further compare ARCADE and D-PACK considering accuracy, precision, recall, and F1-score given two thresholds: (i) the 99th percentile, and (ii) the maximum value of the normal traffic anomaly scores. This comparison aims to analyze the effectiveness of ARCADE compared to the D-PACK baseline, originally proposed for network anomaly detection. The other deep baselines use the same model architecture as ARCADE and can be seen as contributions to this work we implemented. Table VIII presents the accuracy, precision, recall, and F1-score of ARCADE and D-PACK given both thresholds, with n = 2 for the USTC-TFC and MIRAI-RGU datasets, and n = 5 for the ISCX-IDS dataset. The results of the 99th percentile threshold show that ARCADE achieved the highest recall rate for the USTC-TFC and MIRAI-RGU datasets. This is because ARCADE produced no false negatives. ARCADE achieved an 11.79% higher F1-score than D-PACK. When the maximum threshold is used, the ARCADE enhancement in performance is more clearly seen. As expected, both approaches were able to achieve the highest precision. However, D-PACK only achieved 8.69% mean recall, while ARCADE achieved 64.54%. This is an improvement of 642.69%. Figure 6 shows the anomaly score distribution of ARCADE and D-PACK computed using the model parameters that led to the best AUROC obtained over 10-folds on the three datasets. The detection rate is reported and calculated using the 99th percentile threshold, also presented in the figures. When considering the best model parameters and a 99th percentile threshold, ARCADE outperformed D-PACK in terms of detection rates by 22.35%, 3.44%, and 0.14% on the ISCX-IDS, USTC-TFC, and MIRAI-RGU datasets, respectively.

E. Model Complexity and Detection Speed
Here we evaluate the efficiency of ARCADE by comparing with D-PACK the number of samples processed per second, model sizes, and floating-point operations (FLOPS). Figure 7 presents ARCADE and D-PACK efficiency, effectiveness, and model size on the ISCX-IDS with n = 2 and n = 5, where ARCADE significantly outperforms D-PACK in all evaluated measures. We analyze the detection speed performance of ARCADE and D-PACK by assessing how many samples per second they can process in different environments with distinct processing capabilities that we categorize as edge, fog, and cloud. The device specifications and the experimental environment are summarized in Table XI. We consider a Raspberry Pi 4B as an edge device, UP Xtreme and Jetson Xavier NX as fog devices, and a desktop personal computer with an AMD Ryzen Threadripper 3970X 32-core CPU,

TABLE VIII
THE ACCURACY, PRECISION, RECALL, AND F1-SCORE VALUES IN % OF ARCADE AND D-PACK FOR THE 99TH PERCENTILE AND MAXIMUM THRESHOLDS. RESULTS ARE IN THE FORMAT OF THE MEAN (± STD.) OBTAINED OVER TEN SEEDS
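The two thresholds compared in Table VIII (the 99th percentile and the maximum of the normal-traffic anomaly scores) can be illustrated with a short sketch. This is not the authors' evaluation code: the function names are hypothetical, the percentile uses the nearest-rank definition (one of several common definitions), and a flow is assumed to be flagged as anomalous when its score is strictly greater than the threshold.

```python
import math
from typing import Sequence, Tuple


def percentile_threshold(normal_scores: Sequence[float], q: float = 99.0) -> float:
    """Nearest-rank q-th percentile of the normal-traffic anomaly scores."""
    ordered = sorted(normal_scores)
    rank = max(1, math.ceil(q * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def max_threshold(normal_scores: Sequence[float]) -> float:
    """Largest anomaly score observed on normal traffic."""
    return max(normal_scores)


def rates(normal_scores: Sequence[float], anomaly_scores: Sequence[float],
          threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false alarm rate) for a given threshold."""
    detected = sum(1 for s in anomaly_scores if s > threshold)
    false_alarms = sum(1 for s in normal_scores if s > threshold)
    return detected / len(anomaly_scores), false_alarms / len(normal_scores)
```

On a toy set of 100 evenly spread normal scores, the percentile threshold leaves a single normal flow above it, which is the roughly 1% FAR mentioned in the text, while the maximum threshold yields 0% FAR at the cost of missing anomalies whose scores fall inside the normal range.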


NVIDIA GeForce RTX 3090 GPU, and 128 GB RAM as a cloud device. Detection speed experiments were conducted with and without GPU support to account for the fact that edge (and sometimes fog) nodes may not have a GPU device, as is the case with the Raspberry Pi 4 and the UP Xtreme board. The NVIDIA Jetson Xavier NX and the personal computer were given a GPU warm-up stage of 5 seconds immediately before starting the experiment. The mean amount of processed flows per second was computed given ten runs. All experiments were implemented in Python 3.8 with PyTorch version 1.8 without any improvement to speed up inference. Table IX presents the detection speed results with n = 2. The results show that ARCADE outperformed D-PACK in all environments, with ARCADE being approximately 8, 3, 2.8, 2, 2.16 times faster on the Raspberry Pi 4, UP Xtreme, NVIDIA Jetson, Threadripper CPU, and RTX 3090 GPU, respectively. ARCADE can process over 1.9M flows per second on the RTX 3090 GPU. An “optimal model” in an online network detection scenario cannot be well-defined since there is a clear trade-off between the model's effectiveness and complexity. In this sense, the proposed model can be easily adapted by changing the number of layers and channels, together with the input size, to better suit the needs of a particular environment given its processing power capabilities.

Fig. 6. The anomaly score distributions for normal and abnormal traffic from the test set of each considered dataset. Anomaly scores were computed using the best model's parameters over 10-folds for each method. The DR was calculated based on the 99th percentile threshold of the normal traffic scores. Blue and red bars represent normal and abnormal traffic flows, respectively.

TABLE IX
MEAN DETECTION SPEED COMPARISON BETWEEN ARCADE AND D-PACK

Fig. 7. Comparison between efficiency, effectiveness, and model size of ARCADE and D-PACK. We report AUROC (%) vs. floating-point operations (FLOPS) required for a single forward pass with n ∈ {2, 5}. The size of each circle corresponds to the model size (number of parameters). ARCADE achieves higher AUROC with approximately 20 times fewer parameters than D-PACK.

VI. CONCLUSION
In this work, we introduced ARCADE, a novel unsupervised DL method for network anomaly detection that automatically builds the normal traffic profile based on raw network bytes as input without human intervention for feature engineering. ARCADE is composed of a 1D-CNN AE that is trained exclusively on normal network traffic flows and regularized through a WGAN-GP adversarial strategy. We experimentally demonstrated that the proposed adversarial regularization improves the performance of the AE. The proposed adversarial regularization strategy can be applied to any AE independently of its model architecture, and it can also be applied to other anomaly detection tasks. Once the AE is trained on the normal


TABLE X
ENCODER, DECODER, AND CRITIC ARCHITECTURE

two initial packets of a network flow as input, ARCADE can detect most of the malicious traffic with nearly 100% F1-score, except for HTTP DoS and DDoS, where 68.70% and 66.61% F1-scores were obtained. While considering five packets, ARCADE achieved 91.95% and 93.19% F1-scores for HTTP DoS and DDoS, respectively. Experiments show that the proposed AE model is 20 times smaller than baselines and still presents significant improvement in accuracy, model complexity, and detection speed.

APPENDIX
See Tables X and XI.

REFERENCES
[1] “Cisco annual Internet report (2018–2023),” Cisco, San Jose, CA, USA, White Paper, 2020.
[2] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A comprehensive review,” J. Netw. Comput. Appl., vol. 36, no. 1, pp. 16–24, 2013.
[3] J. V. V. Silva, N. R. de Oliveira, D. S. Medeiros, M. A. Lopez, and D. M. Mattos, “A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms,” Ann. Telecommun., vol. 77, pp. 555–571, Feb. 2022.
[4] Z. Ahmad, A. S. Khan, C. W. Shiang, J. Abdullah, and F. Ahmad, “Network intrusion detection system: A systematic study of machine learning and deep learning approaches,” Trans. Emerg. Telecommun. Technol., vol. 32, no. 1, 2021, Art. no. e4150.
[5] T. Truong-Huu et al., “An empirical study on unsupervised network anomaly detection using generative adversarial networks,” in Proc. 1st ACM Workshop Security Privacy Artif. Intell., 2020, pp. 20–29.
[6] R.-H. Hwang, M.-C. Peng, C.-W. Huang, P.-C. Lin, and V.-L. Nguyen, “An unsupervised deep learning model for early network traffic anomaly detection,” IEEE Access, vol. 8, pp. 30387–30399, 2020.
[7] M. Rudolph, B. Wandt, and B. Rosenhahn, “Same same but DifferNet: Semi-supervised defect detection with normalizing flows,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 1907–1916.
[8] D. Gong et al., “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1705–1714.
[9] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang, “Deep structured energy based models for anomaly detection,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1100–1109.
[10] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” 2015, arXiv:1511.06434.
[11] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.
[12] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
[13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5769–5779.
[14] L. Vu, C. T. Bui, and Q. U. Nguyen, “A deep learning based method for handling imbalanced problem in network traffic classification,” in Proc. 8th Int. Symp. Inf. Commun. Technol., 2017, pp. 333–339.
[15] R. Doriguzzi-Corin, S. Millar, S. Scott-Hayward, J. Martinez-del Rincon, and D. Siracusa, “Lucid: A practical, lightweight deep learning solution for ddos attack detection,” IEEE Trans. Netw. Service Manage., vol. 17, no. 2, pp. 876–889, Jun. 2020.
[16] W. Wang et al., “Hast-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection,” IEEE Access, vol. 6, pp. 1792–1806, 2017.
[17] W. Wang, M. Zhu, X. Zeng, X. Ye, and Y. Sheng, “Malware traffic classification using convolutional neural network for representation learning,” in Proc. Int. Conf. Inf. Netw., 2017, pp. 712–717.

TABLE XI
SPECIFICATIONS OF CONSIDERED ENVIRONMENTS FOR DETECTION SPEED EXPERIMENTS

network traffic using the proposed approach, ARCADE can
[18] Y. Yu, J. Long, and Z. Cai, “Network intrusion detection through stack-
effectively detect unseen network traffic flows from attacks ing dilated convolutional autoencoders,” Security and Communication
and malware. Our results suggest that even considering only Networks, vol. 2017, Nov. 2017, Art. no. 4184196.


[19] W. Wang, M. Zhu, J. Wang, X. Zeng, and Z. Yang, “End-to-end encrypted traffic classification with one-dimensional convolution neural networks,” in Proc. Int. Conf. Intell. Security Informat., 2017, pp. 43–48.
[20] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “Mobile encrypted traffic classification using deep learning: Experimental evaluation, lessons learned, and challenges,” IEEE Trans. Netw. Service Manage., vol. 16, no. 2, pp. 445–458, Jun. 2019.
[21] M. Lotfollahi, M. J. Siavoshani, R. S. H. Zade, and M. Saberian, “Deep packet: A novel approach for encrypted traffic classification using deep learning,” Soft Comput., vol. 24, no. 3, pp. 1999–2012, 2020.
[22] T. Ahmad, D. Truscan, J. Vain, and I. Porres, “Early detection of network attacks using deep learning,” 2022, arXiv:2201.11628.
[23] P. Bergmann, S. Löwe, M. Fauser, D. Sattlegger, and C. Steger, “Improving unsupervised defect segmentation by applying structural similarity to autoencoders,” 2018, arXiv:1807.02011.
[24] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, pp. 600–612, 2004.
[25] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for anomaly detection: A review,” ACM Comput. Surveys, vol. 54, no. 2, pp. 1–38, 2021.
[26] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in Proc. Int. Conf. Inf. Process. Med. Imag., 2017, pp. 146–157.
[27] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar, “Efficient GAN-based anomaly detection,” 2018, arXiv:1802.06222.
[28] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon, “Ganomaly: Semi-supervised anomaly detection via adversarial training,” in Proc. Asian Conf. Comput. Vis., 2018, pp. 622–637.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[30] Z. Xiao, Q. Yan, and Y. Amit, “Do we really need to learn representations from in-domain data for outlier detection?” 2021, arXiv:2105.09270.
[31] L. Bergman, N. Cohen, and Y. Hoshen, “Deep nearest neighbor anomaly detection,” 2020, arXiv:2002.10445.
[32] T. Reiss, N. Cohen, L. Bergman, and Y. Hoshen, “Panda: Adapting pretrained features for anomaly detection and segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2806–2814.
[33] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Comput. Security, vol. 31, no. 3, pp. 357–374, 2012.
[34] C. D. McDermott, F. Majdani, and A. V. Petrovski, “BotNet detection in the Internet of Things using deep learning approaches,” in Proc. Int. Joint Conf. Neural Netw., 2018, pp. 1–8.
[35] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001.
[36] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proc. Int. Conf. Data Min., 2008, pp. 413–422.
[37] S. Kastryulin, D. Zakirov, and D. Prokopenko, “PyTorch image quality: Metrics and measure for image quality assessment.” 2019. [Online]. Available: https://github.com/photosynthesis-team/piq
[38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.

Willian Tessaro Lunardi (Member, IEEE) received the Ph.D. degree in computer science from the University of Luxembourg. He is a Senior Research Scientist with the Secure Systems Research Centre, Technology Innovation Institute, Abu Dhabi, UAE. He has published over 25 research papers in international scientific journals, conferences, and book chapters. His main area of research is machine learning and combinatorial optimization. He is currently working on machine learning for network security, physical layer security, and jamming detection.

Martin Andreoni Lopez (Member, IEEE) graduated as an electronic engineer from the Universidad Nacional de San Juan, Argentina, in 2011, and received the master’s degree in electrical engineering from the Federal University of Rio de Janeiro (COPPE/UFRJ) in 2014 and the Doctoral degree from the Teleinformatics and Automation Group, COPPE/UFRJ and the Phare team of Laboratoire d’Informatique Paris VI, Sorbonne Université, France, in 2018. He is a Network Security Researcher with the Secure System Research Center, Technology Innovation Institute, Abu Dhabi, UAE. He was a Researcher with Samsung R&D Institute Brazil. He has coauthored several publications and patents in security, virtualization, traffic analysis, and big data.

Jean-Pierre Giacalone received the engineering degree from the École nationale supérieure d’électrotechnique, d’électronique, d’informatique, d’hydraulique et des télécommunications, Toulouse, France. He is the Vice President of Secure Communications Engineering with the Secure Systems Research Centre, Technology Innovation Institute, Abu Dhabi, UAE. He is responsible for researching secure communications, focusing on improving the resilience of cyber-physical and autonomous systems. He has worked as an Expert in software architecture for advanced driving assistance systems with Renault and as a Principal Engineer and Architect within Intel’s Mobile Systems Technologies Group. He holds 19 patents and has coauthored 15 research papers accepted for publication in international journals and conference proceedings.
