Autoencoder-Based Anomaly Detection in Network Traffic
Abstract—Due to the continuously increasing number of resources and the growing availability of data in the cloud, threats to the security of computer networks and IT systems have become critical. Threat detection systems based on deep neural networks and anomaly detection are trained on data describing normal activity, so that the network can recognize unusual patterns and behaviours in the event of an attack or an attempt to infiltrate a given IT infrastructure. This paper presents the results of developing an autoencoder-based neural network for anomaly detection in network packet data. The network was trained on data from the HIKARI-2021 dataset. The autoencoder aims to learn representations of normal network traffic and to associate this type of traffic with a minimal reconstruction error. The obtained results were compared with those achieved by the authors of other works. High accuracy and sensitivity were achieved at the cost of rather low precision, resulting in many false-positive results. A simple algorithm based on a single threshold value proved efficient but limited in terms of effectiveness. This problem can be addressed by changing the method of calculating the individual components of the threshold vector, using only a subset of features, and deriving multiple vectors, one for each class separately, which is described and analyzed in more detail in this paper.

Index Terms—anomaly detection, deep learning, cybersecurity, autoencoder, threat detection
I. INTRODUCTION

The security of IT systems has been a critical concern since their initial commercial deployment. With constantly evolving environments and the emergence of new types of cyberattacks, relying solely on security systems based on predefined behavioural patterns is no longer feasible. One promising solution to this challenge is the application of machine learning (ML) and artificial intelligence (AI) techniques in cybersecurity. Security vulnerabilities can lead to the loss of data, financial assets, and reputation, which can be difficult to recover from. To mitigate these risks, machine learning approaches have been increasingly employed, as they can process large datasets using sophisticated algorithms.

Machine learning techniques can be categorized into three primary types: supervised, unsupervised, and reinforcement learning. One of the significant advantages of ML is its ability to rapidly and effectively analyze vast quantities of data, a necessity in cybersecurity, where IT systems generate massive volumes of logs, events, and entries.

A key area within machine learning is deep learning, a subfield that mimics the human learning process. Deep learning models, particularly neural networks, can process large datasets, which is essential for cybersecurity applications. Examples of AI-based systems for threat detection in IT networks include security information and event management (SIEM) platforms and intrusion detection systems (IDS). By leveraging deep learning, these systems can uncover complex patterns in network traffic, such as port activity or packet header details, which may be imperceptible to human analysts. This capability to detect subtle correlations in network data significantly enhances the effectiveness of anomaly detection in computer network security.

One technique worth mentioning in the context of security is anomaly detection [1]. Machine learning algorithms, both traditional and deep, can be trained to recognize normal activity in a given system; when presented with a sample of data that stands out, the algorithm should then be able to tell the difference. These techniques enable the detection of subtle deviations from normal network behaviour, which may indicate a range of security threats, including Distributed Denial of Service (DDoS) attacks, unauthorized access, or data exfiltration.

In this work, the publicly available HIKARI-2021 dataset will be briefly analyzed. Then, a deep learning-based autoencoder algorithm will be presented, followed by a comparison with other approaches applied to the same dataset and their respective results. Finally, potential improvements and directions for future research will be proposed.

II. DATASET

The OSI (Open Systems Interconnection) model plays a critical role in IT security. It defines communication between the various layers of an IT system, with data being transmitted from the application layer, which is closest to the user, through the transport layers, down to the physical layer. This model is crucial for security because threats can be detected at different stages of communication. In the application layer, for example, threats such as malicious queries to a server or attempts to steal user credentials can be identified. The datasets analyzed in this work primarily focus on the network layer.

Several publicly available datasets were examined for this study. Despite the rapid evolution of cybersecurity, many studies still rely on outdated datasets, such as the KDD99 dataset, which is 25 years old at the time of this writing.
Given the constant changes in the cybersecurity landscape, utilizing more up-to-date datasets that reflect current standards and attack patterns is likely to yield more accurate and meaningful research outcomes.

The HIKARI dataset was selected for further analysis due to its size, its number of features, and its focus on encrypted data. It is also the most recent dataset among those analyzed in this paper, released in August 2021. The dataset was generated in a controlled laboratory environment, simulating real-world user behaviour and incorporating synthetic attack scenarios [5]. It contains over 80 features and approximately 250,000 records, each labelled to indicate whether it represents an attack. The HIKARI dataset includes three types of attacks: XMRIGCC CryptoMiner, Bruteforce-XML, and Bruteforce.

Most of the HIKARI features come from the similar CICIDS-2017 dataset. They include information on basic flow features (duration, IP addresses), packet lengths, and statistics such as the mean and standard deviation of packet and header lengths. Additionally, source and destination IP addresses and ports are provided [5].
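To make the dataset description concrete, the sketch below loads the HIKARI-2021 flow records with Pandas and inspects the labels. The file name, the column names and the label encoding are assumptions based on common distributions of the dataset, not values taken from the paper.

```python
# Minimal sketch of loading and inspecting the HIKARI-2021 flow records.
# File name, column names ("Label", "traffic_category") and the label
# encoding are assumptions; they may differ between dataset releases.
import pandas as pd

df = pd.read_csv("ALLFLOWMETER_HIKARI2021.csv")  # hypothetical file name

print(df.shape)                         # roughly 250,000 rows, 80+ columns
print(df["Label"].value_counts())       # assumed: 0 = normal traffic, 1 = attack
print(df["traffic_category"].unique())  # e.g. Benign, Bruteforce, Bruteforce-XML, XMRIGCC CryptoMiner
```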
III. IMPLEMENTATION OF THE ANOMALY DETECTION ALGORITHM

One type of network that can be used for anomaly detection tasks is an autoencoder. This is a specific type of neural network whose input and output have the same size. The main task of an autoencoder is to reconstruct the input data. It consists of two parts, an encoder and a decoder, with the encoder collecting the most important features of the set in a reduced form, presented using a smaller number of dimensions. The encoder learns this feature representation by using narrowed deep layers; according to the manifold hypothesis, high-dimensional data can be represented with fewer dimensions [6]. The decoder is responsible for reconstructing the input features from the encoded information in the hidden layers. Figure 1 presents an example of an autoencoder with three hidden layers. For simplicity of the mathematical description, we can reduce the problem to a single hidden layer. The encoder can then be defined as the following expression:

H = σ(W_x X + b_x),  (1)

where H is the representation of the data in the hidden layer, σ is the activation function, W_x is the weight matrix, X is the input vector, and b_x is the bias vector [8]. The encoder transforms the input vector X into the hidden representation H using the activation function σ. The decoder then aims to reconstruct the input from the hidden representation H:

X̂ = σ(W_x̂ H + b_x̂),  (2)

which yields a reconstruction X̂ of X. The difference between the reconstructed vector X̂ and X produces the reconstruction error:

r = ∥X − X̂∥.  (3)
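As a minimal numerical illustration of equations (1)-(3), the sketch below applies a single-layer encoder and decoder to one input vector. The weights, biases and activation function are random placeholders, not parameters of the trained model.

```python
# Minimal numerical sketch of equations (1)-(3) for a single hidden layer.
# Weights and biases are random placeholders, not the trained model's parameters.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 82, 16

X = rng.random(n_features)                                  # one input vector
W_x, b_x = rng.normal(size=(n_hidden, n_features)), np.zeros(n_hidden)
W_xhat, b_xhat = rng.normal(size=(n_features, n_hidden)), np.zeros(n_features)

sigma = np.tanh                                             # activation function

H = sigma(W_x @ X + b_x)                                    # (1) encoding
X_hat = sigma(W_xhat @ H + b_xhat)                          # (2) reconstruction
r = np.linalg.norm(X - X_hat)                               # (3) reconstruction error
print(r)
```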
Samples whose reconstruction error exceeds a chosen threshold are treated as anomalies. The threshold is selected with the help of the ROC curve, aiming for the lowest possible number of false positive samples while maintaining the highest possible number of true positives.

As part of the experiment, a neural network based on the architecture from Table I was implemented. This configuration comes from the work by Catillo et al. [3]. The experiment also tested other configurations from the mentioned work, but this one showed the best results.

TABLE I
USED NEURAL NETWORK ARCHITECTURE

Layer type   Activation function   N. of dimensions
Input        -                     82
Hidden 1     tanh                  16
Hidden 2     sigmoid               4
Hidden 3     sigmoid               16
Output       tanh                  82
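A sketch of the Table I architecture expressed in Keras is shown below. The optimizer and the mean-squared-error reconstruction loss are assumptions, since the paper does not specify them.

```python
# Sketch of the Table I architecture in Keras. The optimizer and loss are
# assumptions (the paper does not state them); mean squared error is a
# common choice for autoencoder reconstruction.
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features: int = 82) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),             # input layer, 82 features
        layers.Dense(16, activation="tanh"),          # hidden 1
        layers.Dense(4, activation="sigmoid"),        # hidden 2 (bottleneck)
        layers.Dense(16, activation="sigmoid"),       # hidden 3
        layers.Dense(n_features, activation="tanh"),  # output layer, 82 features
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

autoencoder = build_autoencoder()
autoencoder.summary()
```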
Fig. 2. Division of the HIKARI dataset into training and test data.

The dataset was divided into normal and attack data. Figure 2 shows the division of the HIKARI dataset for further training. Redundant features, including IP addresses and possible identifiers, were removed from it. 80% of the normal data was used for training, whereas the attack data and the remaining 20% of the normal data were used for testing, as shown in Figure 2. The model training was performed on the resulting data set: 10% of the data was used for validation, the batch size was 256, and the number of epochs was 30. An attempt was also made to select other values for these hyperparameters, but the default ones turned out to be the best; the result of this comparison is shown in Table II.
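The split and training procedure described above could be sketched as follows. The column names, the list of dropped identifier columns and the min-max scaling step are assumptions; the paper only states that identifiers were removed and how the normal and attack records were divided.

```python
# Sketch of the data split and training procedure described above. Column
# names, the dropped identifier columns and the min-max scaling step are
# assumptions, not details taken from the paper.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("ALLFLOWMETER_HIKARI2021.csv")                   # hypothetical file name
drop_cols = ["uid", "originh", "responh", "originp", "responp"]   # assumed identifier columns
features = df.drop(columns=drop_cols + ["Label", "traffic_category"], errors="ignore")
labels = df["Label"]

normal = features[labels == 0]
attacks = features[labels == 1]

# 80% of normal traffic for training; attacks plus the remaining 20% for testing.
train_normal, test_normal = train_test_split(normal, test_size=0.2, random_state=42)
test_set = pd.concat([test_normal, attacks])
test_labels = [0] * len(test_normal) + [1] * len(attacks)

scaler = MinMaxScaler().fit(train_normal)
X_train = scaler.transform(train_normal)
X_test = scaler.transform(test_set)

# Autoencoder built as in the previous sketch; 10% of the data for validation.
autoencoder = build_autoencoder(n_features=X_train.shape[1])
history = autoencoder.fit(X_train, X_train, validation_split=0.1,
                          batch_size=256, epochs=30, shuffle=True)
```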
TABLE II
THE IMPACT OF HYPERPARAMETER CHANGES ON THE RESULTS OF THE NEURAL NETWORK

                       Batch size       Epochs
Param       Default    128     512      20      40
Accuracy    94%        94%     94%      94%     84%
Precision   81%        81%     81%      81%     60%
Recall      99%        99%     99%      99%     99%
F1          89%        89%     89%      89%     74%

The default configuration refers to 30 epochs and a batch size of 256. Standard metrics were used for evaluation. The Keras, Numpy and Pandas libraries were used to implement the algorithm. Additionally, tools for calculating model statistics were taken from the Sklearn machine learning package. The code was implemented in the Jupyter Notebook environment. The model was trained on a MacBook Pro with a six-core Intel Core i9 processor and 32 GB of RAM.
IV. OBTAINED RESULTS

For the trained model, it is crucial to determine the reconstruction error threshold. For this purpose, the ROC method mentioned in the previous section was used. The curve is visible in Figure 3; the marked point represents the best possible configuration obtained with the trained model, corresponding to a threshold of exactly 0.042.

Fig. 3. ROC curve: the line at the top of the graph represents the model, and the black point is the best possible configuration.
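A minimal sketch of the threshold selection and evaluation step is given below, reusing the model and test data from the previous sketches. Selecting the ROC point closest to the top-left corner is an assumption, as the paper does not state the exact rule used to pick the best configuration.

```python
# Sketch of threshold selection from the ROC curve and metric computation.
# Assumes `autoencoder`, `X_test` and `test_labels` from the previous sketches;
# picking the ROC point closest to (0, 1) is an assumption.
import numpy as np
from sklearn.metrics import roc_curve, accuracy_score, precision_score, recall_score, f1_score

X_pred = autoencoder.predict(X_test)
errors = np.linalg.norm(X_test - X_pred, axis=1)       # reconstruction error per record

fpr, tpr, thresholds = roc_curve(test_labels, errors)
best = np.argmin(np.sqrt(fpr ** 2 + (1 - tpr) ** 2))   # point closest to the top-left corner
threshold = thresholds[best]

y_pred = (errors > threshold).astype(int)              # above the threshold -> anomaly
print("threshold:", threshold)
print("accuracy:", accuracy_score(test_labels, y_pred))
print("precision:", precision_score(test_labels, y_pred))
print("recall:", recall_score(test_labels, y_pred))
print("F1:", f1_score(test_labels, y_pred))
```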
The reconstruction error threshold is shown in Figure 4. Points representing the average value of the reconstruction error for individual records are shown as brighter points for normal data and as darker points for anomalies. Anything above the reconstruction error threshold, shown as a dark dashed line, should be an anomaly, and anything below it should be normal activity. As can be seen, not all points were correctly reconstructed by the decoder.

Fig. 4. Bright dots represent normal data, while dark dots represent anomalies. The dashed line represents the reconstruction error threshold.

Table III shows the results obtained with our method and compares them with other authors' works. It contains comparisons with algorithms from the works of Fernandes et al. [4] and Vitorino et al. [9]. These include implementations of random forest, extreme gradient boosting, KNN, and MLP. It can be seen that our algorithm achieved a high level of recall and accuracy. A problem was the low precision, which was about 80%.

TABLE III
COMPARISON WITH THE RESULTS FROM RELATED WORKS

                       Fernandes        Vitorino
Parameter   Our        RF      XGB      KNN     MLP
Accuracy    94%        98%     96%      98%     90%
Precision   81%        99%     99%      98%     90%
Recall      99%        69%     44%      98%     90%
F1          89%        81%     61%      98%     89%

Figure 4 shows a single value of the reconstruction error threshold, taken as the average over the individual data components. In this figure, we can see that most of the points with anomalies lie above the line representing the reconstruction error threshold. On the other hand, a large number of points with normal data also lie above it, which means false positive alerts. An attempt was also made to use the method from the work of Torabi et al. [8], but it gave poor results, below 80% accuracy. This stems from imperfections of the data in the training set: small deviations in the values of individual features are enough to produce significantly overstated thresholds, which the data must then exceed to be classified as anomalies. It is worth mentioning that the authors of the aforementioned work developed a method based on many classes; our algorithm would need to be refined for this purpose.

Several improvements to the presented approach can be identified:

• Using multiple reconstruction error threshold vectors instead of just one. For each of the individual classes in a set, a reconstruction error threshold vector can be determined, which should improve the classification of individual threats. With a single error threshold value, a large number of false positive events occur. Studies have shown that a single reconstruction error threshold vector deviates significantly from the test data: due to its high values for individual components, practically all samples are classified as normal activity. A single discrepancy in the data is enough for the vector components to take on very high values relative to the set.

• Classification based on only some of the features. Not all features have the same impact on whether a given sample is an anomaly. Using the chi-square method to analyze individual features and selecting them for the classification process should yield better results in threat detection and reduce the number of false positives.

• Changing the method of calculating the components of the error threshold vector. In this paper, each component was chosen as the maximum value over all input records for the corresponding feature. An alternative is to calculate, for each feature, the mean over all data and then the standard deviation (a minimal sketch of this idea, together with the per-class variant from the first point, follows this list).
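The sketch below illustrates the threshold-vector idea from the last point: a per-feature threshold computed as the mean plus a multiple of the standard deviation of the reconstruction errors on normal data, with the per-class variant from the first point indicated in a comment. The factor k = 3 and the "any component above its threshold" decision rule are assumptions, not values from the paper.

```python
# Sketch of a per-feature reconstruction error threshold vector computed as
# mean + k * std over normal training data. The factor k = 3 and the decision
# rule ("any feature exceeds its threshold") are assumptions.
import numpy as np

def threshold_vector(autoencoder, X_normal, k: float = 3.0) -> np.ndarray:
    errors = np.abs(X_normal - autoencoder.predict(X_normal))   # per-feature errors
    return errors.mean(axis=0) + k * errors.std(axis=0)          # one threshold per feature

def is_anomaly(autoencoder, X, thresholds) -> np.ndarray:
    errors = np.abs(X - autoencoder.predict(X))
    return (errors > thresholds).any(axis=1)                     # flag if any feature exceeds

# Per-class variant (first bullet): derive one vector per traffic category, e.g.
# {c: threshold_vector(autoencoder, X_train_by_class[c]) for c in classes},
# where X_train_by_class is a hypothetical mapping from class name to its records.
```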
REFERENCES

[1] Ahmed M., Mahmood A. N. and Hu J., "A survey of network anomaly detection techniques," Journal of Network and Computer Applications, 2016, 60: 19-31.
[2] Bradley A. P., "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, 1997.
[3] Catillo M., Pecchia A. and Villano U., "AutoLog: Anomaly Detection by Deep Autoencoding of System Logs," Expert Systems with Applications, 2021.
[4] Fernandes R., Silva J., Ribeiro Ó., Portela I. and Lopes N., "The impact of identifiable features in ML Classification algorithms with the HIKARI-2021 Dataset," IEEE, 2023.
[5] Ferriyan A., Thamrin A. H., Takeda K. and Murai J., "Generating network intrusion detection dataset based on real and encrypted synthetic attack traffic," IEEE, 2021.
[6] Goodfellow I., Bengio Y. and Courville A., "Deep Learning," The MIT Press, 2016.
[7] Holly S., "Autoencoder based Anomaly Detection and Explained Fault Localization in Industrial Cooling Systems," Cornell University, 2022.
[8] Torabi H., Mirtaheri S. L. and Greco S., "Practical autoencoder based anomaly detection by using vector reconstruction error," Springer Open, 2023.
[9] Vitorino J., Silva M., Maia E. and Praça I., "Reliable Feature Selection for Adversarially Robust Cyber-Attack Detection," Cryptography and Security, 2024.