Applied Sciences: Fficient Distributed Preprocessing Model For
Article
Efficient Distributed Preprocessing Model for
Machine Learning-Based Anomaly Detection over
Large-Scale Cybersecurity Datasets
Xavier Larriva-Novo * , Mario Vega-Barbas , Víctor A. Villagrá , Diego Rivera ,
Manuel Álvarez-Campana and Julio Berrocal
ETSI Telecomunicación, Universidad Politécnica de Madrid (UPM), Avda. Complutense 30,
28040 Madrid, Spain; [email protected] (M.V.-B.); [email protected] (V.A.V.);
[email protected] (D.R.); [email protected] (M.Á.-C.); [email protected] (J.B.)
* Correspondence: [email protected]
Received: 28 April 2020; Accepted: 12 May 2020; Published: 15 May 2020
Abstract: New computational and technological paradigms that currently guide developments in the
information society, i.e., Internet of things, pervasive technology, or Ubicomp, favor the appearance
of new intrusion vectors that can directly affect people’s daily lives. This, together with advances
in techniques and methods used for developing new cyber-attacks, exponentially increases the
number of cyber threats which affect the information society. Because of this, the development and
improvement of technology that assists cybersecurity experts to prevent and detect attacks arose
as a fundamental pillar in the field of cybersecurity. Specifically, intrusion detection systems are
now a fundamental tool in the provision of services through the internet. However, these systems
have certain limitations, i.e., false positives, real-time analytics, etc., which require their operation
to be supervised. Therefore, it is necessary to offer architectures and systems that favor an efficient
analysis of the data handled by these tools. In this sense, this paper presents a new model of data
preprocessing based on a novel distributed computing architecture focused on large-scale datasets
such as UGR’16. In addition, the paper analyzes the use of machine learning techniques in order
to improve the response and efficiency of the proposed preprocessing model. Thus, the solution
developed achieves good results in terms of computer performance. Finally, the proposal shows
the adequateness of decision tree algorithms for training a machine learning model by using a large
dataset when compared with a multilayer perceptron neural network.
Keywords: intrusion detection; machine learning; decision trees; multilayer perceptron; data
preprocessing; large-scale datasets; cybersecurity
1. Introduction
The inclusion of computational elements into the daily life of people, i.e., Internet of things,
wearable technology, or Ubicomp, applied to a sensitive user context such as healthcare, offers new
intrusion vectors that directly affect people’s lives [1]. Currently, the use of sophisticated techniques
and technology to develop new and more efficient cyber-attacks is exponentially increasing the number
of cyber-threats.
The improvement of methods for preventing and detecting cyberattacks acquired great importance,
increasing the impact on technological developments and becoming a fundamental pillar of the digital
era. In this sense, the integration of artificial intelligence (AI) in the context of cybersecurity favored
this necessary improvement [2]. The integration of AI and cybersecurity can be applied to different
cybersecurity systems, e.g., to prioritize events using resilient incident response platforms in Security
Operation Centers (SOCs), to automate security analysis, or to detect and predict threats before they
materialize. The integration of AI can be especially positive in intrusion detection systems (IDSs),
which process a huge amount of network traffic from Internet of things (IoT), wearables, wireless
sensors, host–host-based sensors, and network computing.
IDSs are software systems that monitor and analyze the behavior of networks and systems with
the objective of detecting some possible intrusion. There are two classes of IDSs: those based on
signature detection and those based on anomaly detection. IDSs that are based on signature detection
apply different rules to detect an attack over a network or a host. Alternatively, an anomaly-based
detection IDS can differentiate between normal and anomalous flow. However, the problem of false
alarms (false positives) is a major concern in the use of IDS [3], which consequently leads to the need for
an expert in cybersecurity to evaluate the results and investigate the veracity of these false positives [4].
This process takes a considerable amount of time with a need for qualified people.
Technology and techniques used to implement IDSs are mostly publicly known; thus, attackers
are always improving their methods, trying to find a way to bypass IDSs without being detected [5].
This led to the materialization of a new generation of IDSs which, through the use of AI, try to improve
their efficiency and effectiveness via anomaly-based IDS.
Nevertheless, many of these research works were usually based on KDD’99 and DARPA datasets,
which have twenty years of history and do not represent the current state of the art of systems, attacks,
and cybersecurity [6–9]. These datasets are small and do not represent large-scale data compared
with other recent datasets such as UGR’16 [10]. In Reference [11], there is a representation of the
most used datasets in the last decade for machine learning applications related to intrusion detection.
For instance, DARPA 98, DARPA 99, and DARPA 2000 datasets [12] constituted 8.6% of usage and
the GureKddcup dataset [13] constituted 1.4%, a percentage shared by the ISCX2012 dataset [14] and
UNB-CIC [15]. Other used datasets are NSL-KDD [16] with 11.6% of usage, KDD-99 [17] with 63.8% of
usage, and finally generated and simulated datasets with 11.6%. These datasets mentioned above do
not represent the current state of the art related to information security.
In recent years, AI and specifically machine learning (ML) techniques were implemented to solve
problems related to anomaly-based IDSs, in order to improve attack detections. However, training
ML-based IDSs is costly in terms of time and computational requirements due to the large amount of
data needed to be processed.
The use of ML methods for developing a better IDS is a trending research topic which provided
good results. Nevertheless, some stages of the ML process must be optimized, such as data preparation
(preprocessing), to obtain good results in terms of accuracy. To achieve an optimal preprocessing model
for large-scale datasets, a new model of preprocessing and training IDS with a truly large dataset is
presented in this research.
This paper introduces a new way to preprocess large-scale datasets focused on IDSs as presented
in Section 2. Furthermore, in Sections 3–5, we explain the problem statement and the proposed
methodology, defining the new architecture presented in this paper, oriented to data preprocessing.
These sections also include the architecture proposed for applying the multilayer perceptron (MLP)
and decision trees in anomaly-based IDSs. Sections 6 and 7 provide the obtained results after applying
our proposal to large-scale datasets. For our tests, we used the UGR’16 dataset, as well as a comparison
with its execution locally, concluding with a comparison of ML algorithms such as deep learning and
non-deep learning algorithms, i.e., MLP and decision tree, respectively.
2. Related Work
Currently, there are several approaches of how to apply ML to cybersecurity based on a network
sensor response, specifically for intrusion detection. These different approaches consider diverse
parameters such as the classification of the attacks, network flow variables that should be included,
solutions based on data mining, or solutions directly based on various ML algorithms with the aim of
finding the best model.
Appl. Sci. 2020, 10, 3430 3 of 19
In Reference [5], the authors established the basis of intrusion detection systems. In that work, one
of the main contributions was defining how IDSs should work in order to detect more attacks easily by
analyzing the detecting mechanism. Another contribution was a classification of the different attacks.
As stated, by establishing the differences between the types of attacks, the detection mechanism can be
fine-tuned, resulting in better detection. In addition, those researchers tried to approximate a
good ML model for IDSs by using different techniques. In those studies, the best results were achieved
by neural networks, closely followed by decision trees and nearest neighbor.
Following the approach started by Reference [5], the authors of Reference [18] tried going deeper
with the objective of proposing a model with optimal parameters for neural networks. Their research
was based on previous works that, however, did not reach truly optimal models. To solve this problem,
they adopted an approach based on a five-stage model with the aim of studying the majority of
possibilities. This approach specifies a comparison to get the best parameters related to the best dataset
features and the way that those features must be normalized to get the best of them. Furthermore,
they specified the way in which the neural network must be built having to account for the number
of hidden layers, the number of nodes inside each one of the layers, the activation function, and
some other parameters for the best adjustment of the neural network. Afterward, the research made
a comparison with a combination of different solutions, whether proposed by them or by other
researchers. The conclusion was clear: they identified the best activation function to use, the rectified
linear unit (ReLU); the formula to calculate the architecture of the neural network model; and how to
normalize each type of data included in the dataset. The rules defined are shown in Equations (1) [19],
(2) [20], (3), and (4) [21], where H is the number of hidden layers of an IDS, input is the number of
entries, and output is the number of exits in the neural network model.
Different approaches were presented using techniques such as Apache Spark to improve the
preprocessing of the dataset and its performance [5,7,22]. However, the used datasets cannot be
currently considered as large-scale and did not include real background traffic.
The most common dataset used in nearly all the studies analyzed is KDD’99, as mentioned before
in Section 1. KDD’99 is a dataset that includes a lot of useful features; however, it is not a good
representation of the most modern attacks and cannot be used as information to train a modern IDS
because of its type and the source of the flows it contains (synthetic) [6]. The same happens with the
DARPA dataset. Reference [10] introduced a new, modern, and up-to-date dataset with real network
traffic taken directly from an internet service provider for citizens as customers (TIER-3 ISP): the
UGR’16 dataset.
The information included in the dataset was anonymized providing real and complete information
to build a model. This part of the dataset is called “calibration data”. The dataset also includes some
parts to test the correct training of an IDS. This part of the dataset was specially created to check that
the IDS behaves as expected, and it includes not only real network traffic but also synthetic traffic,
since these pieces include more attacks in proportion to real background traffic.
Despite all the efforts made in this area of research, there are no studies that use complete or truly
large datasets, whether because of the number of included traffic flows, the large number of features
that make up the dataset, or both. This is mainly due to the problems already described, related to the
handling of large datasets, which must be resolved in order to break the barrier and develop more
reliable IDSs [11].
Finally, it is worth noting the work done in Reference [23]. There, the authors presented a generic
window-based approach to deal with heterogeneous streaming data from IoT devices in order to extend
a basic windowing algorithm for real-time data integration and to deal with the timing alignment
issue, something typical of IoT environments. However, although the postulated idea is interesting for
real-time processing of heterogeneous data carried out by an IDS, the problem in terms of the cost
related to the training of machine learning systems responsible for detecting possible attacks was
not addressed.
3. Background
UGR’16 Dataset
UGR’16 is composed of 12 different features, and each of the packets/events presented is labeled
as malicious (including the type of attack) or not. Table 1 provides a summary of the variables managed
by UGR’16.
In addition, the data are organized into two differentiated groups of instances, namely, calibration
data and test data. Calibration data refers to real background traffic and this subset is conceived to
train an ML model. For their part, test data are intended to be used to prove the correct training of the
developed ML model.
The average size of the different compressed files of the dataset is approximately 14 GB. These files
are organized into two different sets, 17 files focused on calibrating and six files for testing [24].
Unsupervised learning algorithms are those used when there are only inputs and no outputs, that
is, when information is neither classified nor labeled. Different methods are suggested to manage
data: clustering, which intends to classify data; association rule mining, which consists of looking for
rules and patterns from the data; and dimensionality reduction, which reduces the number of variable
characteristics in the dataset. Unsupervised learning is planned to capture the high-order correlation
of the observed data to look for patterns when there is no available information about a target class
label. Some examples of unsupervised learning are as follows:
Finally, there are different processes and techniques that can be applied to different ML algorithms
to enhance their work. The most common ones are association rules, anomaly detection, sparse
dictionary learning, and feature learning [13].
This research applies supervised ML algorithms as a basis. Then, the deep learning (DL) and
non-DL algorithms used in our work are presented.
4. Proposal
As we pointed out earlier, datasets are an integral part of the development of ML models and,
the larger they are, the better the results obtained. In this sense, the UGR’16 dataset contains millions
of collected network packets. However, these data are not optimized for direct use as input to an
ML algorithm; consequently, they need to be preprocessed in order to obtain a better performance.
This must be done by firstly selecting the most outstanding features of the dataset to process them
correctly by the ML algorithm. Once the preprocessing is finished, the ML model training process
should be performed. The preprocessing operation requires a large amount of resources to be carried
out correctly, which is something to consider carefully when designing it. The main objective of this
research is to present a new model that preprocesses data in an optimized way, keeping the
execution time manageable for large-scale datasets.
The first section of the proposal defines the necessary requirements for the training of an IDS
considering the preprocessing of the dataset by means of distributed computing and different hardware
architectures. In the second section, we offer different software solutions to develop the training by
means of ML techniques. The DT and MLP algorithms are chosen because they were proven to be
suitable for early investigations [18,42], as presented in Section 2. Finally, this research performs a
comparison of different tested ML algorithms to expose all the collected information.
$$f(x) = \frac{x - \sigma}{\alpha}. \tag{5}$$
Most algorithms use index encoding for each feature, as can be seen in the case of Reference [5],
which index-encoded each string feature so that every distinct value received a unique identifier. The way
that the dataset was preprocessed in our proposal is summarized in Table 2. Three features from
the dataset were normalized into numerical features (duration of the flow, number of packets in the
flow, number of bytes transmitted), five features were encoded as indexes (source/destination Internet
Protocol (IP) address, protocol, flags, type of service), two were kept without further encoding
(source/destination port), and the result was encoded as an index (result of the flow).
Feature                          Encode
-------------------------------  ----------------------
Timestamp                        Dropped
Duration of the flow             Z-score normalization
Source/Destination IP address    Index encode
Source/Destination port          No encode
Protocol                         Index encode
Flags                            Index encode
Forwarding status                Dropped
Type of service                  Index encode
Number of packets in the flow    Z-score normalization
Number of bytes transmitted      Z-score normalization
Result of the flow               Binary/Multilabel
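The two encodings listed in the table can be illustrated with a minimal Python sketch. It assumes the conventional Z-score definition (subtract the column mean and divide by the column standard deviation) and an index assigned in order of first appearance; the function names and the sample values are hypothetical and are not taken from the UGR’16 tooling.

```python
from statistics import mean, pstdev


def z_score(values):
    """Z-score normalization: (x - mean) / standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(x - mu) / sd for x in values]


def index_encode(values):
    """Map each distinct value to an integer index, in order of first appearance."""
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        encoded.append(mapping[v])
    return encoded, mapping


durations = [2.0, 4.0, 4.0, 6.0]          # illustrative flow durations
protocols = ["TCP", "UDP", "TCP", "GRE"]  # illustrative protocol column

normalized = z_score(durations)
codes, table = index_encode(protocols)
```

Both functions need the whole column before they can encode a single value, which is consistent with preprocessing the dataset feature by feature.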
The function for non-numeric values converts each non-numeric feature into a binary
vector. For example, Transmission Control Protocol (TCP), User Datagram Protocol (UDP),
Generic Routing Encapsulation (GRE), Encapsulating Security Payload (ESP), IP tunneling protocol
(IPIP), IP Version 6, and Core Based Trees (CBT) become (1,0,0,0,0,0,0), (0,1,0,0,0,0,0), (0,0,1,0,0,0,0),
( . . . ), and (0,0,0,0,0,0,1), respectively. When this change is applied, the dataset is resized and the
number of input values increases. After executing several tests with different preprocessing
configurations, the best configuration was determined to be the one presented in Table 2.
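A binary-vector encoding of this kind can be sketched in a few lines of Python. The category order, the short label "IPv6" for IP Version 6, and the function name are assumptions made only for illustration.

```python
# Category order is an assumption; "IPv6" abbreviates "IP Version 6".
PROTOCOLS = ["TCP", "UDP", "GRE", "ESP", "IPIP", "IPv6", "CBT"]


def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(1 if c == value else 0 for c in categories)


tcp_vector = one_hot("TCP", PROTOCOLS)
cbt_vector = one_hot("CBT", PROTOCOLS)
```

With seven protocols, a single protocol column becomes seven input values, which illustrates why this encoding enlarges the input of the model.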
The features related to timestamp and forwarding status were dropped because they do not affect
the results, since this work does not have the objective of doing an analysis based on time series.
The forwarding status is always set as “0”, which means no forwarding. It was shown that these
assumptions do not imply changes in the accuracy; thus, these assumptions are considered not to affect
the results of later tests.
Figure 1. Architecture to deploy distributed preprocessing on four machines.
Algorithm 1 presents how the dataset is preprocessed. This task is done by columns because it is
mandatory to consider the different values in each feature (each feature corresponds to a column).
This allows building an index array of the different elements, which is used to calculate the normal
distribution of the values that shape the feature or just to encode them in the desired way as required
in each case.
The whole process is done in an agile way, dividing the different functions correctly and protecting
critical reading/writing operations to provide reliable operation of the software that will be executed in
parallel in the above-mentioned machines.
One of the available threads (launched continuously) manages the execution. This first thread
starts by opening the full dataset file and identifying the column corresponding to each feature.
Then, this first thread starts assigning features for preprocessing to each of the available
threads (lines 6–18 in Algorithm 1) in parallel. Once the tasks finish, each thread returns the result to the
first thread, which assigns more features in parallel until there are no more to process. Finally, the results
are stored in a file that will be ready to be used by the corresponding machine learning algorithm.
Algorithm 1 Preprocessing UGR’16 dataset
1: preprocess_dataset.py
2: open full dataset file
3: drop incomplete rows
4: for every machine available:
5:     create one thread per CPU core
6: START OF DISTRIBUTED PREPROCESSING FUNCTION
7: while not processed features available do:
8:     # Code executed in parallel at each machine
9:     for every free thread in parallel do:
10:        feature to preprocess = random feature from the dataset not processed yet
11:        process_piece(feature to preprocess):
12:            if feature is duration or number of packets or bytes:
13:                code each value as normal distribution of all
14:            if feature is type of traffic:
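The column-wise scheduling that Algorithm 1 describes can be sketched in Python on a single machine. This is only an illustration under stated assumptions: a thread pool stands in for the threads of the distributed machines, the column names are hypothetical, and the distribution of work across several machines is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev

NUMERIC = {"duration", "packets", "bytes"}  # hypothetical column names


def process_feature(name, column):
    """Preprocess one column: normalize numeric features, index-encode the rest."""
    if name in NUMERIC:
        mu, sd = mean(column), pstdev(column)
        return name, [(x - mu) / sd if sd else 0.0 for x in column]
    # dict.fromkeys keeps the first-appearance order of the distinct values
    index = {v: i for i, v in enumerate(dict.fromkeys(column))}
    return name, [index[v] for v in column]


def preprocess(columns, workers=4):
    """Hand each unprocessed column to a free worker and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_feature, n, c) for n, c in columns.items()]
        return dict(f.result() for f in futures)


dataset = {
    "duration": [1.0, 3.0],
    "protocol": ["TCP", "UDP"],
    "flags": ["S", "S"],
}
result = preprocess(dataset)
```

Because each feature is processed independently, no worker needs data owned by another one. For CPU-bound work in CPython, a process pool (`ProcessPoolExecutor`) would probably be a better fit than threads; the thread pool is kept here only to mirror the wording of the algorithm.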
The machines that process the algorithm use their graphics processing units (GPUs) and CPUs
to increase its efficiency. The number and type of features and their preprocessing was addressed in
Section 5.1. Additionally, in that same section, the architecture of the neural network was fixed at 10
input values. The output was defined by a binary result, where 0 means no attack and 1 means attack.
Between the input and output layer, two hidden layers were added. A hidden layer helps to represent
different decisions that could not be directly related to the linear results of the features. The first layer
helps to approximate functions that contain a continuous mapping from one finite space to another,
and the second layer could represent an arbitrary decision boundary to arbitrary accuracy. For this
reason, the neural network that is created has two hidden layers, which makes it capable of identifying
arbitrary solutions.
In addition, some other important parameters that the neural network must have are the activation
function and the kernel initializer function or the optimizer. In this case, the activation function
selected was ReLU; this decision was made following the previous work carried out by the authors
and presented in Reference [18], where a comparative analysis of different activation functions for
MLP is shown.
Thus, the model proposed in this research was developed by using Keras (Alphabet Inc., Mountain
View, CA, USA) and defined by four layers: one input layer, two hidden layers, and one output
layer. The input layer was designed by including one node for each input feature. On the other
hand, the number of hidden layers was determined using Equation (1), selected after evaluating
the four previously proposed equations [18]. Furthermore, in every layer except in the output,
the “softmax” function was used, as it facilitates the probability distribution among a different number
of categories [47]. The kernel initializer function selected was the “normal” and the optimizer was
implemented by an “adam” optimizer [48,49]. EarlyStopping [50] was used in the proposed model in
order to prevent overfitting, with a min_delta of $10^{-3}$, which is the minimum change in the loss
that counts as an improvement from one epoch to the next.
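The role of min_delta can be illustrated with a small, framework-free Python sketch. It is a simplified reimplementation of the stopping rule written for illustration, not the Keras callback itself; the function name and the loss values are hypothetical, and no patience window is modeled.

```python
def early_stopping_epoch(losses, min_delta=1e-3):
    """Return the index of the epoch at which training would stop.

    An epoch counts as an improvement only if its loss is lower than the
    best loss seen so far by more than `min_delta`.
    """
    best = float("inf")
    for epoch, loss in enumerate(losses):
        if best - loss > min_delta:
            best = loss   # real improvement: keep training
        else:
            return epoch  # first epoch without a sufficient improvement
    return len(losses) - 1


history = [0.90, 0.50, 0.30, 0.2995, 0.2990]  # illustrative loss values
stop_at = early_stopping_epoch(history)
```

In this example, the loss still decreases at the fourth epoch, but only by 0.0005, which is below the threshold, so training stops there.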
In addition, unlike neural networks, which are black boxes, the calculations made in a decision tree
are easy to understand because the model is fully transparent [51].
Decision trees follow some specific steps to build the most effective tree to get the best results
in each case. The algorithm looks for the best feature of the dataset using a criterion called an “attribute
selection measure” to split the dataset into different parts, i.e., smaller subsets. This process is
repeated recursively for each new child until one of the stop conditions is met. The possible stop
conditions are the lack of attributes, the lack of subsets, or all the remaining instances in a leaf
belonging to the same class [51]. As can be seen, the attribute selection measure is the cornerstone of the decision trees.
The main idea behind it is to provide each feature with a rank by explaining the input dataset. The one
with the best rank is selected as the splitting feature.
There are different algorithms to determine the rank. The most common one is the “information
gain” which calculates the entropy of the features that shape the dataset in order to measure the
“randomness” of the set. Furthermore, the entropy and “Gini index” are used to measure a weighted
sum of the impurity of each partition of the selected features [52]. Different tests were carried out in
order to reach different results. Some other parameters that can be chosen in a decision tree are the
depth of the tree itself, to make simpler trees that are not overfitted, or the criteria to split the dataset
itself (not the feature), by choosing the best split calculated or by randomizing the process.
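The attribute selection measures mentioned above can be written out in a short Python sketch, using the standard definitions of Shannon entropy, Gini impurity, and information gain. The function names and the toy labels are chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_impurity(partitions, measure):
    """Weighted sum of the impurity of each partition produced by a split."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * measure(p) for p in partitions)


def information_gain(parent, partitions):
    """Entropy of the parent minus the weighted entropy of its partitions."""
    return entropy(parent) - weighted_impurity(partitions, entropy)


parent = ["attack", "attack", "normal", "normal"]
perfect_split = [["attack", "attack"], ["normal", "normal"]]
useless_split = [["attack", "normal"], ["attack", "normal"]]
```

A split that separates the classes completely yields the maximum gain, whereas a split that leaves each partition as mixed as the parent yields none, so the first feature would be ranked higher.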
6. Validation
Table 3. Comparison of the time it takes to process the dataset with distributed preprocessing or locally
for different sizes (in lines) of the dataset.
The results show that the distributed preprocessing architecture achieved significantly better
results than execution in a single machine thanks to the parallelization of tasks. The developed
algorithm makes it easier to open large datasets for preprocessing purposes in machines that do not
have many resources to be used, while the distributed preprocessing architecture makes the execution
run in less time, reducing costs.
In conclusion, the distributed preprocessing architecture reduces the time costs of the dataset
preprocessing, one of the most important problems when dealing with big datasets. A large dataset
containing more than 157,602,189 traffic flows, of which 2,324,955 are attacks, collected during July
2016 [24], with a size of more than 15 GB in compressed format, can be processed in a manageable
amount of time using our distributed architecture proposal. For this reason, in the following sections,
this dataset is the one chosen to train the neural network and the decision tree as it contains a huge
amount of information and different types of attacks. Table 4 summarizes different attacks classified
within the dataset presented in the preprocessed UGR’16 [24].
Table 4. Relation between the type of attacks and the number of their appearances in the
preprocessed dataset.
Table 5. Evaluation of neural networks with the multilayer perceptron (MLP) algorithm. CPU—central
processing unit; GPU—graphics processing unit.
In conclusion, the MLP neural networks cannot be trained this way with large datasets.
The limitations of this architecture make it not scalable and, therefore, impossible to train a model in
a reasonable amount of time. In addition, training it using multiple GPUs, despite being theoretically
faster, is limited by the size of the model itself, which, regardless of the number of graphics processing
units used, must be able to be stored completely in the memory of each of them in order to perform the
training. Furthermore, regarding GPU training, when the training process takes place on multiple
GPUs, as explained above, the CPU is responsible for creating the model, processing the batches,
and assigning them to each GPU. In this process, an overhead is added. Although that overhead is
negligible in small models, it becomes more and more noticeable in large neural networks, increasing
the training time with respect to training in a CPU with many cores, or even making this training
impossible because of the size achieved. This additional overhead is fixed since it serves to relocate the
output information of the training in the final model that the CPU itself is building.
Table 6. Summary of the results achieved using different combinations of algorithms for decision trees
using the fully preprocessed dataset.

                     Quality of split: “Gini”          Quality of split: “Entropy”
Splitter strategy    Time (s)   Accuracy   Error       Time (s)   Accuracy   Error
Best                 2899.448   0.999      0.0012      3423.31    0.999      0.0011
Random               497.19     0.989      0.0120      448.976    0.988      0.0119
When the random splitting criterion is applied, the time needed to make the calculations is heavily
reduced, since the algorithm does not calculate which split is best and instead chooses one random
feature to continue the tree. This is penalized by a lower accuracy and a slightly higher error rate;
the difference is, however, almost negligible.
More important than accuracy and times for training are the threat detection rates, since the IDS
is expected to detect as many security problems as possible. Given the available attacks that were
present in the preprocessed dataset and summarized in Table 4, confusion matrices were obtained for
each of the cases.
The first important conclusion from the results is that the random splitting strategy is not a valid
solution. An IDS cannot have such a high false negative rate: in the worst case, 72.33% of the cases
that really were attacks were detected as “no attack”. The false negative rate was lowered to 6.33%
in the entropy/best combination, which can be classified as acceptable, since, with an IDS,
it is preferable to have a higher rate of false positives rather than false negatives.
In each case, the decision tree was generated so that the criteria followed by the algorithm could
be checked. Thanks to this, it was possible to verify that the main criteria that the decision tree took
were the origin and destination IP addresses.
These IP addresses, coded as an index, were only an anonymized number since this was done in
the dataset itself. It is logical to think that data such as the port of origin and destination, the duration
of the flow, or the amount of data transmitted are more important. For this reason, it was decided
to repeat the test, this time removing the trace of IP addresses from the dataset. In this repetition,
the best results balancing quality and time were again achieved with the “entropy” calculation and
“best” splitting strategy. The time needed to execute this algorithm was 2786.77 seconds. The accuracy
achieved was 0.99614035. Logically, the amount of time needed decreased, while the distribution of
accuracy and error remained the same. The confusion matrix for this case is presented in Table 7.
True label \ Predicted label   DoS       Net Scan   Botnet   Blacklist   Spam      No Attack
DoS                            235,635   0          0        0           0         64
Net Scan                       0         133,500    0        0           0         1424
Botnet                         0         0          16,601   0           0         28,726
Blacklist                      10        4          0        1358        7         133,319
Spam                           0         0          0        0           140,379   7526
In this case, the false negative rate increased to 24%. However, looking at the
decision tree, it is more logical to think that this algorithm is more prepared to predict attacks in more
situations since it is not using IP addresses.
Furthermore, analyzing how the model is predicting attacks, several conclusions can be drawn.
Firstly, the model presents good results in the classification of network denial of service attacks, with
a success rate of 99.97%. Similar cases occur with network scans and spam detection, with 98.94%
and 94.91%, respectively. On the other hand, the model did not find a good relationship between
information in traffic flows and botnets since its success rate was only 36.62%. The model cannot
predict blacklisted traffic, with a success rate of only 1%. These results are summarized in Table 8.
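The success rates quoted above follow from the rows of the confusion matrix in Table 7, as the following Python sketch shows. It assumes that rows are true labels and columns are predicted labels, and that the row whose largest count falls in the Spam column is the spam row; the helper names are hypothetical.

```python
LABELS = ["DoS", "Net Scan", "Botnet", "Blacklist", "Spam", "No Attack"]

# Rows are true labels; columns are predicted labels, in the order of LABELS.
# Treating the fifth row as the spam row is an assumption (see the text above).
MATRIX = {
    "DoS":       [235_635, 0, 0, 0, 0, 64],
    "Net Scan":  [0, 133_500, 0, 0, 0, 1_424],
    "Botnet":    [0, 0, 16_601, 0, 0, 28_726],
    "Blacklist": [10, 4, 0, 1_358, 7, 133_319],
    "Spam":      [0, 0, 0, 0, 140_379, 7_526],
}


def success_rate(label):
    """Percentage of flows of class `label` that were predicted as `label`."""
    row = MATRIX[label]
    return 100.0 * row[LABELS.index(label)] / sum(row)


def false_negative_rate():
    """Percentage of attack flows that were predicted as "No Attack"."""
    missed = sum(row[LABELS.index("No Attack")] for row in MATRIX.values())
    total = sum(sum(row) for row in MATRIX.values())
    return 100.0 * missed / total


rates = {label: success_rate(label) for label in MATRIX}
overall_fnr = false_negative_rate()
```

Under these assumptions, the per-class values agree with the percentages given in the text, and the share of attack flows predicted as "No Attack" comes out at roughly 24%.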
The achieved results were good, and the IDS is expected to perform adequately. The worst rates,
achieved in the predictions of blacklisted traffic, are not a real problem since there are good alternative
methods to detect this kind of attack [53], e.g., methods based on blacklists of malicious IPs built from different
feeds obtained from malicious IP service providers.
7. Discussion
Due to the enormous amount of information needed to determine an attack, limiting the information
contained in cybersecurity datasets stands as an issue for developing detection systems. Consequently,
the information offered by this kind of dataset should be reduced to the most useful to feed the
underlying ML algorithms. For this reason, UGR’16 was selected as a basis dataset in this research, a
dataset that contains information strongly focused on attack detection within real ISP network traffic.
As mentioned throughout this article, to achieve a successful result in detecting attacks and
intrusions by means of an IDS based on AI, a thorough training of the underlying ML model is
necessary. For this process, trying to use as much of the information offered by the dataset as possible is
important to achieve good results.
The time invested in this preprocessing remained below 6500 seconds, a fact that meets the
expectations and objectives defined previously. Table 3 shows the results given by parallelism and
local execution, from which it follows that their comparison in terms of complexity is not parallel.
This is because the time required for the processing of the individual functions is related to the
feature preprocessing encoding in each machine. The model, therefore, demonstrated a correct level
of scalability while maintaining a balance among costs, execution time, and the specific knowledge
required for its execution.
The second objective of this research focused on the training process of an ML model, using
the full set of preprocessed data. This process was approached from two perspectives:
MLP neural networks with different configurations, and decision trees. Table 9 shows a
comparison between both approaches.
Appl. Sci. 2020, 10, 3430 15 of 19
For the first approach, the results show problems related to scalability and the consumption
of computational resources. The DL model implemented required a memory space greater than five
times the size of the data subset, that is, 120 GB. Training on CPUs is possible, despite being slow and
requiring hundreds of thousands of hours to process each epoch. TensorFlow is able to
scale the model to make use of all the available cores in the CPU. However, it is not optimal due to the
number of hours needed. GPUs give better results in terms of time, because several GPUs can be
used at the same time. However, training on several GPUs at once adds a data overhead
to the batches that are sent to each GPU in each epoch, which is needed to merge
the results back into the complete original model. This large overhead makes training large models
much slower. To avoid scalability problems, Keras [54], the Python library for machine learning
applications, which uses TensorFlow (2.0, Santa Clara, CA, USA) [55], provides a method to
perform the training by using a Python generator that collects small portions of the original dataset.
However, the UGR’16 dataset contains more background traffic than any other kind and, therefore,
the most probable event is that the portions on which the algorithm is trained only contain background
traffic, without attacks. Therefore, the model will generate false negatives. Thus, we can conclude
that this kind of architecture is not ready for training MLP networks on large datasets because,
despite using very powerful machines with multiple CPUs, GPUs, and large amounts of memory,
the cost of training remains very high, as shown in Table 4.
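The generator-based training and the imbalance problem described above can be sketched as follows. This is an illustration in plain Python rather than the code used in this work: the record layout, the label names, the batch size, and the 99% background fraction are all assumptions, and the probability estimate further assumes that flows are drawn independently.

```python
def batch_generator(records, batch_size):
    """Yield consecutive batches so that the full dataset is never held in memory."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def is_background_only(batch):
    """True when no record in the batch carries an attack label."""
    return all(label == "background" for _features, label in batch)


def all_background_probability(background_fraction, batch_size):
    """Chance that a batch of independently drawn flows contains no attack."""
    return background_fraction ** batch_size


# Synthetic stream of (features, label) pairs: one attack flow every 250 records.
flows = (([i], "attack" if i % 250 == 0 else "background") for i in range(1000))

batches = list(batch_generator(flows, batch_size=100))
empty = sum(is_background_only(batch) for batch in batches)
print(f"{empty} of {len(batches)} batches contain no attack flow")
print(f"P(no attack in a batch) = {all_background_probability(0.99, 100):.3f}")
```

In this toy stream more than half of the batches carry no attack at all, which is the situation that leads the trained model toward false negatives.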
In contrast to this first approach, the same tests were run on a simpler algorithm,
the decision tree. For this method, a comparison was made of all the algorithms used to build
the decision tree. The results obtained led to several conclusions. By using this kind of algorithm, it
was possible to obtain results on machines with more limited resources and in less time. In the worst
case, the construction of the entire decision tree using the preprocessed dataset took less than 3500 s.
Analyzing the decision tree created from the training, we observed that the IP addresses, both origin
and destination, were taken as a key variable, something that led to unacceptable results. This was
because IP addresses are easily changed in most attacks. Therefore, we decided not to consider
those data during the training. By adapting the training in this way, we obtained results that show
that decision trees achieved a good overall precision. More specifically, the model
classified network denial of service attacks successfully, with a success rate of
99.97%. Similar results were achieved when detecting network scans and spam, with 98.94% and
94.91%, respectively. The accuracy dropped when predicting botnet attacks (36.62%)
and especially when predicting blacklisted traffic (1%). In general, decision trees produced results faster
and more reliably than neural networks.
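As an illustration of the two steps discussed above, the following sketch removes the IP address fields from each flow before training and computes a success rate per class. The field names and the sample labels are hypothetical, and the sketch assumes that "success rate" means the fraction of flows of each class that were labeled correctly (per-class recall), which may differ from the exact metric reported in Table 8.

```python
from collections import Counter

# Hypothetical field names for the origin and destination addresses.
IP_FIELDS = ("src_ip", "dst_ip")


def drop_ip_fields(flow):
    """Return a copy of the flow record without the IP address fields."""
    return {name: value for name, value in flow.items() if name not in IP_FIELDS}


def per_class_success_rate(true_labels, predicted_labels):
    """Fraction of flows of each true class that received the correct label."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


flow = {"src_ip": "192.0.2.1", "dst_ip": "198.51.100.2", "dst_port": 80, "bytes": 512}
print(drop_ip_fields(flow))

# Toy labels only; these are not results from the paper.
y_true = ["dos", "dos", "scan", "botnet", "botnet", "background"]
y_pred = ["dos", "dos", "scan", "background", "botnet", "background"]
rates = per_class_success_rate(y_true, y_pred)
print(rates)
```

Reporting the rate per class rather than a single overall accuracy keeps a weak class, such as botnet traffic here, from being hidden by the dominant classes.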
As stated in Section 2, there are different approaches to classification for IDSs based on anomaly
detection. Table 10 compares the model proposed in this paper with some of those included in
the literature, based on different aspects such as multi-class classification, binary classification (BC),
training time, model proposed, scalability of the software, and quantity of data analyzed.
Despite all the efforts made in this area of research, most of the existing approaches do not analyze
truly large-scale datasets as proposed in this paper. Our solution demonstrated that it is able to train
on up to 14 GB of (compressed) data, detecting new, modern attacks reliably and with high accuracy using
the benchmark dataset UGR’16.
Author Contributions: Conceptualization, V.A.V., X.L.-N., and M.Á.-C.; methodology, X.L.-N. and V.A.V.;
software, X.L.-N.; validation, M.V.-B., D.R., and V.A.V.; investigation, V.A.V., M.V.-B., and X.L.-N.; data curation,
X.L.-N.; writing—original draft preparation, X.L.-N., M.V.-B., D.R., and V.A.V.; writing—review and editing,
X.L.-N., M.V.-B., D.R., and V.A.V.; supervision, V.A.V., M.Á.-C., D.R., M.V.-B., and J.B.; project administration,
V.A.V.; funding acquisition, V.A.V., M.Á.-C., and J.B. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
References
1. Key Challenges. Available online: https://www.weforum.org/centre-for-cybersecurity/home/ (accessed on
15 April 2019).
2. Geluvaraj, B.; Satwik, P.M.; Kumar, T.A. The Future of Cybersecurity: Major Role of Artificial Intelligence,
Machine Learning, and Deep Learning in Cyberspace. In Proceedings of the International Conference on Computer
Networks and Communication Technologies; Springer: Singapore, 2019; pp. 739–747.
3. Tjhai, G.C.; Papadaki, M.; Furnell, S.M.; Clarke, N.L. Investigating the problem of IDS false alarms:
An experimental study using Snort. In Proceedings of the Ifip Tc 11 23rd International Information Security
Conference; Jajodia, S., Samarati, P., Cimato, S., Eds.; Springer: Boston, MA, USA, 2008; Volume 278,
pp. 253–267. ISBN 978-0-387-09698-8.
4. Hu, L.; Li, T.; Xie, N.; Hu, J. False positive elimination in intrusion detection based on clustering. In Proceedings
of the 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie,
China, 15–17 August 2015; pp. 519–523.
5. Mishra, P.; Varadharajan, V.; Tupakula, U.; Pilli, E.S. A Detailed Investigation and Analysis of Using Machine
Learning Techniques for Intrusion Detection. IEEE Commun. Surv. Tutor. 2019, 21, 686–728. [CrossRef]
6. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set.
In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense
Applications, Ottawa, ON, Canada, 8–10 July 2009; pp. 1–6.
7. Gupta, G.P.; Kulariya, M. A Framework for Fast and Efficient Cyber Security Network Intrusion Detection
Using Apache Spark. Procedia Comput. Sci. 2016, 93, 824–831. [CrossRef]
8. Flow-Based Intrusion Detection: Techniques and Challenges | Elsevier Enhanced Reader. Available online:
https://reader.elsevier.com/reader/sd/pii/S0167404817301165?token=E3C74A2C564F117F985E7ED42B8710
D395BFDF2BA31F99DBA65E63B3295E5B3D75000B3FC3E01132C2E06A2ACDE52A92 (accessed on
20 June 2019).
9. Strigl, D.; Kofler, K.; Podlipnig, S. Performance and Scalability of GPU-Based Convolutional Neural Networks.
In Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing; IEEE:
Piscataway, NJ, USA, 2010; pp. 317–324.
10. Maciá-Fernández, G.; Camacho, J.; Magán-Carrión, R.; García-Teodoro, P.; Therón, R. UGR’16: A new dataset
for the evaluation of cyclostationarity-based network IDSs. Comput. Secur. 2018, 73, 411–424. [CrossRef]
11. Hindy, H.; Brosset, D.; Bayne, E.; Seeam, A.; Tachtatzis, C.; Atkinson, R.; Bellekens, X. A Taxonomy and
Survey of Intrusion Detection System Design Techniques, Network Threats and Datasets. arXiv 2018,
arXiv:1806.03517 [cs].
12. MIT Lincoln Laboratory: DARPA Intrusion Detection Evaluation. Available online: https://www.ll.mit.edu/
ideval/data/ (accessed on 22 May 2018).
13. Perona, I.; Arbelaiz Gallego, O.; Gurrutxaga, I.; Martín, J.I.; Muguerza Rivero, J.F.; Pérez, J.M. Generation
of the database gurekddcup. 2017. Available online: https://addi.ehu.es/handle/10810/20608 (accessed on
15 May 2020).
14. Shiravi, A.; Shiravi, H.; Tavallaee, M.; Ghorbani, A.A. Toward developing a systematic approach to generate
benchmark datasets for intrusion detection. Comput. Secur. 2012, 31, 357–374. [CrossRef]
15. CAIDA: Center for Applied Internet Data Analysis. Available online:
http://www.caida.org/home/index.xml (accessed on 11 September 2019).
16. Revathi, S.; Malathi, D.A. A Detailed Analysis on NSL-KDD Dataset Using Various Machine Learning
Techniques for Intrusion Detection. Int. J. Eng. Res. Technol. 2013, 2, 1848–1853.
17. Alrawashdeh, K.; Purdy, C. Toward an online anomaly intrusion detection system based on deep learning.
In Proceedings of the 2016 15th IEEE International Conference on Machine Learning and Applications
(ICMLA), Anaheim, CA, USA, 18–20 December 2016; pp. 195–200.
18. Larriva-Novo, X.A.; Vega-Barbas, M.; Villagra, V.A.; Sanz Rodrigo, M. Evaluation of Cybersecurity Data Set
Characteristics for Their Applicability to Neural Networks Algorithms Detecting Cybersecurity Anomalies.
IEEE Access 2020, 8, 9005–9014. [CrossRef]
19. Shahamiri, S.R.; Binti Salim, S.S. Real-time frequency-based noise-robust Automatic Speech Recognition
using Multi-Nets Artificial Neural Networks: A multi-views multi-learners approach. Neurocomputing 2014,
129, 199–207. [CrossRef]
20. Gaidhane, R.; Vaidya, C.; Raghuwanshi, D.M. Intrusion Detection and Attack Classification using
Back-propagation Neural Network. Int. J. Eng. Res. 2014, 3, 4.
21. Karsoliya, S. Approximating Number of Hidden layer neurons in Multiple Hidden Layer BPNN Architecture.
Int. J. Eng. Trends Technol. 2012, 3, 714–717.
22. Belouch, M.; El Hadaj, S.; Idhammad, M. Performance evaluation of intrusion detection based on machine
learning using Apache Spark. Procedia Comput. Sci. 2018, 127, 1–6. [CrossRef]
23. Tu, D.Q.; Kayes, A.S.M.; Rahayu, W.; Nguyen, K. ISDI: A New Window-Based Framework for Integrating
IoT Streaming Data from Multiple Sources. In Proceedings of the Advanced Information Networking and
Applications; Barolli, L., Takizawa, M., Xhafa, F., Enokido, T., Eds.; Springer International Publishing: Cham,
Switzerland, 2020; pp. 498–511.
24. Rajagopal, S.; Kundapur, P.P.; Hareesha, K.S. A Stacking Ensemble for Network Intrusion Detection Using
Heterogeneous Datasets. Secur. Commun. Netw. 2020, 2020, 4586875. [CrossRef]
25. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
26. Brownlee, J. Supervised and unsupervised machine learning algorithms. Mach. Learn. Mastery 2016, 16.
27. Gardner, M.W.; Dorling, S.R. Artificial neural networks (the multilayer perceptron)—A review of applications
in the atmospheric sciences. Atmos. Environ. 1998, 32, 2627–2636. [CrossRef]
28. Heaton, J. AIFH, Volume 3: Deep Learning and Neural Networks; Heaton Research, Inc, 2015; ISBN 1-5057-1434-6.
29. Vinayakumar, R.; Soman, K.P.; Poornachandran, P. Applying convolutional neural network for network
intrusion detection. In Proceedings of the 2017 International Conference on Advances in Computing,
Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; pp. 1222–1228.
30. Haykin, S. Neural Networks: A Comprehensive Foundation, 1st ed.; Prentice Hall PTR: Upper Saddle River, NJ,
USA, 1994; ISBN 978-0-02-352761-6.
31. Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way. Available online: https://
towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
(accessed on 20 June 2019).
32. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks.
In Advances in Neural Information Processing Systems 25; Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q.,
Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
33. Breiman, L. Classification and Regression Trees; Routledge: London, UK, 2017; ISBN 978-1-315-13947-0.
34. Liberman, N. Decision Trees and Random Forests. Available online: https://towardsdatascience.com/decision-
trees-and-random-forests-df0c3123f991 (accessed on 20 June 2019).
35. Wang, F.; Wang, Q.; Nie, F.; Yu, W.; Wang, R. Efficient tree classifiers for large scale datasets. Neurocomputing
2018, 284, 70–79. [CrossRef]
36. Peng, K.; Leung, V.C.M.; Zheng, L.; Wang, S.; Huang, C.; Lin, T. Intrusion Detection System Based on Decision
Tree over Big Data in Fog Environment. Available online: https://www.hindawi.com/journals/wcmc/2018/
4680867/abs/ (accessed on 21 November 2019).
37. Gupta, P. Decision Trees in Machine Learning. Available online: https://towardsdatascience.com/decision-
trees-in-machine-learning-641b9c4e8052 (accessed on 20 June 2019).
38. Decision Tree Classification in Python. Available online: https://www.datacamp.com/community/tutorials/
decision-tree-classification-python (accessed on 31 May 2019).
39. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
40. Nearest Neighbor Pattern Classification. Available online: https://scholar.googleusercontent.com/scholar?q=
cache:0XrqZfG45o0J:scholar.google.com/+k+nearest+neighbor&hl=es&as_sdt=0,5&as_vis=1 (accessed on
20 June 2019).
41. Gandhi, R. K Nearest Neighbours—Introduction to Machine Learning Algorithms. Available online: https:
//towardsdatascience.com/k-nearest-neighbours-introduction-to-machine-learning-algorithms-18e7ce3d802a
(accessed on 20 June 2019).
42. Aldweesh, A.; Derhab, A.; Emam, A.Z. Deep learning approaches for anomaly-based intrusion detection
systems: A survey, taxonomy, and open issues. Knowl. Based Syst. 2020, 189, 105124. [CrossRef]
43. Moradi, M.; Zulkernine, M. A Neural Network Based System for Intrusion Detection and Classification of
Attacks. In Proceedings of the IEEE International Conference on Advances in Intelligent Systems-Theory
and Applications, Luxembourg-Kirchberg, Luxembourg, 15–18 November 2004.
44. Jalberca. Anomaly-based Intrussion Detection System with machine learning and distributed execution.
Available online: https://github.com/jalberca/tfm-ids_and_machine_learning (accessed on 13 May 2020).
45. Soysal, M.; Schmidt, E.G. Machine learning algorithms for accurate flow-based network traffic classification:
Evaluation and comparison. Perform. Eval. 2010, 67, 451–467. [CrossRef]
46. Chiba, Z.; Abghour, N.; Moussaid, K.; El Omri, A.; Rida, M. A novel architecture combined with optimal
parameters for back propagation neural networks applied to anomaly network intrusion detection. Comput. Secur.
2018, 75, 36–58. [CrossRef]
47. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941.
48. Brownlee, J. Gentle Introduction to the Adam Optimization Algorithm for Deep Learning. Machine Learning
Mastery. Available online: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-
learning/ (accessed on 13 May 2020).
49. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980 [cs].
50. Callbacks-Keras Documentation. Available online: https://keras.io/callbacks/#earlystopping (accessed on
21 November 2019).
51. Bouzida, Y.; Cuppens, F. Neural networks vs. decision trees for intrusion detection. In Proceedings of
the IEEE/IST workshop on monitoring, attack detection and mitigation (MonAM), Citeseer, Tuebingen,
September 2006; 28, p. 29.
52. White, A.P.; Liu, W.Z. Bias in information-based measures in decision tree induction. Mach. Learn. 1994, 15,
321–329. [CrossRef]
53. Ghafir, I.; Prenosil, V. Blacklist-based malicious IP traffic detection. In Proceedings of the 2015 Global
Conference on Communication Technologies (GCCT), Thuckalay, India, 23–24 April 2015; pp. 229–233.
54. Keras Documentation. Available online: https://keras.io/ (accessed on 4 June 2018).
55. TensorFlow. Available online: https://www.tensorflow.org/?hl=es (accessed on 14 June 2019).
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).