A Review On Machine Learning Methods For Intrusion
DOI: 10.54254/2755-2721/27/20230148
Man Ni
School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou,
215123, China
Abstract. With increasing access to the Internet and the development of information
technology, concerns about computer security have grown considerably.
Computer crime uses various methods to undermine information privacy and system
integrity, and has caused losses of millions to trillions in the past few years. It is urgent to improve
security algorithms and models so that they form a thorough structure for preventing attacks. Within
this prevention structure, the intrusion detection system (IDS) plays a vital role in monitoring
and detecting malicious behaviours. However, because the variety of threats is increasing rapidly,
traditional algorithms are no longer sufficient, and new methods should be brought into IDS to improve
its functionality. Deep learning (DL) and machine learning (ML) are recently developed techniques
that can process data on a very large scale. They can also make decisions and
predictions without being explicitly programmed, and these features make them suitable for improving
and enhancing IDS. This article mainly focuses on a review of ML methods used in IDS
construction.
1. Introduction
In the past decades, the development and use of the Internet have grown steadily. In 2022, the
Internet was used by 82% of urban residents worldwide, and the percentage of use in
rural areas increased from 31% in 2019 to 46% in 2022 [1]. Internet use today is therefore
extensive. With the growth in the number of Internet users and devices,
the damage caused by cybercrime has become more catastrophic as well. Cybercrime is defined as the
use of a computer or the Internet for criminal purposes, and commonly used attacks include malware
blackmail, invasion of privacy, identity theft, denial of service (DoS) and so on. In economic
terms, the damage caused by cybercrime rose from $2 trillion in 2015 to $6 trillion annually by
2021. The World Economic Forum (WEF) ranked cyberattacks as the third global risk for
2018, and this status is not expected to decline in the future [2].
Under the threats mentioned above, the need for reliable protection systems is acute and
reasonable. At the operational level, an intrusion detection system (IDS) complements the firewall;
its main purpose is to detect malicious intrusions that are highly likely to undermine the
functions of a network or system [3]. IDS can be categorized into two types in terms of deployment
position: network-based IDS (NIDS) and host-based IDS (HIDS). While HIDS performs like a
© 2023 The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0
(https://creativecommons.org/licenses/by/4.0/).
Proceedings of the 2023 International Conference on Software Engineering and Machine Learning
monitor stationed on a single machine, such as the host of one personal computer, NIDS can supervise all
the traffic on the network, including every packet that passes through. Therefore, for a large-scale
network, NIDS is more commonly used. Detection is fundamental to both types of IDS, and
with respect to mechanism, the methods employed for identification fall into two classes:
anomaly-based detection and signature-based detection. The main difference between
them is that signature-based detection discovers known malicious intrusions, whereas anomaly-based
detection can uncover new intrusions by flagging anything abnormal in the traffic. An anomaly
is defined by how far its attributes deviate from the baseline. Since supervision by means of signature
detection depends on an up-to-date database of recorded malicious behaviour signatures, it tends to
lag behind and cannot cover all intrusions, especially novel ones. Therefore, the
implementation of anomaly-based detection is crucial. The root mechanism of anomaly-based
detection is to build, by studying the traffic on the network, a baseline profile that represents the
behaviours outside the alerting range, that is, normal behaviour. After this learning phase, an
anomaly-based IDS is ready to detect and supervise by comparing the current traffic with the normal
baseline [3].
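The baseline mechanism described above can be illustrated with a minimal sketch. It assumes that traffic is summarized by a single number (packets per second), that the baseline profile is simply the mean and standard deviation of normal observations, and that an offset of more than three standard deviations counts as abnormal. The sample values, the function names and the threshold are all hypothetical and are not taken from the reviewed systems.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Learn a baseline profile (mean, standard deviation) from normal traffic."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose offset from the baseline exceeds `threshold` deviations."""
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any departure from the constant value is abnormal.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical packets-per-second counts observed during normal operation.
normal_traffic = [98, 102, 100, 97, 103, 101, 99, 100]
baseline = build_baseline(normal_traffic)

print(is_anomalous(101, baseline))  # close to the baseline
print(is_anomalous(450, baseline))  # far from the baseline
```

A practical anomaly-based IDS would learn a profile over many traffic attributes rather than one, which is where the ML methods reviewed in this article come in.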
However, since the volume of data on the Internet is extremely large and changes rapidly, it is
impossible to set every baseline profile for anomaly-based detection manually. Therefore, the demand
for an effective self-learning algorithm arose. Machine learning (ML) is a branch of artificial
intelligence (AI) whose algorithms are trained on data to construct
models that are able to make selections and decisions by themselves, rather than depending on specific
program commands [4]. ML has already proved to be a formidable approach in several domains, such
as consumer services, control of logistics chains, biology and computer science [5]. Deep learning (DL)
is a sub-branch of ML, and applying it to anomaly-based IDS has become a trend [6]. ML can handle
a huge amount of data when supervising and detecting network traffic [7]. Applying ML to early
detection in the cyber security field is an effective way
to reveal new attacks [8]. From 2009 to 2014, ML models were often used to detect attacks in cloud
security, malware, and malicious intrusions. After the rise of DL in 2013, this trend grew
considerably, and by 2020 the use of ML and DL was widespread in the computer
security domain [9].
The main objective of this article is to review deep learning and machine learning as employed in
intrusion detection systems. Section I explains the concept and classification of IDS, addressing
signature-based and anomaly-based IDS. Section II presents the principles of
deep learning and machine learning, with several frequently used techniques in detail. The proposed
applications and innovations of machine learning in IDS are presented in section III, and section IV
discusses the challenges faced by machine learning-based IDS. Finally, the conclusion gives a brief
summary of the whole article.
Since the comparison requires high precision, SIDS achieves high accuracy when detecting
known threats, which is why SIDS has been broadly deployed in the Internet of Things industry
and many other fields [10]. The advantages of SIDS lie in two aspects: wide deployment
and simple but effective detection of known attacks. But the disadvantages of SIDS are also evident.
Its lack of flexibility when facing a new threat leads to a high false negative ratio. In addition, as
the database of malicious signatures grows, searching for a signature becomes slower and the
need for updates increases. The most distinct challenge SIDS faces is keeping the database
updated and configuring the limitation profiles of SIDS in a suitable manner [11].
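The trade-off between accurate detection of known attacks and false negatives on new ones can be seen in a small sketch. As a simplifying assumption, a signature here is the SHA-256 digest of a whole payload, whereas deployed systems usually match patterns rather than exact digests. The payloads and the names `signature_db` and `sids_check` are hypothetical.

```python
import hashlib

def signature_of(payload: bytes) -> str:
    """Reduce a payload to a fixed-length signature (assumed here: SHA-256)."""
    return hashlib.sha256(payload).hexdigest()

# Hypothetical database of signatures recorded from previously seen attacks.
known_bad_payloads = [b"DROP TABLE users;--", b"GET /../../etc/passwd"]
signature_db = {signature_of(p) for p in known_bad_payloads}

def sids_check(payload: bytes) -> bool:
    """Return True when the payload matches a recorded malicious signature."""
    return signature_of(payload) in signature_db

# A recorded attack is detected.
print(sids_check(b"DROP TABLE users;--"))
# A slightly altered variant has no recorded signature and is missed,
# which is the false negative discussed above.
print(sids_check(b"DROP TABLE users; --"))
```

The sketch also shows why the database has to be updated continuously: the variant is only detected once its own signature has been recorded.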
3.2.1. Decision tree. A decision tree is a structure composed of nodes and branches, and its
aim is to reach decisions through steps such as splitting, stopping, and pruning. In the
figure, every node is labelled with a name, and according to their position the nodes are
categorized into three types: root node, internal node and end node. The root node, or decision node,
represents a choice that produces two or more sub-nodes. An internal node, or chance
node, lies between the other two types and represents one possible choice. The leaf node,
also called an end node, represents one of the final decisions made by the decision tree. Every pathway
from the decision node to an end node is one branch, which represents
one outcome of the decision tree. The ‘if-then’ rule is one way to express these pathways [16].
Several important decision tree algorithms are already used in practice, in particular C4.5,
ID3 and CART [17].
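The correspondence between pathways and ‘if-then’ rules can be sketched as follows. The tree is written by hand rather than learned by C4.5, ID3 or CART, and the feature names (`failed_logins`, `bytes_sent`), thresholds and labels are hypothetical values chosen only for illustration.

```python
# Each internal node tests one feature against a threshold; each end node
# holds a final decision. The whole structure is a hand-written assumption.
tree = {
    "feature": "failed_logins", "threshold": 3,
    "left": {"label": "normal"},
    "right": {
        "feature": "bytes_sent", "threshold": 10000,
        "left": {"label": "normal"},
        "right": {"label": "intrusion"},
    },
}

def classify(node, record):
    """Follow one pathway from the root node down to an end node."""
    while "label" not in node:
        goes_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]

def rules(node, conditions=()):
    """List every root-to-end-node pathway as an 'if-then' rule."""
    if "label" in node:
        premise = " and ".join(conditions) or "always"
        return [f"if {premise} then {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    return (rules(node["left"], conditions + (f"{feature} <= {threshold}",))
            + rules(node["right"], conditions + (f"{feature} > {threshold}",)))

print(classify(tree, {"failed_logins": 5, "bytes_sent": 20000}))
for rule in rules(tree):
    print(rule)
```

Each printed rule corresponds to exactly one end node, so a tree with three end nodes yields three rules.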
3.2.2. Neural networks. An artificial neural network (ANN) is a simulation based on the
biological nervous system of the human brain, and it is already applied in practice to tasks such as
intrusion detection, data classification and optimization. ANN achieves high accuracy in
information management and can handle a large number of inputs, much like the human brain.
A large number of interconnected elements act like neurons to reach specific solutions
[18]. DL is essentially an ANN with multiple complex layers that are linked with each other [19]. This characteristic
enables DL to process information with many variables and layers within a single basic network
architecture.
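How interconnected neuron-like elements are stacked into layers can be sketched with a forward pass through a small fully connected network. This is an illustration under stated assumptions: the sigmoid activation, the layer sizes, the weights and the input features are hypothetical, and the network is untrained, so its output score carries no real meaning.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs plus a bias."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

# Hypothetical, untrained parameters: 3 inputs -> 2 hidden neurons -> 1 output.
network = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),
    ([[1.2, -0.7]], [0.05]),
]
features = [0.9, 0.1, 0.4]  # hypothetical normalized traffic attributes
score = forward(features, network)[0]
print(f"output score: {score:.3f}")
```

Adding further `(weights, biases)` pairs to `network` deepens the model without changing `forward`, which reflects the idea of many layers sharing one basic architecture.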
Figure 2. SVM.
3.2.3. SVM. As shown in Figure 2, the green and blue points are data points belonging to two separate
classes, and the two support vectors are the data points from the two classes that lie closest to each
other. The main aim of this algorithm is to locate the best separating hyperplane between the two
classes by maximizing the margin between the support vectors. The Support Vector Machine,
abbreviated as SVM, is a method capable of solving this problem [20]. Depending on the kernel
function, SVM can be classified into linear and non-linear types. With regard to detection modes,
there are two forms: one-class and multi-class [21, 22]. Training an SVM also requires a large
amount of time.
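The idea of choosing the hyperplane with the largest margin can be made concrete with a small sketch. It is not an SVM solver: instead of optimizing, it compares a few hand-picked candidate hyperplanes and keeps the one whose closest point is farthest away. The data points and the candidates are hypothetical.

```python
import math

# Hypothetical two-dimensional points with class labels +1 and -1.
points = [(2, 2), (3, 3), (2, 3), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

def signed_distance(x, y, w, b):
    """Distance of point x from the hyperplane w.x + b = 0, positive when
    the point lies on the side matching its label y."""
    return y * (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(*w)

def margin(w, b):
    """Smallest signed distance over all points (negative if any is misclassified)."""
    return min(signed_distance(x, y, w, b) for x, y in zip(points, labels))

# Hypothetical candidate hyperplanes, each given as (w, b).
candidates = [((1, 0), -1.5), ((0, 1), -1.5), ((1, 1), -2.5)]

best_w, best_b = max(candidates, key=lambda c: margin(*c))
best_margin = margin(best_w, best_b)

# The points that attain the margin play the role of support vectors.
support_vectors = [x for x, y in zip(points, labels)
                   if math.isclose(signed_distance(x, y, best_w, best_b), best_margin)]

print(best_w, best_b, round(best_margin, 3), support_vectors)
```

An actual SVM searches over all hyperplanes rather than a fixed list, which is one reason training can take a long time on large datasets.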
5.1. Dataset
Several frequently used datasets, such as KDD 99, NSL-KDD and ISOT, are outdated
nowadays. For example, KDD 99 is used for training in the two pieces of research mentioned in
section III, but the results of KDD Cup 99 were published in 2000, which means this dataset was
constructed over 20 years ago [26]. It is therefore possible that this dataset is redundant and unable
to cope with new attacks. On the other hand, the datasets that cover the newest malicious behaviours are
private at present, while public datasets often have redundant and anonymous attributes, so
various issues exist [23]. Therefore, valid and open datasets are required.
6. Conclusion
In this article, the mechanisms of two types of IDS are introduced in the first section, with their
comparison, advantages and disadvantages in detail. Section II focuses on the general concepts of
deep learning and machine learning, and three basic approaches of machine learning are explained and
examined. Based on the previous two sections, the third section covers the applications and
innovations from two pieces of research. Finally, the challenges exposed are considered and several
urgent requirements are mentioned as well.
References
[1] ITU. (n.d.). Internet use in urban and rural areas. Retrieved March 2, 2023, from
https://www.itu.int/itu-d/reports/statistics/2022/11/24/ff22-internet-use-in-urban-and-rural-areas/
[2] Nguyen, T. (2023, January 6). A review of Cyber Crime. Retrieved March 3, 2023, from
https://dzarc.com/social/article/view/244
[3] Rao, U., & Nayak, U. (1970, January 01). Intrusion detection and prevention systems. Retrieved
March 3, 2023, from https://link.springer.com/chapter/10.1007/978-1-4302-6383-8_11#Abs1
[4] Dua, S., & Du, X. (2011). Data Mining and machine learning in Cybersecurity. Boca Raton,
FL: CRC Press.
[5] Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, Perspectives, and
prospects. Science, 349(6245), 255-260. doi:10.1126/science.aaa8415
[6] Ahmad, Z., Shahid Khan, A., Wai Shiang, C., Abdullah, J., & Ahmad, F. (2020). Network
intrusion detection system: A systematic study of machine learning and Deep Learning
Approaches. Transactions on Emerging Telecommunications Technologies, 32(1).
doi:10.1002/ett.4150
[7] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
doi:10.1038/nature14539
[8] Fraley, J. B., & Cannady, J. (2017). The promise of machine learning in Cybersecurity.
SoutheastCon 2017. doi:10.1109/secon.2017.7925283
[9] Prasad, R., & Rohokale, V. (2019). Artificial Intelligence and machine learning in cyber
security. Springer Series in Wireless Technology, 231-247. doi:10.1007/978-3-030-31703-4_16
[10] Ioulianou, P., Vassilakis, V., Moscholios, I., & Logothetis, M. (2018, August 31). A signature-
based intrusion detection system for the internet of things. Retrieved March 3, 2023, from
https://www.ieice.org/publications/proceedings/summary.php?iconf=ICTF&session_num=SESSION02&number=SESSION02_3&year=2018
[11] Folorunso, O., Ayo, F. E., & Babalola, Y. E. (2016). CA-NIDS: A network intrusion
detection system using combinatorial algorithm approach. Journal of Information Privacy and
Security, 12(4), 181-196. doi:10.1080/15536548.2016.1257680
[12] Hamid, Y., Sugumaran, M., & Journaux, L. (2016). Machine learning techniques for
intrusion detection. Proceedings of the International Conference on Informatics and Analytics.
doi:10.1145/2980258.2980378
[13] Buczak, A. L., & Guven, E. (2016). A survey of data mining and machine learning methods
for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2),
1153-1176. doi:10.1109/comst.2015.2494502
[14] Purushotham, S., Meng, C., Che, Z., & Liu, Y. (2018). Benchmarking deep learning models
on large healthcare datasets. Journal of Biomedical Informatics, 83, 112-134.
doi:10.1016/j.jbi.2018.04.007
[15] Fernandez Maimo, L., Perales Gomez, A. L., Garcia Clemente, F. J., Gil Perez, M., &
Martinez Perez, G. (2018). A self-adaptive deep learning-based system for anomaly detection
in 5G networks. IEEE Access, 6, 7700-7712. doi:10.1109/access.2018.2803446
[16] Song, Y., & Lu, Y. (2015, April 25). Decision tree methods: Applications for classification and
prediction. Retrieved March 3, 2023, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466856/
[17] Sharma, H., & Kumar, S. (2016). A survey on decision tree algorithms of classification in
data mining. International Journal of Science and Research (IJSR), 5(4), 2094-2097.
[18] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., & Arshad, H.
(2018). State-of-the-art in Artificial Neural Network Applications: A survey. Heliyon, 4(11).
doi:10.1016/j.heliyon.2018.e00938
[19] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional