Article
Multi-Class Intrusion Detection Based on Transformer for IoT
Networks Using CIC-IoT-2023 Dataset
Shu-Ming Tseng 1, * , Yan-Qi Wang 1 and Yung-Chung Wang 2
1 Department of Electronic Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
2 Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
* Correspondence: [email protected]
Abstract: This study uses deep learning to explore network intrusion detection for the Internet of
Things (IoT) based on the CIC-IoT-2023 dataset, which contains extensive data from real-life IoT
environments. Based on this dataset, we propose an effective intrusion detection method. We apply
seven deep learning models, including a Transformer, to analyze network traffic characteristics and
identify abnormal behavior and potential intrusions through binary and multi-class classification.
Compared with other papers, we not only use a Transformer model but also consider the model's
performance in multi-class classification. Although the accuracy of the Transformer model in binary
classification is lower than that of the DNN and CNN + LSTM hybrid models, it achieves better results
in multi-class classification. On the ToN_IoT dataset, the binary classification accuracy of our model
is 0.46 percentage points higher than that of a paper that also uses a Transformer. In multi-class
classification, our best-performing model is the Transformer, which reaches 99.40% accuracy. This is
3.8, 0.65, and 0.29 percentage points higher than the 95.60%, 98.75%, and 99.11% reported in papers
using the same dataset, respectively.
Ahmed et al. [7] compared their proposed Transformer architecture with RNN and LSTM on
binary classification using the ToN_IoT dataset released in 2020. The results show that the
proposed Transformer model performs well in terms of accuracy and precision, with
an accuracy of 87.79%.
References [7,8] discuss the time complexity of some of the models in our paper, such
as RNN, CNN, and LSTM. Reference [6] covers the time complexity of most of the models
used in our paper, but on a different dataset.
He et al. [9] proposed a transferable and adaptive network intrusion detection system
(NIDS) based on deep reinforcement learning. Their accuracy reached 99.60% and 95.60% in
the binary and multi-class classification of CIC-IoT-2023, respectively. Jony
et al. [10] used LSTM to conduct an experimental evaluation of multi-class classification
on CIC-IoT-2023, reaching an accuracy of 98.75%. Jaradat et al. [11] used four
different machine learning methods to classify network attacks in CIC-IoT-2023, but they
did not mention which classification tasks they used; among the four methods, Gradient Boost
achieved the highest accuracy of 95%. Among the above-mentioned papers, only Abbas et al. [1]
dealt with the problem of data imbalance in the dataset. Table 1 summarizes the key
points of the above papers.
The effectiveness of machine learning-based intrusion detection
systems (ML-IDSs) depends largely on the quality of the dataset [12]. In this paper, we use
the CIC-IoT-2023 dataset [13], released in 2023, to conduct IDS experiments. CIC-IoT-2023
is a unique and comprehensive collection of data designed specifically for IoT
attacks. We use seven models, namely DNN, CNN, RNN, LSTM, CNN + LSTM,
CNN + RNN, and Transformer, to identify whether traffic is malicious. The classification
tasks cover binary classification and multi-class classification. The main contributions of
this study are detailed below.
(1) We use the CIC-IoT-2023 dataset [13], which was also used by Abbas et al. [1]. This is
currently the largest collection of IoT data recorded by real IoT devices. The dataset
contains 46,686,579 entries and as many as 33 attack types. Most of the examples in
this dataset are related to common malicious attacks:
DDoS and DoS attacks [14];
(2) We not only use the six DL models used in [6], but also use a Transformer model [15]
to handle binary and multi-class classification tasks. Compared with [1,7], we further
implement multi-class classification with our model;
(3) On the ToN_IoT dataset, our Transformer model achieved an
accuracy of 88.25%, which is 0.46 percentage points higher than the 87.79% of [7];
(4) Compared with [9,10,13], which also use the CIC-IoT-2023 dataset [16,17], our
Transformer model reaches 99.40% accuracy in multi-class classification;
compared with 95.60% [9], 98.75% [10], and 99.11% [13], our results
are 3.8, 0.65, and 0.29 percentage points higher, respectively.
Figure 1. Architecture diagram of this paper.
2.1. CIC-IoT-2023
As of 2023, CIC-IoT-2023 stands out as the largest IoT dataset [16], derived from real
IoT devices. The dataset contains data from 105 IoT devices, documenting 33 recorded
attacks. Notably, these attacks were launched by malicious IoT devices targeting other IoT
devices. In addition, CIC-IoT-2023 also contains multiple attack types that do not exist in
other IoT datasets.
Table 2 provides the number of samples for each label, including benign traffic. This dataset
contains a total of 46 features and 1 label. Different from the 84 features of CSE-CIC-IDS2018,
CIC-IoT-2023 has 37 fewer features. In this experiment, no specific feature screening was
performed, and all features were used directly to conduct the experiment.
Future Internet 2024, 16, 284 4 of 25
CIC-IoT-2023 Features
CIC-IoT-2023 has 46 features and those features are shown in Table 3.
Table 3. Features of CIC-IoT-2023.
No. Feature Name
1 Flow duration
2 Header Length
3 Protocol
4 Type
5 Duration
6 Rate Mrate Drate
7 fin flag number
8 syn flag number
9 rst flag number
10 psh flag number
11 ack flag number
12 ece flag number
13 cwr flag number
14 ack count
15 syn count
16 fin count
17 urg count
18 rst count
19 HTTP
20 HTTPS
21 DNS
22 Telnet
23 SMTP
24 SSH
25 IRC
26 TCP
27 UDP
28 DHCP
29 ARP
30 ICMP
31 IPv
32 LLC
33 Tot sum
34 Min
35 Max
36 AVG
37 Std
38 Tot size
39 IAT
40 Number
41 Magnitude
42 Radius
43 Covariance
44 Variance
45 Weight
46 Flow duration
We use all of the above features because none of them is redundant. Keeping the full
feature set avoids discarding information that could help classification accuracy.
Figure 2. Distribution of converted labels containing benign traffic (x-axis: label, in the order DDoS, DoS, Mirai, Benign, Spoofing, Recon, Web-Based, BruteForce; y-axis: quantity, from 0 to 40,000,000).
Table 4. The number of neurons and units of each of the neural networks.
The various architectures of the neural networks are shown in Figure 3. For brevity, the
figure shows only one layer of each deep learning architecture, but we
actually conducted experiments using both one-layer and three-layer stacked architectures. At the
output layer, we use an activation function suited to the classification
task: Sigmoid for binary classification and Softmax for multi-class classification.
We describe the detailed parameter counts of each neural network in the
following sections.
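As a minimal illustration of the two output activations mentioned above, the following NumPy sketch applies Sigmoid to a single logit (binary case) and Softmax to a vector of eight logits (one per converted label group in Figure 2). The logit values are hypothetical and chosen only for illustration; this is not the code used in the experiments.

```python
import numpy as np

def sigmoid(z):
    """Map a logit to the probability of the positive (attack) class."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map a vector of logits to a probability distribution over classes."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits, for illustration only.
binary_logit = np.array([0.0])
multi_logits = np.array([2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3, -0.2])  # 8 classes

p_attack = sigmoid(binary_logit)             # one probability for the binary task
p_classes = softmax(multi_logits)            # eight probabilities that sum to 1
predicted_class = int(np.argmax(p_classes))  # index of the most likely class
```

In the binary case a threshold (commonly 0.5) turns the probability into a decision, whereas in the multi-class case the class with the largest probability is selected.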
Figure 3. (a) Architecture diagram of DNN, (b) architecture diagram of RNN, (c) architecture diagram of CNN, (d) architecture diagram of LSTM, (e) architecture diagram of CNN + RNN, and (f) architecture diagram of CNN + LSTM.
3.1.1. DNN
The architecture of DNN is shown in Figure 3a. It mainly consists of the input
Dense layer, Batch Normalization (BN) layer, Dropout layer, Flatten layer, and output
Dense layer. The number of parameters in each layer and the corresponding number of
nodes are shown in Table 5. To reduce overfitting, we add a BN
layer and a Dropout layer to each hidden layer: the BN layer normalizes each batch during
training, and the Dropout layer randomly discards a fixed proportion of neurons in each layer.
Both help prevent neurons from becoming overly dependent on certain features.
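The two regularization steps described above can be sketched in NumPy as follows. This is an illustrative simplification rather than the layers used in the experiments: the batch normalization here omits the learned scale and shift, the `eps` value is an assumption, and the activations are random placeholders. The batch size of 1024 and dropout rate of 0.1 follow the hyperparameters reported in Section 4.1.

```python
import numpy as np

def batch_norm(x, eps=1e-3):
    """Normalize each feature over the batch (training-mode statistics only)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of activations and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
# Placeholder hidden activations: a batch of 1024 samples with 256 neurons.
activations = rng.normal(loc=5.0, scale=3.0, size=(1024, 256))
normalized = batch_norm(activations)
dropped = dropout(normalized, rate=0.1, rng=rng)
```

Rescaling the surviving activations by 1 / (1 - rate) keeps their expected value unchanged, so no additional correction is needed when dropout is disabled at inference time.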
Table 5. Number of parameters of DNN.
Layers  Neurons  Parameters (Binary)  Parameters (Multi-Class)
1       256      13,313               15,112
1       512      26,625               30,216
1       768      39,937               45,320
3       256      19,521               19,976
3       512      63,617               64,520
3       768      146,945              148,744
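The one-layer DNN parameter counts above can be checked with a short calculation. The sketch below assumes 46 input features, one Dense layer followed by a BN layer that contributes four parameters per neuron (scale, shift, and the two moving statistics, as Keras counts them), and an output Dense layer with 1 unit for binary or 8 units for multi-class classification. These layer details are assumptions rather than statements from the text; they are used because they reproduce the reported one-layer totals. Dropout and Flatten layers have no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """Gamma, beta, moving mean, and moving variance for each feature."""
    return 4 * n_features

def one_layer_dnn_params(n_features, n_neurons, n_outputs):
    """Total parameters: Dense -> BatchNorm -> (Dropout, Flatten) -> Dense."""
    return (dense_params(n_features, n_neurons)
            + batch_norm_params(n_neurons)
            + dense_params(n_neurons, n_outputs))

N_FEATURES = 46  # number of CIC-IoT-2023 features
for neurons in (256, 512, 768):
    binary = one_layer_dnn_params(N_FEATURES, neurons, n_outputs=1)
    multi = one_layer_dnn_params(N_FEATURES, neurons, n_outputs=8)
    print(neurons, binary, multi)
```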
3.1.2. RNN
The architecture of RNN is shown in Figure 3b. Similar to DNN, it consists of a
Simple RNN layer, a BN layer, and a Dropout layer, but it has no Flatten layer. This is
because the RNN layer is designed to process sequence input, such as a text sentence or a
time series, so there is no need
to add a Flatten layer to convert the dimensions of the data. The number of parameters in
each layer and the corresponding number of nodes are shown in Table 6.
Table 6. Number of parameters of RNN.
Layers  Neurons  Parameters (Binary)  Parameters (Multi-Class)
1       256      78,849               80,648
1       512      288,769              292,360
1       768      629,761              635,144
3       256      44,097               44,552
3       512      161,921              162,824
3       768      343,553              345,352
3.1.3. CNN
The architecture of CNN is shown in Figure 3c. It mainly consists of Conv1D
and MaxPooling layers. Unlike DNN and RNN, where each hidden layer contains a BN
layer and a Dropout layer, CNN only introduces a BN layer and a Dropout layer before the
output layer. This design choice reflects the effectiveness of MaxPooling layers 1
and 2 in preventing overfitting: they condense the features extracted by the convolution,
emphasizing key data and suppressing irrelevant noise. Table 7 lists the
number of parameters per layer and the corresponding number of nodes of CNN.
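To make the roles of the convolution and pooling steps concrete, the sketch below applies a single 1-D kernel followed by non-overlapping max pooling to a toy signal. The signal, kernel, and pool size are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide a 1-D kernel over x without padding (cross-correlation, as in DL libraries)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size):
    """Keep the maximum of each non-overlapping window; an incomplete tail window is dropped."""
    n_windows = len(x) // pool_size
    return x[:n_windows * pool_size].reshape(n_windows, pool_size).max(axis=1)

signal = np.array([0.0, 1.0, 0.0, 5.0, 0.0, 1.0, 0.0, 0.0])  # toy input
kernel = np.array([1.0, 0.0, -1.0])                           # toy difference filter
features = conv1d_valid(signal, kernel)      # 6 values from 8 inputs
pooled = max_pool1d(features, pool_size=2)   # 3 values, strongest response per window
```

Pooling halves the length of the feature map here while keeping the largest response in each window, which is the sense in which it emphasizes key data and discards weaker responses.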
Table 7. Number of parameters of CNN.
Layers  Neurons  Parameters (Binary)  Parameters (Multi-Class)
1       256      13,313               15,112
1       512      26,625               30,216
1       768      39,937               45,320
3       256      19,521               19,976
3       512      63,617               64,520
3       768      146,945              148,744
3.1.4. LSTM
The architecture of LSTM is shown in Figure 3d. LSTM is a variant of RNN designed
to better handle long sequence dependencies and mitigate the vanishing gradient problem
of traditional RNNs. The number of parameters in each layer and the corresponding number
of nodes are shown in Table 8.
3.1.5. CNN + RNN
The architecture of CNN + RNN is shown in Figure 3e. We test two configurations of
this architecture: one with one convolutional layer and one
recurrent layer, and one with three convolutional layers and three recurrent layers. The
number of parameters in each layer and the corresponding number of nodes are shown in
Table 9.
Table 8. Number of parameters of LSTM.
Layers  Neurons  Parameters (Binary)  Parameters (Multi-Class)
1       256      311,553              313,352
1       512      1,147,393            1,150,984
1       768      2,507,521            2,512,904
3       256      173,121              173,576
3       512      354,433              619,528
3       768      1,364,225            1,366,024
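The one-layer RNN and LSTM parameter counts reported in this section are consistent with the standard formulas for recurrent cells. The sketch below assumes 46 input features per time step, a BN layer with four parameters per unit, and an output Dense layer with 1 (binary) or 8 (multi-class) units. These details are assumptions that happen to reproduce the one-layer totals, not a description taken from the text.

```python
def simple_rnn_params(n_in, units):
    """Input weights, recurrent weights, and bias of a simple recurrent cell."""
    return units * (n_in + units + 1)

def lstm_params(n_in, units):
    """Four gates (input, forget, cell, output), each shaped like a simple RNN cell."""
    return 4 * simple_rnn_params(n_in, units)

def one_layer_model_params(recurrent_params, units, n_outputs):
    """Recurrent layer -> BatchNorm -> (Dropout) -> output Dense."""
    batch_norm = 4 * units
    output_dense = units * n_outputs + n_outputs
    return recurrent_params + batch_norm + output_dense

for units in (256, 512, 768):
    rnn_binary = one_layer_model_params(simple_rnn_params(46, units), units, 1)
    lstm_binary = one_layer_model_params(lstm_params(46, units), units, 1)
    print(units, rnn_binary, lstm_binary)
```

The factor of four explains why an LSTM layer of a given width is roughly four times as large as a Simple RNN layer of the same width.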
Table 9. Number of parameters of CNN + RNN.
Layers  Neurons  Parameters (Binary)  Parameters (Multi-Class)
1       256      78,849               133,160
1       512      288,769              365,864
1       768      629,761              729,640
3       256      44,097               86,568
3       512      161,921              215,336
3       768      343,553              397,864
Table 10. Number of parameters of CNN + LSTM.
Layers  Neurons  Parameters (Binary)  Parameters (Multi-Class)
1       256      420,041              428,840
1       512      1,346,849            1,350,440
1       768      2,790,945            2,796,328
3       256      246,625              247,080
3       512      756,641              757,544
3       768      1,479,713            1,481,512
3.2. Transformer
The architecture of the Transformer used in this paper is shown in Figure 4, and the
detailed parameters are shown in Table 11. The full Transformer architecture includes
an encoder and a decoder, but for binary and multi-class classification tasks involving a
single output sequence, the decoder is unnecessary. Therefore, only encoders [7] are used
in our architecture.
Figure 4. Transformer encoder architecture diagram.
Table 11. Number of parameters of Transformer.
Dense Dimension (FFN)  Number of Heads  Number of Layers (Encoder)  Parameters (Binary)  Parameters (Multi-Class)
256                    1                1                           32,733               33,062
128                    1                1                           20,829               21,158
512                    1                1                           56,541               56,870
1024                   1                1                           104,157              104,486
2048                   1                1                           199,389              199,718
256                    2                1                           41,335               41,664
256                    4                1                           58,539               58,868
256                    8                1                           94,947               93,276
256                    1                2                           41,381               41,710
256                    1                4                           58,677               59,006
256                    1                8                           94,269               93,598
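The parameter counts of the single-layer configurations can be reproduced with a short calculation. The sketch below assumes an encoder block made of one multi-head attention layer whose per-head key dimension equals the 46 input features, two layer normalizations, a two-layer feed-forward network (46 to FFN dimension and back to 46), and an output Dense layer applied to a 46-dimensional pooled vector. None of these details is stated in the text; they are assumptions chosen because they match the reported counts for the single-layer rows with one, two, and four heads.

```python
def attention_params(d_model, num_heads, key_dim):
    """Query, key, and value projections plus the output projection."""
    qkv = 3 * (d_model * num_heads * key_dim + num_heads * key_dim)
    out = num_heads * key_dim * d_model + d_model
    return qkv + out

def encoder_classifier_params(d_model, ffn_dim, num_heads, n_outputs, key_dim=None):
    """One encoder block followed by an output Dense layer (assumed layout)."""
    key_dim = d_model if key_dim is None else key_dim       # assumption: key_dim = d_model
    layer_norms = 2 * (2 * d_model)                          # two LayerNorms: gamma and beta
    ffn = (d_model * ffn_dim + ffn_dim) + (ffn_dim * d_model + d_model)
    head = d_model * n_outputs + n_outputs                   # 1 unit binary, 8 units multi-class
    return attention_params(d_model, num_heads, key_dim) + layer_norms + ffn + head

for ffn_dim in (128, 256, 512, 1024, 2048):
    print(ffn_dim,
          encoder_classifier_params(46, ffn_dim, num_heads=1, n_outputs=1),
          encoder_classifier_params(46, ffn_dim, num_heads=1, n_outputs=8))
```

Under these assumptions, each additional unit of FFN dimension adds 93 parameters and each additional attention head adds 8,602 parameters.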
Additionally, two structures can be omitted for classification purposes. First, word
embedding, which converts language vocabulary into a vector space for deep learning
analysis, is unnecessary for our model. The data we are classifying are already in numeric
form and converted to integers, which eliminates the need for word embeddings. Second,
positional encoding, which is used to determine the relative and absolute
positions of tokens in sentences, is not needed for our dataset. The length and composition
of the "sentences" in our data are fixed, making this structure unnecessary [5].
Figure 6. Multi-head attention with two heads as an example: (a) the schematic diagram of finding one of the outputs, $b^{1,1}$; (b) the schematic diagram of finding one of the outputs, $b^{1,2}$; and (c) the schematic diagram of adding two results, $b^{i} = W^{O} [b^{i,1}; b^{i,2}]$.
Then, as shown in Figure 6b, for the second attention head, $q^{1,2}$ performs an attention
calculation with $k^{1,2}$, Softmax is applied, and the result is multiplied by $v^{1,2}$. Likewise,
$q^{1,2}$ performs an attention calculation with $k^{2,2}$, Softmax is applied, and the result is
multiplied by $v^{2,2}$. Adding these two results gives $b^{1,2}$, that is:
$$b^{1,2} = \sum_{i=1}^{n} \mathrm{Softmax}\left(q^{1,2} \cdot k^{i,2}\right) v^{i,2} \qquad (13)$$
Finally, these two outputs are concatenated and multiplied by an output transformation
matrix $W^{O}$ to obtain the final output $b^{1}$, as shown in Figure 6c.
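The two-head computation described above can be sketched in NumPy as follows. The weight matrices and input are placeholders, and the division by the square root of the head dimension follows the standard formulation in [15]; it is included here as an assumption because it does not appear in Equation (13) as written. The sketch illustrates the mechanism only and is not the implementation used in the experiments.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(a, w_q, w_k, w_v, w_o, num_heads=2, scale=True):
    """Self-attention over the rows of `a`, split into `num_heads` heads."""
    n, d_model = a.shape
    q, k, v = a @ w_q, a @ w_k, a @ w_v          # project inputs to queries, keys, values
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        q_h, k_h, v_h = q[:, cols], k[:, cols], v[:, cols]
        scores = q_h @ k_h.T                      # q^{i,h} . k^{j,h} for every pair (i, j)
        if scale:
            scores = scores / np.sqrt(d_head)     # scaling from [15]; an assumption here
        weights = softmax(scores, axis=-1)        # normalize over the keys
        outputs.append(weights @ v_h)             # b^{i,h}: weighted sum of the values
    return np.concatenate(outputs, axis=-1) @ w_o  # concatenate heads, apply W^O

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4))                       # 3 positions, model dimension 4
w_q, w_k, w_v, w_o = (rng.normal(size=(4, 4)) for _ in range(4))
b = multi_head_attention(a, w_q, w_k, w_v, w_o, num_heads=2)
```

Each head attends over the same positions but uses its own slice of the projected queries, keys, and values, so the heads can capture different relationships before the output projection combines them.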
4. Experimental Results
4.1. Experimental Environment
The equipment specifications and environment settings used in this article are shown
in Table 12. Because training with the CPU-only tensorflow package is too slow,
we use tensorflow-gpu to run our models and speed up training. The
hyperparameters of the six neural network models are shown in Table 13. Due to the large
size of the dataset, we increased the batch size to 1024.
Table 12. Equipment specifications and environment settings.
Project      Properties
OS           Windows 11
CPU          Intel® Core™ i7-13700 Processor
GPU          NVIDIA GeForce RTX 4080
Memory       128 GB
Disk         1 TB SSD
Python       3.7.16
NVIDIA CUDA  11.3.1
Framework    Tensorflow-gpu 2.5 & 2.6
Table 13. Hyperparameters of the six neural network models.
Hyperparameter  Value
Batch Size      1024
Epochs          10
Learning Rate   0.001
Dropout         0.1
Table 14. Accuracy results of DNN.
Layers  Neurons  Binary Accuracy (%)  Multi-Class Accuracy (%)
1       256      99.48                97.35
1       512      99.47                97.73
1       768      99.53                99.13
3       256      99.56                99.16
3       512      99.56                99.23
3       768      99.56                99.36
The accuracy results of RNN are shown in Table 16, and the evaluation results of RNN
are shown in Table 17.
Table 16. Accuracy results of RNN.
Layers  Neurons  Binary Accuracy (%)  Multi-Class Accuracy (%)
1       256      99.49                99.21
1       512      99.49                99.22
1       768      99.48                99.24
3       256      99.53                99.26
3       512      99.50                99.27
3       768      99.50                99.28
The accuracy results of CNN are shown in Table 18, and the evaluation results of CNN
are shown in Table 19.
The accuracy results of LSTM are shown in Table 20, and the evaluation results of
LSTM are shown in Table 21.
Table 20. Accuracy results of LSTM.
Layers  Neurons  Binary Accuracy (%)  Multi-Class Accuracy (%)
1       256      99.51                99.28
1       512      99.51                99.28
1       768      99.50                99.28
3       256      99.54                99.32
3       512      99.54                99.21
3       768      99.52                99.34
The accuracy results of CNN + RNN are shown in Table 22, and its evaluation results
are shown in Table 23.
Table 22. Accuracy results of CNN + RNN.
Layers  Neurons  Binary Accuracy (%)  Multi-Class Accuracy (%)
1       256      99.37                99.15
1       512      99.29                99.19
1       768      99.45                99.11
3       256      99.46                99.16
3       512      99.42                99.07
3       768      99.15                99.03
The accuracy results of CNN + LSTM are shown in Table 24, and its evaluation results
are shown in Table 25.
Table 24. Accuracy results of CNN + LSTM.
Layers  Neurons  Binary Accuracy (%)  Multi-Class Accuracy (%)
1       256      99.56                99.33
1       512      99.46                98.70
1       768      99.55                99.34
3       256      99.53                99.31
3       512      99.49                99.26
3       768      99.48                99.26
The accuracy results of Transformer are shown in Table 26 and its evaluation results
are shown in Tables 27 and 28.
Table 26. Accuracy results of Transformer.
Dense Dimension (FFN)  Number of Heads  Number of Layers (Encoder)  Binary (%)  Multi-Class (%)
256                    1                1                           99.52       94.03
128                    1                1                           99.53       98.72
512                    1                1                           99.52       99.27
1024                   1                1                           99.54       99.31
2048                   1                1                           99.54       99.33
256                    2                1                           99.53       98.88
256                    4                1                           99.52       99.23
256                    8                1                           99.53       95.03
256                    1                2                           99.53       99.25
256                    1                4                           99.52       99.32
256                    1                8                           99.49       99.11
Dense Dimension (FFN)  Number of Heads  Number of Layers (Encoder)  Binary  Multi-Class
256                    1                1                           99.50   93.68
128                    1                1                           99.51   98.72
512                    1                1                           99.51   99.27
1024                   1                1                           99.52   99.43
2048                   1                1                           99.52   99.33
256                    2                1                           99.50   98.88
256                    4                1                           99.50   94.94
256                    8                1                           99.51   98.88
256                    1                2                           99.50   99.24
256                    1                4                           99.49   99.30
256                    1                8                           99.48   99.11
Figure 7. Accuracy figure of DNN (with layer = 3, node = 768, multi-class classification).
Figure 8. Accuracy figure of RNN (with layer = 3, node = 768, multi-class classification).
Figure 9. Accuracy figure of CNN (with layer = 3, node = 768, multi-class classification).
Figure 10. Accuracy figure of LSTM (with layer = 3, node = 768, multi-class classification).
Figure 11. Accuracy figure of CNN + RNN (with layer = 3, node = 768, multi-class classification).
Figure 12. Accuracy figure of CNN + LSTM (with layer = 3, node = 768, multi-class classification).
Figure 13. Accuracy figure of Transformer (with Dense Dimension = 2048, Number of Heads = 1, Number of Layers = 1, multi-class classification).
Table 30. Confusion matrix of DNN (with layer = 3, node = 768, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,073,132       87          287        8001     30         3            16,647    8
DDoS                47              33,980,302  2712       1338     0          0            12        149
DoS                 22              18,808      8,071,716  79       0          0            34        7915
Recon               82,758          5445        105        220,880  1550       138          43,664    15
Web-Based           5367            0           7          3462     3193       12           12,787    1
Brute Force         2508            0           2          1938     15         3749         4852      0
Spoofing            56,557          132         141        13,208   91         945          415,405   25
Mirai               9               13,504      289        1175     0          0            18        2,619,129
Table 31. Confusion matrix of RNN (with layer = 3, node = 768, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,057,073       7           4          17,204   40         1            23,866    0
DDoS                51              33,980,261  2463       1198     0          0            96        491
DoS                 26              7272        8,083,199  32       0          0            46        163
Recon               83,296          1312        37         236,622  196        9            33,083    10
Table 32. Confusion matrix of CNN (with layer = 3, node = 768, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,034,444       14          7          22,362   127        47           41,192    2
DDoS                83              33,979,984  3238       764      0          0            63        428
DoS                 36              6228        8,084,368  20       0          0            37        49
Recon               78,798          2093        40         236,729  790        161          35,930    24
Web-Based           6077            1           2          5485     2960       7            10,297    0
Brute Force         3564            0           0          3584     78         2401         3437      0
Spoofing            101,541         23          4          24,349   880        98           359,605   4
Mirai               5               380         63         6        0          0            16        2,633,654
Table 33. Confusion matrix of LSTM (with layer = 3, node = 768, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,049,179       16          3          17,245   244        34           31,472    2
DDoS                46              33,980,598  2405       1335     2          0            47        136
DoS                 24              6531        8,084,054  28       1          0            37        63
Recon               68,011          723         29         247,281  1212       179          37,128    2
Web-Based           5230            1           0          4826     5520       16           9235      1
Brute Force         3258            1           0          3384     142        2864         3415      0
Spoofing            88,611          29          30         21,880   1797       170          373,965   22
Mirai               11              865         38         19       0          0            25        2,633,166
Table 34. Confusion matrix of CNN + RNN (with layer = 3, node = 768, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,043,235       67          3          20,089   81         2            34,715    3
DDoS                108             33,962,688  16,078     3626     0          2            290       1768
DoS                 42              29,673      8,058,272  1521     3          0            47        1180
Recon               95,693          4048        638        217,211  55         14           36,490    416
Table 35. Confusion matrix of CNN + LSTM (with layer = 3, node = 768, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,042,720       31          6          25,929   367        26           29,116    0
DDoS                33              33,980,794  2611       778      0          3            83        258
DoS                 15              6435        8,084,207  10       1          0            30        40
Recon               66,965          1731        27         251,565  1386       155          32,689    47
Web-Based           4273            6           1          6214     5465       10           8410      0
Brute Force         3036            1           0          3710     177        2740         3400      0
Spoofing            93,724          109         28         26,392   2532       77           363,638   4
Mirai               7               368         31         70       0          0            103       2,633,545
Table 36. Confusion matrix of Transformer (with Dense Dimension = 2048, Number of Heads = 1, Number of Layers = 1, multi-class classification).
Actual \ Predicted  Benign Traffic  DDoS        DoS        Recon    Web-Based  Brute Force  Spoofing  Mirai
Benign Traffic      1,050,021       1264        1          23,943   61         11           22,828    66
DDoS                13              33,975,357  2031       3208     1          0            688       3262
DoS                 46              25,250      8,064,500  498      0          0            60        384
Recon               59,531          2309        2          257,601  28         7            35,007    80
Web-Based           5513            23          0          4960     7361       0            6971      1
Brute Force         3300            6           0          2589     2          2318         4848      1
Spoofing            68,286          613         0          23,988   379        333          392,815   90
Mirai               3               7796        79         212      0          0            262       2,625,772
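Confusion matrices such as those above can be summarized into overall accuracy and per-class precision, recall, and F1-score. The following sketch shows the calculation on a small hypothetical three-class matrix; the numbers are invented for illustration and are not taken from the experiments.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Rows are actual classes; columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    accuracy = true_pos.sum() / cm.sum()
    recall = true_pos / cm.sum(axis=1)      # fraction of each actual class that was found
    precision = true_pos / cm.sum(axis=0)   # fraction of each predicted class that was correct
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical 3-class confusion matrix (assumes every class is predicted at least once).
toy = [[90, 10, 0],
       [5, 80, 15],
       [0, 20, 30]]
accuracy, precision, recall, f1 = metrics_from_confusion(toy)
```

In this toy example, the smallest class has the lowest recall even though overall accuracy is high, which is the same pattern that class imbalance can produce for minority attack categories.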
5. Conclusions
This research is based on the CIC-IoT-2023 dataset and presents an in-depth discussion
and analysis of IoT network intrusion detection. We apply deep learning methods to
improve the detection of abnormal behaviors and intrusions. Compared with
other papers, we additionally use the Transformer model and perform multi-class
classification. The experimental results show that in binary classification, DNN and CNN + LSTM
have the highest accuracy, while in multi-class classification, the Transformer model has
the highest accuracy. This demonstrates the potential application value of deep learning methods
in IoT network intrusion detection. In the future, the dataset can be reconstructed and
balanced so that attacks in minority categories are detected more reliably and all 34
categories can be used directly for classification, improving the generalization ability of the
model. Features that have no impact on classification can also be removed to improve
classification efficiency.
The method used in this study brings new possibilities to the field of IoT network
intrusion detection. It is hoped that the results of this study can provide a valuable reference
for the development of the field of IoT security.
Author Contributions: Conceptualization, S.-M.T. and Y.-C.W.; methodology, S.-M.T. and Y.-C.W.;
software and data curation, Y.-Q.W.; funding acquisition, S.-M.T. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was funded by National Science and Technology Council, Taiwan grant
number NSTC 112-2221-E-027-079-MY2.
Data Availability Statement: The data can be shared upon request. The data are not publicly
available due to privacy restrictions.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Abbas, S.; Al Hejaili, A.; Sampedro, G.A.; Abisado, M.A.; Almadhor, A.M.; Shahzad, T.; Ouahada, K. A novel federated edge
learning approach for detecting cyberattacks in IoT infrastructures. IEEE Access 2023, 11, 112189–112198. [CrossRef]
2. Asharf, J.; Moustafa, N.; Khurshid, H.; Debie, E.; Haider, W.; Wahab, A. A review of intrusion detection systems using machine
and deep learning in Internet of Things: Challenges solutions and future directions. Electronics 2020, 9, 1177. [CrossRef]
3. Dadkhah, S.; Mahdikhani, H.; Danso, P.K.; Zohourian, A.; Truong, K.A.; Ghorbani, A.A. Towards the development of a realistic
multidimensional IoT profiling dataset. In Proceedings of the 2022 19th Annual International Conference on Privacy, Security &
Trust (PST), Fredericton, NB, Canada, 22–24 August 2022; pp. 1–11.
4. Talpur, A.; Gurusamy, M. Machine learning for security in vehicular networks: A comprehensive survey. IEEE Commun. Surv.
Tutor. 2022, 24, 346–379. [CrossRef]
5. Li, Q.F.; Liu, Y.Q.; Niu, T.; Wang, X.M. Improved Resnet Model Based on Positive Traffic Flow for IoT Anomalous Traffic Detection.
Electronics 2023, 12, 3830. [CrossRef]
6. Wang, Y.C.; Yng, Y.C.; Chen, H.X.; Tseng, S.M. Network anomaly intrusion detection based on deep learning approach. Sensors
2023, 23, 2171. [CrossRef] [PubMed]
7. Ahmed, S.W.; Kientz, F.; Kashef, R. A modified transformer neural network (MTNN) for robust intrusion detection in IoT
networks. In Proceedings of the 2023 International Telecommunications Conference (ITC-Egypt), Alexandria, Egypt, 18–20 July
2023; pp. 663–668.
8. Mezina, A.; Burget, R.; Travieso-González, C.M. Network Anomaly Detection with Temporal Convolutional Network and U-Net
model. IEEE Access 2021, 9, 143608–143622. [CrossRef]
9. He, M.S.; Wang, X.J.; Wei, P.; Yang, L.; Teng, Y.L.; Lyu, R.J. Reinforcement learning meets network intrusion detection: A
transferable and adaptable framework for anomaly behavior identification. IEEE Trans. Netw. Serv. Manag. 2024, 21, 2477–2492.
[CrossRef]
10. Jony, A.I.; Arnob, A.K.B. A long short-term memory based approach for detecting cyber attacks in IoT using CIC-IoT2023 dataset.
J. Edge Comput. 2024, 3, 28–42. [CrossRef]
11. Jaradat, A.S.; Nasayreh, A.; Al-Na’amneh, Q.; Gharaibeh, H.; Al Mamlook, R.E. Genetic optimization techniques for enhancing
web attacks classification in machine learning. In Proceedings of the 2023 IEEE International Conference on Dependable,
Autonomic & Secure Computing, Abu Dhabi, United Arab Emirates, 14–17 November 2023; pp. 0130–0136.
12. Guo, G.; Pan, X.; Liu, H.; Li, F.; Pei, L.; Hu, K. An IoT intrusion detection system based on TON IoT network dataset. In
Proceedings of the 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV,
USA, 8–11 March 2023; pp. 0333–0338.
13. Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A real-time dataset and benchmark for
large-scale attacks in IoT environment. Sensors 2023, 23, 5941. [CrossRef]
14. Shtayat, M.M.; Hasan, M.K.; Sulaiman, R.; Islam, S.; Khan, A.U.R. An explainable ensemble deep learning approach for intrusion
detection in industrial Internet of Things. IEEE Access 2023, 11, 115047–115061. [CrossRef]
15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In
Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30.
16. Haque, S.; El-Moussa, F.; Komninos, N.; Muttukrishnan, R. A systematic review of data-driven attack detection trends in IoT.
Sensors 2023, 23, 7191. [CrossRef] [PubMed]
17. Le, T.T.H.; Wardhani, R.W.; Putranto, D.S.C.; Jo, U.; Kim, H. Toward enhanced attack detection and explanation in intrusion
detection system-based IoT environment data. IEEE Access 2023, 11, 131661–131676. [CrossRef]