
Received January 25, 2022, accepted February 17, 2022, date of publication February 22, 2022, date of current version March 2, 2022.


Digital Object Identifier 10.1109/ACCESS.2022.3153331

A Deep Learning Approach for IoT Traffic Multi-Classification in a Smart-City Scenario
AROOSA HAMEED, JOHN VIOLOS, AND ARIS LEIVADEAS, (Senior Member, IEEE)
Department of Software and Information Technology Engineering, École de Technologie Supérieure, Montreal, QC H3C 1K3, Canada
Corresponding author: Aris Leivadeas ([email protected])
This work was supported in part by the CHIST-ERA-18-SDCDN-003-DRUID-NET Project “eDge computing ResoUrse allocatIon for Dynamic NETworks (DRUID-NET).”

ABSTRACT As the number of Internet of Things (IoT) devices and applications increases, the capacity
of the IoT access networks is considerably stressed. This can create significant performance bottlenecks in
various layers of an end-to-end communication path, including the scheduling of the spectrum, the resource
requirements for processing the IoT data at the Edge and/or Cloud, and the attainable delay for critical
emergency scenarios. Thus, a proper classification or prediction of the time varying traffic characteristics
of the IoT devices is required. However, this classification largely remains an open challenge. Most of the
existing solutions are based on machine learning techniques, which nonetheless incur high computational
cost and do not consider the fine-grained flow characteristics of the traffic. To this end, this
paper introduces the following four contributions. Firstly, we provide an extended feature set including
flow-, packet-, and device-level features to characterize the IoT devices in the context of a smart environment.
Secondly, we propose a custom weighting-based preprocessing algorithm to determine the importance of the
data values. Thirdly, we present insights into traffic characteristics using feature selection and correlation
mechanisms. Finally, we develop a two-stage learning algorithm and we demonstrate its ability to accurately
categorize the IoT devices in two different datasets. The evaluation results show that the proposed learning
framework achieves 99.9% accuracy for the first dataset and 99.8% accuracy for the second. Additionally,
for the first dataset we achieve a precision and recall of 99.6% and 99.5%, respectively, while for the second
dataset the precision and recall attained are 99.6% and 99.7%, respectively. These results show that our
approach clearly outperforms other well-known machine learning methods. Hence, this work provides a
useful model deployed in a realistic IoT scenario, where IoT traffic and devices’ profiles are predicted and
classified, while facilitating the data processing in the upper layers of an end-to-end communication model.

INDEX TERMS Deep learning, edge computing, Internet of Things, machine learning, neural networks,
traffic classification.

The associate editor coordinating the review of this manuscript and approving it for publication was Chunsheng Zhu.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 10, 2022 21193

I. INTRODUCTION
The Internet of Things (IoT) allows tens of billions of devices to be connected over the Internet. Nonetheless, the rapid increase of IoT devices has also resulted in a colossal increase of the data they generate. Specifically, the total data has quadrupled in just five years, from 145 ZB in 2015 to 600 ZB in 2020 [1]. Furthermore, IoT not only enables new applications, but introduces new types of devices as well. For example, in the context of a smart environment, thousands of non-traditional Internet devices are used, including smart sensors, alarms, traffic lights, cameras, weather stations, etc., generating an unprecedented volume of data for a variety of smart applications such as healthcare, industrial control, transportation and so on. However, these IoT devices usually have limited computational abilities [2] and cannot process the generated data locally.

This often calls for offloading computationally heavy IoT tasks to a remote infrastructure, a process called task offloading [3]. Edge Computing [4] is a viable solution for task offloading, as it offers the necessary networking and computational resources at the edge of the network while enabling real-time processing of the IoT data. However, as explained in [5], it is extremely difficult to estimate the edge resources needed due to the fact that (i) the IoT data are randomly generated, as a consequence of the
different types of devices and their dynamic cycle activity; and (ii) when there is a large number of IoT devices, the total communication delay may be affected on account of the constrained nature of the IoT access networks.

Hence, the importance of predicting the time-varying characteristics of the IoT devices (such as activity patterns, signaling patterns, etc.) becomes evident. Furthermore, the classification of similar devices facilitates the estimation of the generated workload and can better guarantee a specific level of Quality of Service (QoS). Therefore, by classifying the IoT devices into different categories, the prediction of traffic characteristics can be done more efficiently. Additionally, a more accurate prediction of the resource requirements at the IoT access network (i.e., spectrum) and Edge infrastructures (i.e., computational and communication resources) can be achieved.

However, such an IoT device classification, often called device fingerprinting [6], presents several challenges. In particular, the existing IoT classification techniques do not consider the fine-grained characterization of IoT traffic, while they suffer from high computational cost for the data extraction and processing, and are often affected by high dimensional data and complexity. Accordingly, in this paper, we propose a two-stage deep learning architecture in order to classify the IoT devices by considering a fine-grained set of network characteristics (features). To do so, firstly, we propose a two-step preprocessing algorithm while employing a feature selection and prioritization technique for the feature set under consideration. Our approach facilitates the distribution of the features in the two stages, avoiding the high dimensionality and overfitting problems of the training data.

The novelty of this paper lies in proposing a very accurate but considerably more lightweight approach than the existing ones. Furthermore, the feature selection and prioritization along with the combination of a deep learning model creates a unique and innovative approach for the problem of the IoT device classification. The novelty of our approach is strengthened by the fact that it can be generalized and applied in different datasets without losing any accuracy. Thus, the reproducibility of the results and the stability of our approach in different IoT contexts fortify the originality introduced. In particular, the major contributions and novelty of this paper can be summarized as follows:

1) In order to perform a classification of the IoT devices, we have suggested an extended feature set comprising flow, device, and packet level features. This approach provides a fine-grained characterization of the traffic flow with less computational complexity for the classification.

2) A two-step preprocessing algorithm is proposed that assigns relevance weights to the nominal (representing the qualitative data with numeric codes) features and provides scaling of the dataset using a MinMaxScaler method.

3) A statistical feature selection technique is employed to select the features with regard to their contribution to the classification of IoT devices. Furthermore, an investigation of correlated features at each level is provided using the Pearson correlation coefficient.

4) A two-stage learning framework is presented with 99.9% accuracy for the first dataset under consideration and 99.8% for the second one, which demonstrates the generalization of our approach. To determine the IoT device classification, we compute the classes for certain nominal and multivalued attributes at learning stage 0 using logistic regression. Following, we perform the final classification for numerical and single-valued features at stage 1 using a multilayer perceptron (MLP) neural network. The MLP network takes as an input a feature subset at each time and classifies IoT devices in the context of a smart environment. Furthermore, to achieve the optimal or near optimal MLP architecture, a random search based Keras tuner is employed.

The rest of the paper is structured as follows: Section II highlights the related work in traffic classification, covering the most important methods and technologies applied in the IoT traffic classification domain. Section III provides the system model and necessary preliminaries for comprehending the classification problem in the context of the IoT domain. Additionally, this Section covers the description of the feature sets, their statistical characteristics and feature correlation, information that is necessary for the domain of data analysis that our paper touches upon. Section IV presents the two-stage proposed learning framework for the IoT device classification problem. Section V explains the algorithmic form of the proposed preprocessing and learning model along with their asymptotic analysis. Sections IV and V fall under the domains of deep learning, machine learning and problem complexity, presenting all the necessary technical details. Section VI provides the performance evaluation results for both datasets under consideration. The conclusions and the future directions of this work are presented in Section VII. Finally, Table 1 presents the set of abbreviations used in this paper.

II. RELATED WORK
For the IoT device classification, significant emphasis has been placed on aggregated traffic models, fingerprinting, and machine learning based solutions. The aggregated traffic models resort to mathematical and statistical distribution-based methods, which involve several probability distributions and mathematical techniques like stochastic processes to model the traffic. Following, the fingerprinting methods are used to identify the IoT devices leveraging information from network traces in order to correlate datasets. In particular, this category of classification identifies a device using information from the network packets during the communication over the network.


FIGURE 1. Overview of our previous work vs. proposed work contributions (shown in the purple boxes).

TABLE 1. List of abbreviations.

Regarding the Machine Learning (ML) based schemes, they utilize existing algorithms to automatically learn complex patterns from the IoT traffic data. The algorithms used in these schemes are classified according to how the learning process is conducted. Four main classes are used to group ML algorithms: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. However, in the current literature, mostly supervised learning, unsupervised learning or a combination of these two are utilized in order to analyze, predict and model the IoT traffic and device characteristics.

With respect to aggregated traffic models, Laner et al. [7] proposed a Coupled Markov Modulated Poisson Processes (CMMPP) framework to capture the traffic behavior of a single machine-type communication along with the collective behavior of tens of thousands of devices. In [8] a classification strategy is designed for a fleet management use case incorporating three classes of M2M traffic states, namely periodic update, event-driven, and payload exchange. The authors in [9] proposed a model that estimates the M2M traffic volume generated in a wireless network-enabled connected home. However, the above works do not consider the fine-grained characterization of the IoT traffic, whereas the complexity of such methods grows linearly with the number of the devices. Furthermore, common communication patterns identified can be attributed to any sensing device under a specific use case (limitation 1).

There is also a significant effort to identify the type of the IoT devices using the fingerprinting method. For example, “IoT Sentinel” [10] is a classification system that can recognize and identify the IoT devices immediately after they are connected to a network using a single attribute vector with 276 network features. The “IoT Sentinel” framework can be further improved by extracting additional network features such as payload entropy, TCP payload length, and TCP window size [11]. Similarly, in [12] almost 300 network attributes are used from each TCP traffic session to classify the devices, using a majority voting for every 20 consecutive sessions.

The work in [13] utilized a deep learning approach in order to perform the device fingerprinting using the packet


interarrival time. However, this approach is computationally intensive as all packet level information is utilized without any selection strategy. In [14], the traffic patterns of encrypted network flows are used to reveal the existence of a specific device inside a home network. However, obtaining such a great number of features requires specialized hardware accelerators, thus resulting in high computational cost, longer classification duration and limited scalability due to the need for a deep packet inspection functionality (limitation 2).

Some related works also employed machine learning in order to perform traffic and device classification. Lippmann et al. [15] compared the K-nearest neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT) and Multilayer Perceptron (MLP), using the packet header information, and concluded that KNN and DT provide better results. Kotak and Elovici [16] classified nine different device flows based on the device type using an artificial neural network. Regarding traffic classification, the authors in [17] predicted the QoS behavior of five different IoT applications in a smart building context, using several regression based ML approaches.

The work in [18] shows how to classify traffic and perform device identification using random forest. The list of key features used in the classification included the packet size, volume of packets, inter-arrival time, duration, urgent and push flags. Additionally, the authors in [19] performed a prediction of the IoT network traffic using Long Short-Term Memory (LSTM). The features of the dataset consisted of the timestamp, bytes count, and the packet count. A more comparative approach was introduced in [20], where the authors presented a method to recognize the IoT devices using random forest, decision tree, SVM, k-nearest neighbors, simple neural network and naive Bayes approaches.

Lopez-Martin et al. [21] classified the traffic applications using a multi-class neural network, which is proven to be effective in complex data structures. The authors in [22] proposed an individual binary classification model for each class in order to eliminate the complexity issue of multi-class classification. Sivanathan et al. [23] utilized the statistical attributes, signaling patterns and cipher suites along with machine learning for IoT device classification.

Nonetheless, these ML approaches are affected by the high data dimensionality, they are sensitive to the hyper-parameter tuning and they require a large amount of training data. Moreover, the main constraint of the multi-class classification is scalability, as the high number of classes makes the classifier more complex and updating requires full retraining (limitation 3). A summary of the papers reviewed in this section is given in Table 2.

TABLE 2. Comparison of related works.

In our preliminary work [24], we tried to address some of these limitations by relying on typical machine learning techniques, such as logistic regression and gradient boosting. In this paper, we extend our preliminary framework to provide a more complete and detailed IoT multi-classification approach based on a deep learning solution. As this research is an extension of our previous study, we used the same IoT dataset [23]. However, in order to demonstrate the generalization of our proposed methodology we also performed our experiments with a second IoT dataset [25]. Additionally, herein, we include a more extended feature set at three different levels: device, flow and packet.

This work also introduces a feature correlation mechanism, whereas specific features are selected for training models


which is not included in our previous work. Furthermore, for the new two-stage learning framework, we apply an optimal searched neural network architecture at the second stage. Finally, a completely new performance evaluation section is presented. The particular section includes a new set of results for both datasets, new experiments, and additional comparisons with machine learning and deep learning approaches. The differences between our previous and proposed work are given in Fig. 1.

The extensions made in this paper are aligned in such a way as to address the above cited limitations:
• To overcome limitation 1, we incorporate a fine-grained feature set at different network levels, i.e., flow, device and packet level.
• To address limitation 2 and the high computational costs of complex features, we employ a statistical feature selection (i.e., ANOVA score) to select a subset of the available features at a time instance t.
• To address limitation 3, we propose a two-stage learning framework. Firstly, a relevance weighting-based preprocessing is performed for the available features, whereas different subsets of the selected features are utilized across these two stages to avoid the high dimensionality issue. Finally, the tuned hyperparameters are utilized in a neural network that achieves 99.9% accuracy for the first dataset and 99.8% for the second.

III. PROBLEM SETUP
In this section, we describe and formulate the IoT traffic classification problem, where different IoT devices are assigned to their respective classes according to their distinctive characteristics. To help the reader follow the modeling of our work, Table 3 summarizes the key notation used throughout this paper.

TABLE 3. Summary of the key notation.

In particular, a smart environment (e.g., smart city, home, grid, etc.) can be modeled as a network of $S$ smart devices, generating $M$ traffic flows. The devices are represented by the set $D = \{d_1, d_2, \ldots, d_s\}$, where $d_s$ indicates the $s$th smart device, where $1 \le s \le S$. Similarly, the set $T = \{t_{d_1}^{1}, t_{d_2}^{2}, \ldots, t_{d_s}^{m}\}$ represents the generated traffic flows, where $t_{d_s}^{m}$ denotes the $m$th traffic flow in $T$ generated by the $s$th device, with $1 \le m \le M$ such that $M \subseteq S$. Furthermore, each traffic flow is constituted by a number of packets denoted by $P = \{p_{1m}, p_{2m}, \ldots, p_{km}\}$, where $p_{km}$ represents the $k$th packet of the $m$th flow.

Regarding the features, the set $F$ denotes the distinctive properties of the traffic flow $t_{d_s}^{m}$ which we want to classify. Each packet in $P$ is a $D$-dimensional set of the network elements under consideration. These elements are represented as a feature space $F$, such that $F = \{f_1, f_2, f_3, \ldots, f_i\}$, where $f_i$ represents the $i$th feature in the feature space $F$ with $1 \le i \le 11$ (in this work we assume 11 distinctive features).

The set $F$ consists of device, flow and packet level features, where $f_1$ represents the interarrival time, $f_2$ denotes the source IP address, $f_3$ is the destination IP address, $f_4$ shows the transport protocol used by each flow, $f_5$ is the source port number, $f_6$ denotes the destination port number, $f_7$ is the Time-to-Live (TTL) information, $f_8$ denotes the window size used by the transport layer, $f_9$ indicates the length of a packet, $f_{10}$ denotes the source Ethernet address, and $f_{11}$ is the destination Ethernet address.

Furthermore, we assume that we have a given training set $G$, including pairs of input samples along with their class labels as $G = \{(x_1, c_1), (x_2, c_2), \ldots, (x_r, c_q)\}$. Accordingly, the set $C = \{c_1, c_2, \ldots, c_q\}$ denotes the available classes, where $c_q \in C$ represents the $q$th class in $C$, while $C \subset D$ and $q \le n$. Furthermore, $x_r \in X$ is the $r$th input sample of the total set of samples $X = \{x_1, x_2, \ldots, x_r\}$, such that $X \subset P$ and $r \le k$. Hence, the IoT Traffic Classification problem is defined as the task of assigning the class label $c_q$ to the input vector $x_r$, where $x_r$ belongs to a subset of a feature space $F$, $x_r \in X \subset F$. This task is accomplished using a classification


rule or function $f(x): X^{D} \to C$ that can predict the label $C$ of an unseen $D$-dimensional input vector $x_r$.

A. FEATURE DESCRIPTION
As mentioned earlier, the available features can be categorized as follows:

1) DEVICE LEVEL FEATURES
In this category we consider the source and destination MAC addresses of the devices. Such features are extracted directly from the traffic traces. These features offer a characterization of the IoT traffic independent of the other two levels of features.

2) FLOW LEVEL FEATURES
This includes features such as source and destination IP addresses, protocol type of a flow, source and destination port numbers, the TTL information of a flow, and the window size used by the flow. This set can be used to extract the packet level features of a flow described below.

3) PACKET LEVEL FEATURES
This category includes the timestamp, the interarrival time (IaT), and the length of the packets. The interarrival time is the amount of time that elapses between a packet reception and the arrival of the one following it. As the timestamp follows the normal (Gaussian) distribution, to calculate the interarrival time feature, we analyzed and extracted the time between the successive incoming traffic packets following a Gaussian distribution with an average rate of 1 (since at each time unit one packet arrives). All of the above features along with their description are illustrated in Table 4. To demonstrate the generality of our approach, we used the same feature sets for both datasets under consideration.

TABLE 4. Description of features in both datasets.

B. STATISTICAL CHARACTERISTICS OF THE FEATURES
Each feature $f_i$ in the feature space $F$ has its own distribution, which is represented by a number of different statistical characteristics over different smart devices. The analysis of such distributions can be useful in order to identify which features are most important for the classification. In this work, we considered three statistical characteristics of the distribution of each feature: mean, median and standard deviation. Table 5 summarizes the statistical characteristics of each feature for both datasets. However, for illustration purposes we plot the probability distribution of the features under consideration for the first dataset only, as shown in Fig. 2. As can be seen, the interarrival time shows a Gaussian distribution (as explained in the previous subsection), while all other features exhibit an exponential distribution.

TABLE 5. Statistical characteristics of IoT traffic features.

C. FEATURE CORRELATION
One very important aspect of the performance of the classification is the correlation between the features. Hence, in this work we consider the feature correlation from two perspectives. Firstly, we examine which features are correlated within the feature space. The correlation between two features, say $f_i$ and $f_j$, is calculated using the Pearson correlation coefficient, which is given as:

$$\rho_{(f_i, f_j)} = \frac{\mathrm{cov}(f_i, f_j)}{\sigma_{f_i}\,\sigma_{f_j}} \qquad (1)$$

where $\mathrm{cov}(f_i, f_j)$ is the covariance between features $f_i$ and $f_j$, whereas $\sigma_{f_i}$ and $\sigma_{f_j}$ represent the standard deviation of the $i$th and $j$th feature, respectively. The value of the correlation coefficient lies between $-1$ and $1$. If there is no correlation between the features $f_i$ and $f_j$ then $\rho_{(f_i, f_j)} = 0$. A perfect negative correlation is found if $\rho_{(f_i, f_j)} = -1$ and a perfect positive correlation is found if $\rho_{(f_i, f_j)} = 1$. We plot the correlation between features for the first dataset as a heatmap, which is shown in Fig. 3.

As can be seen, the source IP address is more correlated to TTL, destination port number, source MAC addresses and destination IP addresses. Furthermore, the destination IP address and source port number, the destination IP address and destination MAC address, the packet length and destination MAC address, the source MAC address and source port number, the source port number and destination port number are also highly correlated features.

Secondly, we find the correlation between the input vector features and the target class labels. Then, based on the relationship between independent variables (i.e., feature space) and dependent variable (i.e., class label), we select the features for our learning (classification) framework. This is further discussed in Section IV-C.

IV. PROPOSED CLASSIFICATION FRAMEWORK
A. OVERVIEW
The proposed classification framework consists of three key steps as shown in Fig. 4 and discussed in the following sections.


1) Preprocessing the IoT Traffic (Section IV-B): It is the first step executed and it aims at providing the weighted preprocessing of the dataset along with the rescaling, imputation and transformation of traffic traces.

2) Selecting the most relevant features (Section IV-C): It consists of the selection of the most important features, which are highly correlated to the class labels, using the ANOVA filter based selection method.

3) Two-stage learning model (Section IV-D): Here the classification of the IoT traffic traces is done using the proposed two-stage learning model. At stage 0, the classification is performed by applying a logistic regression technique, while the tentative classes are provided. At stage 1, a neural network is applied to provide the final classes.

The operational flow of the proposed work is provided in Fig. 5.

FIGURE 2. Probability distributions of IoT traffic flow features of Dataset 1.

FIGURE 3. Correlation between IoT traffic features of Dataset 1.

B. DATA PREPROCESSING
During the data preprocessing, a basic filtering of the dataset is performed in order to remove some of the non-meaningful packets such as ping, DNS requests, etc. The features such as TTL, window size, packet length are already numerical, whereas the interarrival time feature is converted to seconds. Following, we observed that some of the features such as “set of port numbers ($f_5$ and $f_6$)”, “set of IP addresses ($f_2$ and $f_3$)” and “set of MAC addresses ($f_{10}$ and $f_{11}$)” are nominal and multi-valued (having more than one value with a single data instance). As machine learning classifiers cannot deal with such data, we converted these features into a numerical form using a two-step procedure.

Firstly, we perform the data cleaning by passing the nominal vectors to the Bag-of-Words (BoW) model [26]. Secondly, as the BoW assigns the same importance to each vector word, we have proposed a relevance weighting to assign a prioritized importance to each word within each vector. These relevance weights, attributed to each feature vector, are passed to the stage 0 classifier and are given by Eq. (2):

$$\text{Relevance Weight} = wf_{w,v} \times vf_{w,v} \qquad (2)$$


where $wf_{w,v}$ denotes the word frequency of a word $w$ within a vector $v$ and $vf_{w,v}$ represents the total vector frequency. Herein, the vectors consist of the “port numbers vector”, “IP addresses vector”, and “MAC addresses vector”. The word frequency $wf_{w,v}$ is defined as the number of times that $w$ occurs in $v$ and is given using Eq. (3):

$$wf_{w,v} = \frac{\text{number of occurrences of word } w \text{ in vector } v}{\text{number of words in vector } v} \qquad (3)$$

Because frequent words are less informative than rare words, the vector frequency $vf_{w,v}$ is given by Eq. (4):

$$vf_{w,v} = \log \frac{\text{number of vectors}}{\text{number of vectors containing word } w} \qquad (4)$$

After this step, we impute the missing values of features using their mean value and re-scale the dataset between 0 and 1 using the MinMaxScaler technique.

FIGURE 4. Overview of proposed two-stage classification framework.

FIGURE 5. Operational flow of the proposed work.

C. FEATURE SELECTION
The supervised feature selection is a way to choose the input features that are believed to be the most useful to a model in order to predict the target variable. For supervised feature selection, one can resort to either wrapper methods or filter based methods. A wrapper based method, such as Recursive Feature Elimination (RFE), selects the features that are performing well.

However, for the selection of features from our feature space $F$, we employed the filter-based feature selection technique [27], which uses statistical methods to score the relationship between the features and the target labels, i.e., class labels. Specifically, we have selected the ANOVA (Analysis of Variance) F-value feature selection technique because our input features are quantitative or become quantitative after preprocessing and the target class labels are of categorical nature (i.e., $c_1$ indicates a Belkin Wemo switch, $c_2$ represents a smart cam and so on).

D. PROPOSED TWO-STAGE LEARNING MODEL
1) STAGE 0 CLASSIFIER
The Logistic Regression method is employed at stage 0, which takes the selected set of features for the training, as given by the ANOVA F-value. The reason that we have selected this classifier is that it has been proven to perform well for very large data sets [28], as in the case of a smart environment. The logistic regression technique investigates the association among the independent variables and the dependent variables of the problem. In our scenario, the selected features are the independent variables and the device categories (e.g., hubs, cameras, etc.) are the dependent variables.
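The preprocessing that produces the stage-0 inputs (the relevance weighting of Eqs. (2) to (4), followed by mean imputation and min-max scaling) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names `relevance_weights` and `impute_and_scale`, the use of the natural logarithm in Eq. (4), the representation of missing values as `None`, and the sample port numbers and TTL values are all assumptions made here.

```python
from collections import Counter
from math import log
from statistics import mean


def relevance_weights(vectors):
    """Weight every word of every nominal vector as wf(w, v) * vf(w), per Eqs. (2)-(4).

    `vectors` is a list of word lists, e.g. the port numbers seen per data instance.
    The natural logarithm is an assumption; the text does not state the log base.
    """
    n_vectors = len(vectors)
    # Denominator of Eq. (4): number of vectors containing each word at least once.
    containing = Counter(word for vec in vectors for word in set(vec))
    weighted = []
    for vec in vectors:
        counts = Counter(vec)
        weighted.append({
            # Eq. (3) multiplied by Eq. (4), which is Eq. (2).
            word: (count / len(vec)) * log(n_vectors / containing[word])
            for word, count in counts.items()
        })
    return weighted


def impute_and_scale(column):
    """Fill missing entries (None) with the column mean, then rescale to [0, 1]."""
    present = [value for value in column if value is not None]
    fill = mean(present)
    filled = [fill if value is None else value for value in column]
    low, high = min(filled), max(filled)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0] * len(filled)
    return [(value - low) / (high - low) for value in filled]


if __name__ == "__main__":
    ports = [["80", "443", "80"], ["443", "53"]]  # hypothetical port-number vectors
    print(relevance_weights(ports))
    print(impute_and_scale([64, None, 128]))  # hypothetical TTL column
```

In this toy input, a word that occurs in every vector (here `"443"`) receives weight 0, which matches the stated intent that frequent words are less informative than rare ones.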

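The filter-based selection that decides which features reach the stage-0 classifier can be illustrated with a small pure-Python computation of the one-way ANOVA F-value. The sketch below is a hedged illustration, not the authors' code: `anova_f_value`, `select_top_features`, the `top_k` cut-off and the toy `ttl`/`noise` columns are hypothetical, and the text does not state how many features are kept.

```python
from statistics import mean


def anova_f_value(values, labels):
    """One-way ANOVA F statistic of a numeric feature against categorical class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_samples, n_classes = len(values), len(groups)
    if n_classes < 2 or n_samples <= n_classes:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = mean(values)
    class_means = {label: mean(group) for label, group in groups.items()}
    # Variation of the class means around the grand mean.
    ss_between = sum(
        len(group) * (class_means[label] - grand_mean) ** 2
        for label, group in groups.items()
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        (value - class_means[label]) ** 2
        for label, group in groups.items()
        for value in group
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))


def select_top_features(columns, labels, top_k):
    """Score each named feature column and return the top_k names plus all scores."""
    scores = {name: anova_f_value(values, labels) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


if __name__ == "__main__":
    labels = ["cam", "cam", "cam", "hub", "hub", "hub"]  # hypothetical device classes
    columns = {"ttl": [1, 2, 3, 7, 8, 9], "noise": [5, 1, 3, 5, 1, 3]}
    print(select_top_features(columns, labels, top_k=1))
```

A feature whose class means differ strongly relative to the spread inside each class gets a high F-value and is kept, whereas a feature with identical class means scores zero and is dropped.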

The goal is to estimate the probability $p$ for a combination of independent variables using the following logit function:

$$\operatorname{logit}(p) = \ln\frac{p}{1-p} \tag{5}$$

where ln is the natural logarithm and $p$ denotes the probability of the outcome given the independent variables. The anti log of (5) allows us to find the estimated regression equation given by Eq. (6):

$$\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n \;\Rightarrow\; p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n}} \tag{6}$$

where $\beta_0$ is an intercept, $\beta_1$, $\beta_2$, and $\beta_n$ are the regression coefficients, $x_1$ is the first independent variable, $x_2$ is the second independent variable, and $x_n$ is the nth selected feature. In order to calculate the $\beta$ coefficients, we employed the Gradient Descent method [29]. The general form of Eq. (6) is given as:

$$p(y_i \mid x_1, x_2, \ldots, x_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}} \tag{7}$$

where $y_i$ represents the dependent variable, i.e., the ith IoT device class, which we predict based on $x_1$, $x_2$, and $x_n$. After calculating the regression coefficients the testing component comes into effect, where the classifier uses the regression coefficients and computes the estimated regression for each testing instance using Eq. (7). Finally, the stage 0 classifier performs a first tentative prediction.

2) STAGE 1 CLASSIFIER
In order to optimally classify the IoT devices, we architect the Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) [30] based classification as our stage 1 classifier. MLP-ANNs are composed of multiple neurons that are arranged in the form of an input, output, and hidden layers. In this work, the architecture of MLP-ANN consists of one input layer with 11 neurons, because we have 11 different features to be passed as an input to the neural network. Following, we optimize the number of hidden layers, while the output layer consists of n neurons, depending on the number of labelled classes n found in each dataset.

MLP-ANN provides two major processes for the classification task. Firstly, it performs the forward propagation process, which feeds the features to the input layer neurons. In our case, all quantitative features along with the output from the stage 0 classifier (i.e., tentative classes) are fed to the input layer. Following, the input layer propagates these data to the hidden layers and then to the output layer. The neurons in each neural network layer calculate the weighted sum, which is then passed to the activation function, as given by Eq. (8):

$$O_i^{(l)} = V^{(l)}\Big(\sum_j w_{(i,j)}^{(l)} \times O_j^{(l-1)} + B_i^{(l)}\Big) \tag{8}$$

where the superscripts on variables represent the layer number and the subscripts represent the neuron numbers in the respective layer. The $w_{(i,j)}^{(l)}$ denotes the weight of a connection between the ith neuron of layer $l$ and the jth neuron of layer $l-1$; $B_i^{(l)}$ represents the bias value applied at the lth layer for the ith neuron; $O_i^{(l)}$ denotes the output of the ith neuron at the lth layer and $V^{(l)}$ represents the nonlinear activation function applied at layer $l$. This work applied the Rectified Linear Units (ReLU) activation function at the input layer and the softmax activation function at the output layer.

The above process continues till the output layer predicts a label, i.e., the class of an IoT device, which is then compared with the actual label, and a loss value is calculated using a loss function based on the categorical cross entropy. Secondly, a back propagation is done in which weights are updated using the predicted output, desired output and their difference. The goal is to minimize the loss by finding the optimal weight values. The optimization function that we applied is based on the Adaptive Moment Estimation (Adam) because it is proved to be very robust for large datasets [31].

To model an optimal MLP-ANN, we used the Keras tuner [32] along with the Random Search technique. For the hyper parameter optimization, we determine the optimal number of hidden layers, the optimal number of neurons in each layer (i.e., a search between 22 and 512 neurons), and the learning rate (i.e., a search between 1e-2 and 1e-4) using a random search tuner. Following, these parameters are passed to the Adam optimizer, since we want to achieve the best performance along with the least computational complexity.

V. CLASSIFICATION ALGORITHM
A. ALGORITHM DESCRIPTION
The preprocessing algorithm (Algorithm 1) consists of the PREP procedure, which firstly generates the BoW representations using the function generate_BOW(). Then, the relevant weights are calculated by employing the word_Freq() and vector_Freq() functions, which take the BoW as an input. Following, the features are scaled using the function MinMaxScaler(). Algorithm 2 depicts the learning model consisting of two procedures, namely, LOGREG and MLP. In the LOGREG procedure, the input instances x and output labels y are split into training and testing data using the function split(). Next, the filter-based feature selection is done using the statistical method called ANOVA score and this is achieved by employing the SelectKBest() function. Then LogisticRegression() generates the model, which is fitted using the fit() function. The prediction is done using predict(), which receives x_tst as the testing dataset.

The MLP procedure generates the classification results based on the MLP-ANN, which takes the results of stage 0 along with the other features. At this stage, firstly the data are split using split() and then a sequential model is created using the function build_model(). Following, the keras tuner is applied to search the number of models using RandomSearch(), which takes the sequential model, the number of trials per search, the max trials allowed and the search objective as an input. Then, the getBestModel() returns the


model with the highest validation accuracy across all models given by the RandomSearch(). Finally, we fit the model with fit() for 70 epochs and then call the predict() function.

Algorithm 1 Preprocessing Algorithm
PREP(f2, f3, f5, f6, f10, f11, devices)
// f2 and f3 are source and destination IP addresses; f5 and f6 are source and destination port numbers; f10 and f11 are source and destination MAC addresses; and devices labels.
1. BOW1 ← generate_BOW(f2, f3)
2. BOW2 ← generate_BOW(f5, f6)
3. BOW3 ← generate_BOW(f10, f11)
4. wf ← word_Freq(BOW1, BOW2, BOW3)
5. vf ← vector_Freq(BOW1, BOW2, BOW3)
6. rel_weight ← wf × vf
7. set x ← dataset(BOW1, BOW2, BOW3, rel_weight)
8. set y ← dataset(devices)
9. set x_norm ← MinMaxScaler(x)
Output: x_norm, y

Algorithm 2 Learning Algorithm
LOGREG(x_norm, y)
// x_norm is the dataset instances and y is the class labels
1. set x_tr, x_tst, y_tr, y_tst ← split(x, y, test_size ← 0.2)
2. set x_tr ← selectKBest(Anova_score, x_tr)
3. set x_tst ← selectKBest(Anova_score, x_tst)
4. set model ← LogisticRegression(max_iter ← 3000)
5. set fit ← model.fit(x_tr, y_tr)
6. set y_pred ← model.predict(x_tst)
Output: y_pred ▷ Stage 0
MLP(y_pred, f1, f4, f7, f8, f9, devices)
// y_pred is the output of Stage 0 classifier; f1 is the interarrival time; f4 is the IP protocol used; f7 is the TTL; f8 and f9 are the window size and packet length; devices are the class labels.
7. set x ← dataset(y_pred, f1, f4, f7, f8, f9)
8. set y ← dataset(devices)
9. set x_tr, x_tst, y_tr, y_tst ← split(x, y, test_size ← 0.2)
10. set m ← build_model()
11. set tuner ← RandomSearch(m, tuner.obj(val_acc), max_tr ← 3, search_tr ← 1)
12. set model ← tuner.getBestModel(num_models ← 1)
13. set history ← model.fit(x_tr, y_tr, epochs ← 70)
14. set y_pred ← model.predict(x_tst)
Output: y_pred : FS ← devices ▷ Stage 1

B. ASYMPTOTIC ANALYSIS
Proposition 1: The computational complexity of the PREP procedure is O(n).
Proof: The PREP procedure running time depends on the number of feature vectors, represented as n. Lines 1-3 take a constant time as they split the vectors into words, thus O(1). Lines 4-5 and 7-8 are assignment statements and each requires O(1) operations. For the rel_weight statement (line 6) the complexity is O(1) ∗ O(n) = O(n). However, line 9 depends on the number of feature vectors n and thus, in the worst-case scenario needs O(n). Accordingly, the overall time complexity of the PREP procedure is linear, i.e., O(1) + O(1) + O(n) + O(n) = O(n). ∎

Proposition 2: The computational complexity of the LOGREG procedure is O(n).
Proof: Line 1 is a simple assignment statement (i.e., O(1)) and lines 2-3 require O(n) computation time in the worst scenario. Regarding the training time (lines 4-5) of LOGREG the complexity is O(t ∗ n) where t is the number of training examples and n is the number of selected data features used for the classifier training. Additionally, the testing time taken by line 6 is O(n). Thus, the LOGREG takes O(1) + O(n) + O(t ∗ n) + O(n) = O(n), which can be beneficial for low latency applications that require a fast classification method. ∎

Proposition 3: The computational complexity of the MLP procedure is O(nd).
Proof: In the MLP procedure, lines 7-9 consist of simple assignments, i.e., O(1). Line 10 indicates the build_model() function of the neural network and its complexity is O(n ∗ d ∗ t ∗ e), where for Proposition 3, n represents the number of layers, d denotes the number of neurons in each layer, t is the number of training examples and e is the number of epochs. Because we are using 80% training examples, i.e., 664796 for 70 epochs, the complexity for this part is O(n ∗ d ∗ 664796 ∗ 70) = O(nd). Following, RandomSearch() (line 11) takes O(n) for the worst scenario and line 12 takes a constant amount of time, i.e., O(1). Line 13 takes O(t) and the testing time taken by line 14 is O(n). Thus, the MLP takes O(1) + O(nd) + O(n) + O(1) + O(t) + O(n) = O(nd) time. ∎

The overall complexity T of the proposed learning framework is represented in terms of n as: T(n) = O(n) + O(n) + O(nd) = O(n), where d is treated as a constant since the number of neurons per layer is bounded by the search range. Thus, it is a linear time learning work.

VI. PERFORMANCE EVALUATION
A. MODEL IMPLEMENTATION AND FRAMEWORKS
1) DATASET DESCRIPTION
In this work, we have used two different datasets provided by [33] and [25] consisting of IoT traffic traces in a smart environment. The description of both datasets is provided as follows:

Dataset 1 [33] consists of network traffic traces from 28 smart devices. As we have considered a subset of the network traffic, which is a total of 12000317 labeled instances of 22 IoT devices, for this dataset we have 22 distinctive classes. The devices are namely, smart phone, belkin wemo switch, belkin wemo motion sensor, dropcam, HP printer, iphone, laptop, nest protect smoke alarm, netatmo welcome, netatmo weather station, PIX star photo frame, samsung tab, samsung smartcam, smart things, TP link camera, TP link


plug, TP link router, triby speaker, withings smart baby monitor, withings smart scale, ipv4mcast and amazon echo.

Dataset 2 [25] consists of traffic traces from 81 IoT devices which are located at various US and UK locations. These devices belong to the cameras, smart hubs, home automation, TVs, audio devices and home appliances categories. For the second dataset, a total of 40588450 labeled instances of 68 IoT devices were used in this work.

A sample of the network trace used from the first dataset is provided in Fig. 6. Nonetheless, since we have used the same feature space for both datasets, Fig. 6 reflects the traces from the second dataset as well. The feature called "MAC address" of each device is used to provide the label to each network trace in both of the datasets.

FIGURE 6. Samples of IoT traffic traces from dataset 1.

2) EXPERIMENT SETUP
The configuration settings used for our experiments and for both datasets are listed in Table 6. The proposed model was implemented in Python (version 3.8.2). In Table 6, the No. of architectures represents the number of different classification solutions used during our experimentation. These architectures/solutions are further explained in section VI.3. Following, the total number of instances provides the number of labelled instances used from each dataset and the total number of classes represents the total number of distinct device types. The reason that we have selected a subset of the labelled instances for each dataset is that these datasets span over a period of about two months and the training of such a large amount of data can create several big data challenges. Furthermore, as shown later, we also managed to achieve a very good performance by using only the specific subset of these datasets. Accordingly, the selected subset of data under evaluation resulted in a slightly reduced number of classes for each dataset.

Regarding the number of tuner trials, this value represents the keras tuner trials that we executed for our proposed model. In more detail, for the first dataset, we noticed that after 5 trials we have achieved the best hyperparameter configuration and for the second dataset after 3 trials. The reason for executing several trials is that the keras tuner uses a different set of parameters (i.e. learning rate, number of layers and number of neurons in each layer) at each trial and then it selects the best performing configuration. Nonetheless, we have not seen a significant variation between the accuracy of the different trials. Lastly, we split both of the dataset instances into three groups as: 60% training instances, 20% validation and 20% testing instances, which is a common split ratio in the machine learning domain.

For the evaluation of the classification performance, we have considered the following well known classification metrics:
1) Precision: It is the ability of a classifier to not label an instance that is actually negative as positive and is given as:

$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}} \tag{9}$$

2) Recall: It is the fraction of the actual positive instances that are correctly identified, which is also called true positive rate and is given as:

$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}} \tag{10}$$

3) F1-score: It is the harmonic mean of the precision and recall metrics and is given as:

$$F1 = \frac{2 \ast \text{Precision} \ast \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$

4) Accuracy: It is the proportion of correctly classified instances and is given as:

$$\text{Accuracy} = \frac{\text{CorrectPredictions}}{\text{TotalPredictions}} \tag{12}$$

5) Confusion matrix: It is a table that is used to describe the classifier performance on a set of test data for which the true values are known.

The values of recall, precision, F1-score, confusion matrix and accuracy lie in [0,1], with 1 indicating the best and 0 the worst performance. For the loss function of the network, in contrast, a decrease from 1 towards 0 is good.

3) ARCHITECTURE MODELS
We have applied different composite models consisting of neural networks along with traditional machine learning algorithms to see their suitability for the IoT traffic multi classification problem. Table 7 provides the description of the different network architectures. The LR represents the logistic regression algorithm and GB denotes the gradient boosting algorithm (architecture I) [24]. The NB is the Naive Bayes algorithm at stage 0 and RF denotes applying random forest at stage 1 (architecture II) [23]. IP(x) stands for the input layer of the neural network with x number of neurons. FC(x) denotes the fully connected layer of the neural network with x number of nodes (or neurons). OP(x) represents the output layer of the neural network with x number of classes, i.e., neurons.

MLP represents the multi layer Perceptron neural network with an input layer consisting of 11 neurons, two fully connected layers and one output layer with 22 classes (architecture III). LR(RFE)+MLP denotes the logistic regression at stage 0 with the recursive feature elimination method and MLP at stage 1 with one input layer, two fully connected layers and one output layer (architecture IV). LR(Anova)+MLP (keras tuner) denotes the logistic regression at stage 0 with


TABLE 6. Configurations used in the experiments.

TABLE 7. Description of model architectures applied to the multi-classification problem.

the Anova based feature selection and MLP at stage 1 (architecture V), which is the two-stage learning model proposed in this paper.

For comparison purposes, it is important to mention that the accuracy of existing works is lower than that of the proposed framework, as shown in the following subsection. For example, the proposed framework in [16] achieves an accuracy of 99.0%, the authors in [21] achieve 96% accuracy, while in [22] the accuracy is 99.2%. However, for our evaluation, we compared the proposed framework with the architecture I [24] and architecture II [23], which both use the first dataset. Additionally, to better illustrate the efficiency of our work, we also compare our proposed architecture V with the architectures III and IV which are based on the MLP neural network. For all the neural network-based architectures (i.e. III to V), the training was done with a number of epochs between 50 and 100. The training was stopped earlier if an increase in the number of epochs did not lead to an improvement of the loss function.

Furthermore, for the activation functions we used the ReLU along with the softmax activation which was applied at the last output layer. The loss function used was the categorical cross entropy. Finally, the optimization was done with the Adaptive Gradient (AdaGrad) for the architectures III and IV and with Adam for architecture V. The particular configurations gave the best results for each of the examined architectures.

We have also experimented with different LSTM configurations. In particular, we executed five tuner trials to find the best hyperparameters such as number of layers, LSTM units, learning rate, etc. However, these models gave less accurate results (i.e., 70% of accuracy). Moreover, we also considered the AdaGrad optimizer for the architecture V but it produced an accuracy of 85% and we decided to show only the results of the best configuration, which uses the Adam optimizer.

B. RESULTS
1) IMPACT OF ARCHITECTURES
a: STAGE 0
Fig. 7 illustrates the performance of the different network architectures at stage 0, in terms of precision, recall and F1 score for both datasets. We have only considered the architectures I, II, IV, and V for this part, because architecture III, i.e., MLP, does not consist of two stages. In terms of the precision, our proposed architecture V provides the highest value, i.e., 0.74, followed by LR(RFE)+MLP with 0.72 and LR+GB with 0.69 for the first dataset. Regarding the second dataset, the same trend is noticed, as architecture V provides the highest value, i.e., 0.87, followed by LR(RFE)+MLP with 0.83 and LR+GB with a value of 0.79.

FIGURE 7. Performance comparison at stage 0.

In contrast, NB+RF performed poorly for both datasets, i.e., 0.6 for the first dataset and 0.4 for the second. This means that 40% of the instances predicted as positive were wrongly classified for the first dataset and 60% for the second. This can be attributed to the fact that the precision values of some devices were zero and less than 0.17 for many others. As an example, in the first dataset the most misclassified devices for the NB+RF were the Belkin Switch, HP printer, Netatmo Welcome, PIX-STAR, Samsung tab and TP link camera.

When looking into the recall metric, we see that the proposed architecture V also outperformed the rest of the models, followed by the LR+GB and LR(RFE)+MLP for the first dataset. However, for the second dataset, LR(RFE)+MLP(KT) is followed by LR(RFE)+MLP and LR+GB, while architecture V remains the most efficient solution. Once again NB+RF gives the least average recall for both datasets, with 0.61 and 0.29 for dataset 1 and 2. The reason for this behavior is that the majority of instances were


100% misclassified. For instance, for the first dataset, out of 22 classes, instances of 8 classes were 100% incorrectly classified.

Lastly, we observe that the architecture V gives the highest value of F1 score among all architectures at stage 0, with a value of 0.7 for the first dataset, followed by LR+GB and LR(RFE)+MLP which both give an F1-score of around 0.65, whereas NB+RF achieves only 0.6. For the second dataset, our proposed architecture presents an F1-score of 0.89 followed by LR(RFE)+MLP, LR+GB, and NB+RF which give an F1-score of 0.85, 0.80, and 0.28 respectively.

b: STAGE 1
At this stage all five network architectures are considered as shown in Fig. 8 for both datasets. Moreover, we also included the accuracy in our evaluation metrics, since the output of Stage 1 is our final classification. As it can be seen, our proposed architecture (LR(Anova)+MLP(KT)) attained an accuracy of 0.999, a precision of 0.996, a recall of 0.995 and an F1-score of 0.996 for the first dataset. Regarding the second dataset, it achieved an accuracy of 0.998, a precision of 0.996, a recall of 0.997 and an F1-score of 0.997. Furthermore, LR(RFE)+MLP(KT) provided reasonable results followed by the other architectures for both of the datasets.

FIGURE 8. Performance comparison at stage 1.

Once again, NB+RF continued to under-perform for both datasets at stage 1. Specifically, for the dataset 1, the NB+RF achieved a performance of only 0.78 for recall, 0.8 for precision and 0.77 for F1-score because 3335 training instances of the Belkin switch class, 374 instances of the HP printer class, 262 instances of the TP link camera class and 31 iPhone class instances were incorrectly classified. Similarly, for the dataset 2, the particular model achieved a performance of only 0.33 for recall, 0.29 for precision and 0.31 for F1-score because many instances of devices such as T philips hub US, TP link bulb US, Sousvide US, TP link plug UK, T wemo plug UK, T wemo plug US, Wans view cam wired US, Wans view cam wired UK, smart thing hub UK, Sousvide UK, T philips hub UK, TP link bulb UK and TP link plug US were incorrectly classified.

Additionally, the NB+RF provided an accuracy of 0.77 for dataset 1 and 0.92 for dataset 2. Further analysis showed that for the first dataset, there were 5 classes incorrectly classified out of 22 and for the second dataset, there were 13 misclassified classes out of 68. As accuracy is the ratio of these numbers, we corroborate the poor performance of architecture II as shown in Fig. 8.

After analyzing the results of stage 1, we conclude that our architecture V and its variation (architecture VI) provide the best classification results in terms of all performance metrics for both of the datasets. This is a significant observation that proves the robustness of our framework that works equally well for different datasets with different number of classes. That is not the case for architectures I-III, which presented a great deviation in the attained results between the two datasets.

2) IMPACT OF FEATURES
Fig. 3 illustrated the correlation of the full set of features for the first dataset. However, it is critical to understand which features have a higher importance (rank value) provided by the feature selection method in the classification process. For this purpose, we provide the full set of features along with their ranks, as calculated by Anova score and RFE for dataset 1, in Fig. 9. The most important features selected for both datasets are provided in Table 8.

FIGURE 9. Feature ranks provided by the feature selection methods.

For the architectures I, II, III, we have used all features during the training and testing phases, thus, we only compare the architectures IV and V to see the feature importance. Specifically, we illustrate the ranks provided by the RFE for architecture IV and the ranks provided by the Anova score for architecture V. The rank values are between 0 and 1. It can be seen that the highest rank provided by Anova was 0.8, given to feature 2, i.e., source IP address, and the least rank given by Anova score was 0.14 for feature 4, i.e., IP protocol used by device. For the RFE method, the highest rank was provided to feature 2, i.e., 0.7, and the least to feature 7, i.e., TTL information. The features were selected in decreasing order of their ranks by the architectures.

In more detail, Table 8 provides the information about the features utilized by each architecture along with the performance metrics of each architecture for both datasets. The first three architectures used all 11 features.
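The Anova-based ranking discussed above can be illustrated with a minimal sketch of a one-way ANOVA F-value per feature, followed by a top-k selection. This is not the authors' code (the paper relies on SelectKBest): the feature names and values are made up, and dividing by the largest F-value to obtain ranks in [0, 1] is an assumption, since the text does not state how the ranks are normalised.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value of one numeric feature against class labels.

    Assumes at least two classes and a non-zero within-class variance.
    """
    groups = {}
    for x, y in zip(values, labels):
        groups.setdefault(y, []).append(x)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups.values() for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_features(columns, labels):
    """Feature name -> F-value scaled by the largest F-value (assumed normalisation)."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    top = max(scores.values())
    return {name: s / top for name, s in scores.items()}

def select_k_best(ranks, k):
    """Names of the k features with the highest rank, in decreasing order."""
    return sorted(ranks, key=ranks.get, reverse=True)[:k]

# Hypothetical toy data: two device classes, two made-up features.
labels = ["cam", "cam", "hub", "hub"]
columns = {
    "feat_a": [1.0, 2.0, 5.0, 6.0],   # separates the classes well
    "feat_b": [1.0, 3.0, 2.0, 4.0],   # overlaps between the classes
}
ranks = rank_features(columns, labels)
best = select_k_best(ranks, 1)
```

A feature whose class means are far apart relative to the spread inside each class receives a high F-value, which is why such features are kept first when selecting in decreasing order of rank.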


TABLE 8. Classification performance metrics vs features employed.

As mentioned earlier, architecture IV selected the features by RFE and architecture V selected the features by the ANOVA method. For the first dataset, the features selected by RFE for architecture IV consist of the source IP address (f2), interarrival time (f1), source port number (f5), destination Ethernet address (f11), window size (f8), destination port number (f6), source Ethernet address (f10) and IP protocol used (f4). In contrast, the features selected by Anova for architecture V consist of source IP address (f2), packet length (f9), window size (f8), source Ethernet address (f10), destination port number (f6), TTL (f7), destination IP address (f3), and source port number (f5).

For the second dataset, the features selected by RFE in architecture IV are the source port number (f5), destination port number (f6), window size (f8), MAC address of source (f10) and MAC address of destination (f11). For the architecture V, the selected features are the type of protocol (f4), port number of source (f5), port number of destination (f6), TTL (f7) and window size (f8). Therefore, source IP address, packet length, window size, source Ethernet address, destination port number, TTL, destination IP address, and source port number are more relevant to classify labels for dataset 1, and the features protocol, port number of source, port number of destination, TTL and window size are more important for the classification in the second dataset.

To better illustrate the impact of feature selection on the resulting accuracy, we provide the following formal logic representation for the first dataset. Nonetheless, the same logic can be easily applied for the second dataset as well. In more detail, we represent the feature sets of dataset 1 selected by RFE and by Anova as R = {f2, f1, f5, f11, f8, f6, f10, f4} and A = {f2, f9, f8, f10, f6, f7, f3, f5} respectively. According to these sets, we model R ∩ A as follows:

$$R \cap A = \{x \mid x \in R \wedge x \in A\} \Leftrightarrow \{f_2, f_5, f_8, f_6, f_{10}\} \tag{13}$$

The intersection R ∩ A gives the features that were used by both architectures. However, in order to evaluate the impact of the feature selection on the overall performance, we need to identify the features that were not included in both architectures, which is captured as follows:

$$R - A = \{x \mid x \in R \wedge x \notin A\} \Leftrightarrow \{f_1, f_{11}, f_4\} \tag{14}$$

Equation (14) provides the features that are only included by RFE and these are the interarrival time, the destination MAC address and the IP protocol used. Since architecture IV presented an inferior performance to architecture V, we can safely say that these three features did not provide well aligned information with the features given by R ∩ A. Following, we extract the features included by the Anova score method but not by the RFE:

$$A - R = \{x \mid x \in A \wedge x \notin R\} \Leftrightarrow \{f_9, f_7, f_3\} \tag{15}$$

As (15) suggests, the packet length, TTL and destination IP address are the features that are only considered by Anova and thus, by architecture V. Interestingly, we see that when these features are included in R ∩ A such that (R ∩ A) ∪ (A − R) = A, the performance increased significantly. Thus, the features {f9, f7, f3} have a positive impact on the performance of architecture V as they increased the accuracy to 99.9%, precision to 99.6%, recall to 99.5% and F1-score to 99.6% for dataset 1.

3) PERFORMANCE OF ARCHITECTURE V
In this part of the evaluation, we present the detailed results of the proposed architecture V for the first dataset, however, the accuracy, precision, recall and F1 score for both datasets can be found in Table 8, as shown earlier.

FIGURE 10. Performance comparison per device for architecture V.

a: PERFORMANCE OF STAGE 0
As we have proved the superior performance of our proposed two-stage classifier (architecture V), in this part of the section


FIGURE 11. Confusion matrix for stage 0 of architecture V of dataset 1.

we delve into the details of the performance of the particular At the main diagonal there are four exception cases: (i) the
framework. worst classification is noticed for the iPhone device, since
Accordingly, for the first dataset, Fig. 10 illustrates the 58% instances of the particular device were classified as
performance metrics per device for stage 0. Some devices Samsung galaxy tab, 22% instances were misclassified as TP
such as Belkin sensor, Dropcam and TP link router presents link router, and 20% were misclassified as amazon echo thus
the highest performance, i.e., recall=1; precision=1 and F1- depicting 100% FPR; (ii) for the nest protect smoke alarm
score=1, all aggregated to 3. The lowest precision is noticed the classification value is 0% with 100% FPR because it
for the belkin wemo switch i.e., 0.61, while the lowest recall was misclassified as Samsung tab; (iii) for the triby speaker,
and F1-score are observed for the Samsung smartcam i.e., we notice a 28% of misclassification as laptop (Type II
0.53 and 0.65 respectively. Furthermore, for the SmartCam the aggregated value is 2.04 since the F1 score is 0.65 and the recall is 0.53, whereas the precision is significantly higher, i.e., 0.86. For the Netatmo weather station device, the aggregated value is 2.09, as the precision is reasonably good, i.e., 0.88, but the recall and F1 score are relatively low, i.e., 0.54 and 0.67. However, for some devices, such as the withings scale, triby speaker, nest alarm, and iPhone, the precision, recall and F1-score were all zero. The reason is that the instances of these devices were misclassified into other categories.
Next, we plot the confusion matrix of dataset 1 in Fig. 11 to show the overall performance of stage 0. The row entries of the confusion matrix depict the actual values and the column entries depict the predicted values for the 22 classes. All the diagonal entries correspond to correct classifications, whereas the entries above the diagonal are Type I errors (also called False Positive Rate (FPR)) and the entries below it are Type II errors (also called False Negative Rate (FNR)). The goal is to bring the Type I and Type II errors to zero or close to it.
error), and 72% of misclassification as netatmo welcome (Type II error); (iv) for the withings smart scale, we noticed 87% of misclassification as baby monitor (Type II error), 9.6% of misclassification as Samsung smartcam (Type II error), 1.9% of misclassification as Netatmo welcome, and 1.9% of instances incorrectly classified as belkin wemo switch.
This behavior is attributed to the following reasons: (a) there were 50 instances of iPhone compared to 3242, 87580 and 6231 instances of galaxy tab, TP link router and amazon echo respectively; (b) there were 41 nest protect smoke alarm instances compared to 3242 instances of Samsung galaxy tab; (c) there were 771 triby speaker instances compared to 21815 laptop instances and 3995 instances of netatmo welcome; and (d) there were 52 withings smart scale instances compared to 5912, 4895, 3995 and 4407 instances of baby monitor, Samsung smartcam, Netatmo welcome and belkin wemo switch respectively. Thus, the prediction value for the majority devices is much higher than for the iPhone, nest protect smoke alarm, triby speaker and withings scale.
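To make the relation between the confusion matrix, the per-class metrics and the class imbalance concrete, the following minimal Python sketch computes precision, recall and F1-score per class from a confusion matrix whose rows are actual classes and whose columns are predicted classes. It is an illustration under stated assumptions, not the code used in the experiments: the function names and the three-class toy matrix are hypothetical, and the aggregated value is assumed to be the plain sum of the three metrics, which is consistent with the values quoted above (0.86 + 0.53 + 0.65 = 2.04 and 0.88 + 0.54 + 0.67 = 2.09).

```python
def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: predicted as k
        actual_k = sum(cm[k])                          # row sum: truly of class k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f1)
    return metrics


def aggregated_value(precision, recall, f1):
    # Assumed definition: plain sum of the three metrics (not stated in the text).
    return precision + recall + f1


# Hypothetical toy counts: class 2 is a minority class with 50 instances,
# all of which are absorbed by the two majority classes.
toy_cm = [
    [3200, 40, 0],
    [60, 6100, 0],
    [45, 5, 0],
]

toy_metrics = per_class_metrics(toy_cm)
for k, (p, r, f1) in toy_metrics.items():
    print(f"class {k}: precision={p:.3f} recall={r:.3f} "
          f"f1={f1:.3f} aggregated={aggregated_value(p, r, f1):.3f}")
```

In this toy case the minority class is never predicted, so its column is empty and all three of its metrics are zero, which mirrors the behavior reported for the iPhone, nest protect smoke alarm, triby speaker and withings scale.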

VOLUME 10, 2022 21207


A. Hameed et al.: Deep Learning Approach for IoT Traffic Multi-Classification in Smart-City Scenario

FIGURE 12. Training vs. validation accuracy of architecture V for 100 epochs.

FIGURE 13. Training vs. testing loss functions for stage 1 of architecture V.

b: PERFORMANCE OF STAGE 1
Fig. 12 depicts the training and validation accuracy over the 100 epochs for the first dataset. The network model, i.e., the optimized MLP at stage 1 of LR(Anova)+MLP (keras tuner), achieves better training accuracy, i.e., 0.9997292, and validation accuracy, i.e., 0.99962693, as the number of epochs increases. The accuracy starts from 0.998 at epoch 1 and does not change significantly after epoch 60. Regarding the spikes noticed, Keras Tuner estimates a close-to-optimal neural network topology using an exploitation versus exploration approach.
In the exploitation stage, it tries to improve the neural network topology that outputs the most accurate results. In the exploration stage, it randomly examines new neural network topologies that have not been explored yet. The exploration may help the optimisation process escape from a local optimum, at the cost of the spikes noticed in Fig. 12. Yet, the optimisation process manages to converge thanks to the exploitation stage.
Next, we plot the loss function for the training and testing datasets across the 100 epochs, as shown in Fig. 13. The learning curve shows the decay of the categorical cross entropy loss function with respect to the number of epochs. This curve helps to determine whether our model overfits, underfits, or fits both the testing and training datasets well. We see that the loss function for both training and testing decays to low values, i.e., 0.001193 for the training and 0.001516 for the testing datasets at epoch 100. The spikes are due to the use of a random search hyper tuner and the reasons discussed above. Furthermore, the training and testing losses decrease and stabilize around the same point, i.e., after epoch 80 for the training data. The model thus successfully captures the classification patterns.

FIGURE 14. Comparison of performance metrics for stage 1 of architecture V over 100 epochs.

Next, Fig. 14 depicts the performance metrics for the 100 epochs at stage 1. The precision is the highest of the three performance metrics, i.e., 0.996923 at epoch 100. It can also be observed that the precision of the neural network does not exhibit significant changes after epoch 80. The recall is lower than the precision and F1-score, i.e., 0.9957 at epoch 100, and it remains constant after epoch 95. The F1-score is 0.9964 at epoch 90 and does not present any significant changes after this point.

C. LIMITATIONS
Even though our framework provides very encouraging results, it still presents some limitations that stem from the intrinsic data nature of the IoT traffic multi-classification problem. These include the extra overhead of monitoring the infrastructure to collect the traces, the construction of

a training dataset, and the computational overhead for the model training. In addition, classification is a supervised learning task. This means that if new types of IoT devices are connected to the local network, a new cycle of data collection, annotation and training must begin in order to update the model.

VII. CONCLUSION
In this work, we studied the problem of IoT traffic classification. To solve this problem, we proposed a composite learning framework that consists of two stages. After an initial data preprocessing, the network traces are passed to stage 0, where a feature selection mechanism and a Logistic Regression classifier are applied. In particular, an ANOVA filter-based selection technique decides on the most important features to be used by the stage 0 classifier. The tentative classification of the stage 0 classifier along with the remaining features is then passed to the stage 1 classifier, which uses an optimal multi-layer perceptron neural network architecture that provides the final classification.
Next, we carried out a detailed experimentation and comparison with various composite architectures on two different IoT datasets. We concluded that the proposed framework can considerably increase the performance of the classification in terms of recall, precision, F1-score, accuracy and confusion matrix metrics. Regarding the accuracy, our proposed model achieved 99.9% on the first dataset and 99.8% on the second dataset, demonstrating the generalization capability of our approach.
The particular model is of utmost importance in an IoT to Cloud continuum communication model, where different IoT devices need to be classified and their traffic profiles accurately predicted. This precise classification can positively contribute to the proper estimation of the resources required from the subsequent Edge and Cloud layers, where the IoT traffic will be processed and analyzed.
The future direction of this work lies in the combination of our proposed model with a resource allocation mechanism that will be able to leverage this workload estimation and dynamically change the allocation strategy at the access and Edge networks. Finally, we aim to include other machine learning techniques, such as K-means clustering along with other unsupervised methods, to address the limitation of classifying new and unknown types of IoT devices.

AROOSA HAMEED received the master’s degree in computer science from Quaid-i-Azam University, Islamabad, Pakistan, in 2018. She is currently pursuing the Ph.D. degree with the Department of Software and Information Technology Engineering, École de Technologie Supérieure (ETS), Montreal. Her main research interests include the Internet of Things (IoT), traffic analytics, the IoT services, the IoT security, and machine learning.

JOHN VIOLOS was a Research Associate at the National Technical University of Athens, a Sessional Lecturer at the Harokopio University of Athens, and a Visiting Lecturer at the National and Kapodistrian University of Athens. He was a member of the European Commission’s Digital Single Market working group on the code of conduct for switching and porting data between cloud service providers. He is currently a Research Associate with the Department of Software Engineering and Information Technology, ETS. His research interests include deep learning, machine learning, and cloud and edge computing.

ARIS LEIVADEAS (Senior Member, IEEE) received the Diploma degree in electrical and computer engineering from the University of Patras, Greece, in 2008, the M.Sc. degree in engineering from King’s College London, U.K., in 2009, and the Ph.D. degree in electrical and computer engineering from the National and Technical University of Athens, Greece, in 2015. From 2015 to 2018, he was a Postdoctoral Researcher with the Department of Systems and Computer Engineering, Carleton University, Ottawa, ON, Canada. In parallel, he worked as an Intern at Ericsson and then at Cisco, Ottawa. He is currently an Associate Professor with the Department of Software and Information Technology Engineering, École de Technologie Supérieure (ETS), University of Quebec, Canada. His research interests include cloud computing, the IoT, and network optimization and management. He received the Best Paper Award in ACM ICPE 2018 and IEEE iThings 2021 and the Best Presentation Award in IEEE HPSR 2020.
