
Received April 5, 2019, accepted April 26, 2019, date of publication May 14, 2019, date of current version June 3, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2916648

Unsupervised Machine Learning for Networking: Techniques, Applications and Research Challenges
MUHAMMAD USAMA 1, JUNAID QADIR 1, AUNN RAZA 2, HUNAIN ARIF 2,
KOK-LIM ALVIN YAU 3, YEHIA ELKHATIB 4, AMIR HUSSAIN 5,6, AND ALA AL-FUQAHA 7,8
1 Information Technology University (ITU)-Punjab, Lahore 54000, Pakistan
2 National University of Science and Technology (NUST), Islamabad 44000, Pakistan
3 Sunway University, Subang Jaya 47500, Malaysia
4 The School of Computing and Communications, Lancaster University, Lancaster LA1 4WA, U.K.
5 School of Computing, Edinburgh Napier University, Edinburgh EH11 4BN, U.K.
6 Taibah Valley, Taibah University, Medina 42353, Saudi Arabia
7 Information and Computing Technology (ICT) Division, College of Science and Engineering (CSE), Hamad Bin Khalifa University, Doha, Qatar
8 Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, USA

Corresponding author: Ala Al-Fuqaha ([email protected])


The publication of this article was funded by the Qatar National Library (QNL). The statements made herein are solely the responsibility of
the authors.

ABSTRACT While machine learning and artificial intelligence have long been applied in networking
research, the bulk of such works has focused on supervised learning. Recently, there has been a rising
trend of employing unsupervised machine learning using unstructured raw network data to improve network
performance and provide services, such as traffic engineering, anomaly detection, Internet traffic classification,
and quality of service optimization. The growing interest in applying unsupervised learning techniques
in networking stems from their great success in other fields, such as computer vision, natural language
processing, speech recognition, and optimal control (e.g., for developing autonomous self-driving cars).
In addition, unsupervised learning can free us from the need for labeled data and manual handcrafted
feature engineering, thereby facilitating flexible, general, and automated methods of machine learning. The
focus of this survey paper is to provide an overview of applications of unsupervised learning in the domain of
networking. We provide a comprehensive survey highlighting recent advancements in unsupervised learning
techniques, and describe their applications in various learning tasks, in the context of networking. We also
provide a discussion on future directions and open research issues, while identifying potential pitfalls. While
a few survey papers focusing on applications of machine learning in networking have previously been
published, a survey of similar scope and breadth is missing in the literature. Through this timely review,
we aim to advance the current state of knowledge, by carefully synthesizing insights from previous survey
papers, while providing contemporary coverage of the recent advances and innovations.

INDEX TERMS Machine learning, deep learning, unsupervised learning, computer networks.

I. INTRODUCTION
Networks, such as the Internet and mobile telecom networks, serve the function of the central hub of
modern human societies, which the various threads of modern life weave around. With networks
becoming increasingly dynamic, heterogeneous, and complex, the management of such networks has
become less amenable to manual administration, and it can benefit from leveraging support from
methods for optimization and automated decision-making from the fields of artificial intelligence
(AI) and machine learning (ML). Such AI and ML techniques have already transformed multiple fields
(e.g., computer vision, natural language processing (NLP), speech recognition, and optimal control,
such as for developing autonomous self-driving vehicles), with the success of these techniques
mainly attributed to firstly, significant advances in unsupervised ML techniques such as deep
learning, secondly, the ready availability of large amounts of unstructured raw data amenable to
processing by unsupervised learning algorithms,

The associate editor coordinating the review of this manuscript and approving it for publication was
Nuno Garcia.

2169-3536 © 2019 IEEE. Translations and content mining are permitted for academic research only.
Personal use is also permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VOLUME 7, 2019, page 65579
M. Usama et al.: Unsupervised Machine Learning for Networking: Techniques, Applications, and Research Challenges

and finally, advances in computing technologies such as cloud computing, graphics processing unit
(GPU) technology, and other hardware enhancements. It is anticipated that AI and ML will also make a
similar impact on the networking ecosystem and will help realize a future vision of cognitive
networks [1], [2], in which networks will self-organize and will autonomously implement intelligent
network-wide behavior to solve problems such as routing, scheduling, resource allocation, and
anomaly detection. The initial attempts towards creating cognitive or intelligent networks have
relied mostly on supervised ML methods, which are efficient and powerful but are limited in scope by
their need for labeled data. With network data becoming increasingly voluminous (with a
disproportionate rise in unstructured unlabeled data), there is a groundswell of interest in
leveraging unsupervised ML methods to utilize unlabeled data, in addition to labeled data where
available, to optimize network performance [3]. The rising interest in applying unsupervised ML in
networking applications also stems from the need to liberate ML applications from the restrictive
demands of supervised ML. Another reason for employing unsupervised ML in networking is the expense
of curating labeled network data at scale: labeled data may be unavailable, manual annotation is
prohibitively inconvenient, and labels become outdated quickly (due to the highly dynamic nature of
computer networks) [4].
We are already witnessing the failure of human network administrators to manage and monitor all bits
and pieces of the network [5], and the problem will only be exacerbated by further growth in the
size of networks with paradigms such as the Internet of things (IoT). An ML-based network management
system (NMS) is desirable in such large networks so that faults/bottlenecks/anomalies may be
predicted in advance with reasonable accuracy. In this regard, networks already have an ample amount
of untapped data, which can provide us with decision-making insights for making networks more
efficient and self-adapting. With unsupervised ML, the pipe dream is that every algorithm for
adjusting network parameters (be it the TCP congestion window or the rerouting of network traffic
during peak time) will optimize itself in a self-organizing fashion according to the environment and
application, user, and network Quality of Service (QoS) requirements and constraints [6].
Unsupervised ML methods, in concert with existing supervised ML methods, can provide a more
efficient method that lets a network manage, monitor, and optimize itself while keeping the human
administrators in the loop with the provisioning of timely actionable information.
Next generation networks are expected to be self-driven, which means they have the ability to
self-configure, optimize, and heal [7]. All these self-driven properties can be achieved by building
artificial intelligence in the system using ML techniques. Self-driven networks are supposed to
utilize the network data to perform networking chores, and most of the network data is imbalanced
and unlabeled. In order to develop a reliable data-driven network, data quality must be taken care
of before subjecting the data to appropriate unsupervised ML [8]. Unsupervised ML techniques
facilitate the analysis of raw datasets, thereby helping in generating analytic insights from
unlabeled data. Recent advances in hierarchical learning, clustering algorithms, factor analysis,
latent models, and outlier detection have helped significantly advance the state of the art in
unsupervised ML techniques. In particular, recent unsupervised ML advances, such as the development
of "deep learning" techniques [22], have significantly advanced the ML state of the art by
facilitating the processing of raw data without requiring careful engineering and domain expertise
for feature crafting. Deep learning is a class of machine learning in which hierarchical
architectures are used for unsupervised feature learning, and these learned features are then used
for classification and other related tasks [23]. The versatility of deep learning and distributed ML
can be seen in the diversity of their applications that range from self-driving cars to the
reconstruction of brain circuits [22]. Unsupervised learning is also often used in conjunction with
supervised learning in a semi-supervised learning setting to preprocess the data before analysis and
thereby help in crafting a good feature representation and in finding patterns and structures in
unlabeled data.
The rapid advances in deep neural networks, the democratization of enormous computing capabilities
through cloud computing and distributed computing, and the ability to store and process large
swathes of data have motivated a surging interest in applying unsupervised ML techniques in the
networking field. The field of networking also appears to be well suited to, and amenable to,
applications of unsupervised ML techniques, due to the largely distributed decision-making nature of
its protocols, the availability of large amounts of network data, and the urgent need for
intelligent/cognitive networking. Consider the case of routing in networks. Networks these days have
evolved to be very complex, and they incorporate multiple physical paths for redundancy and utilize
complex routing methodologies to direct the traffic. The application traffic does not always take
the optimal path we would expect, leading to unexpected and inefficient routing performance. To tame
such complexity, unsupervised ML techniques can autonomously self-organize the network taking into
account a number of factors such as real-time network congestion statistics as well as application
QoS requirements [24].
The purpose of this paper is to highlight the important advances in unsupervised learning, and after
providing a tutorial introduction to these techniques, to review how such techniques have been, or
could be, used for various tasks in modern next-generation networks comprising both computer
networks as well as mobile telecom networks.

A. CONTRIBUTION OF THE PAPER
To the best of our knowledge, there does not exist a survey that specifically focuses on the
important applications of unsupervised ML techniques in networks, even though a number of surveys
exist that focus on specific ML applications pertaining to networking, for instance, surveys on
using ML



TABLE 1. Comparison of our paper with existing survey and review papers (Legend: ✓ means covered; × means not covered; ≈ means partially covered).

for cognitive radios [11], traffic identification and classification [10], and anomaly detection
[9], [15]. Previous survey papers have either focused on specific unsupervised learning techniques
(e.g., [25] have provided a survey of the applications of neural networks in wireless networks) or
on some specific applications of computer networking ([13] have provided a survey of the
applications of ML in cyber intrusion detection). Our survey paper is timely since there is great
interest in deploying automated and self-taught unsupervised learning models in industry and
academia. Due to relatively limited applications of unsupervised learning in networking (in
particular, the deep learning trend has not yet impacted networking in a major way), unsupervised
learning techniques hold a lot of promise for advancing the state of the art in networking in terms
of adaptability, flexibility, and efficiency. The novelty of this survey is that it covers many
different important applications of unsupervised ML techniques in computer networks and provides
readers with a comprehensive discussion of the unsupervised ML trends, as well as the suitability of
various unsupervised ML techniques. A tabulated comparison of our paper with other existing survey
and review articles is presented in Table 1.

B. ORGANIZATION OF THE PAPER
The organization of this paper is depicted in Figure 1. Section II provides a discussion on various
unsupervised ML techniques (namely, hierarchical learning, data clustering, latent variable models,
and outlier detection). Section III presents a survey of the applications of unsupervised ML
specifically in the domain of computer networks. Section IV describes future work and opportunities
with respect to the use of unsupervised ML in future networking. Section V discusses a few major
pitfalls of the unsupervised ML approach and its models. Finally, Section VI concludes this paper.
For the reader’s convenience, Table 2 lists all the acronyms used in this survey.

II. TECHNIQUES FOR UNSUPERVISED LEARNING
In this section, we will introduce some widely used unsupervised learning techniques and their
applications in computer networks. We have divided unsupervised learning techniques into the
following major categories: hierarchical learning, data clustering, latent variable models,
dimensionality reduction, and outlier detection. Figure 2 depicts a taxonomy of unsupervised
learning techniques and also the relevant sections in which these techniques are discussed. To
provide a better understanding of the application of unsupervised ML techniques in networking, we
have added a few subsections highlighting significant applications of unsupervised ML techniques in
the networking domain.

A. HIERARCHICAL LEARNING
Hierarchical learning is defined as learning simple and complex features from a hierarchy of
multiple linear and nonlinear activations. In learning models, a feature is a measurable property of
the input data. Desired features are ideally informative, discriminative, and independent. In
statistics, features are also known as explanatory (or independent) variables [26]. Feature learning
(also known as data representation learning) is a set of techniques that can learn one or more


FIGURE 1. Outline of the paper.

FIGURE 2. Taxonomy of unsupervised learning techniques.

features from input data [27]. It involves the transformation of raw data into a quantifiable and
comparable representation, which is specific to the property of the input but general enough for
comparison to similar inputs. Conventionally, features are handcrafted for the application at hand.
Handcrafting relies on domain knowledge, but even then such features do not generalize well to the
variation of real-world data, which gives rise to automated learning of generalized features from
the underlying structure of the input data. Like other learning algorithms, feature learning is also
divided among domains of supervised and unsupervised learning depending on the type of available
data. Almost all unsupervised learning algorithms undergo a stage of feature extraction in order to
learn data representation from unlabeled data and generate


TABLE 2. List of common acronyms used.

a feature vector on the basis of which further tasks are performed.
Hierarchical learning is intimately related to how deep learning is performed in modern multi-layer
neural networks. In particular, deep learning techniques benefit from the fundamental concept of
artificial neural networks (ANNs): a deep structure consisting of multiple hidden layers with
multiple neurons in each layer, a nonlinear activation function, a cost function, and a
back-propagation algorithm. Deep learning [40] is a hierarchical technique that models high-level
abstraction in data using many layers of linear and nonlinear transformations. With a deep enough
stack of these transformation layers, a machine can self-learn a very complex model or
representation of data. Learning takes place in hidden layers and the optimal weights and biases of
the neurons are updated in two passes, namely, the forward pass and backward pass. A typical ANN and
typical cyclic and acyclic topologies of interconnection between neurons are shown in Figure 3. A
brief taxonomy of unsupervised NNs is presented in Figure 4.
An ANN has three types of layers (namely input, hidden and output, each having different activation
parameters). Learning is the process of assigning optimal activation parameters enabling the ANN to
perform input to output mapping. For a given problem, an ANN may require multiple hidden layers
involving a long chain of computations, i.e., its depth [41]. Deep learning has revolutionized ML
and is now increasingly being used in diverse settings, e.g., object identification in images,
speech transcription into text, matching user’s interests with items (such as news items, movies,
products) and making recommendations. But until 2006, relatively few people were interested in deep
learning due to the high computational cost of deep learning procedures. It was widely believed that
training deep learning architectures in an unsupervised manner was intractable, and supervised
training of deep NNs (DNNs) also showed poor performance with large generalization errors [42].
However, recent advances [43]–[45] have shown that deep learning can be performed efficiently by
separate unsupervised pre-training of each layer, with the results revolutionizing the field of ML.
Starting from the input (observation) layer, which acts as an input to the subsequent layers,
pre-training tends to learn data distributions while the usual supervised stage performs a local
search for fine-tuning.

1) UNSUPERVISED MULTILAYER FEED FORWARD NN
An unsupervised multilayer feedforward NN, with reference to graph theory, has a directed graph
topology as shown in Figure 3. It contains no cycles, i.e., it has no feedback path in input
propagation through the NN. Such NNs are often used to approximate a nonlinear mapping between
inputs and required outputs. Autoencoders are the prime examples of unsupervised multilayer
feedforward NNs.

a: AUTOENCODERS
An autoencoder is an unsupervised learning algorithm for ANNs used to learn a compressed and encoded
representation of data, mostly for dimensionality reduction and for unsupervised pre-training of
feedforward NNs. Autoencoders are generally designed using an approximation function and trained
using backpropagation and stochastic gradient descent (SGD) techniques. Autoencoders are the first
of their kind to use the back-propagation algorithm to train with unlabeled data. Autoencoders aim
to learn a compact representation of the function of input using the same number of input and output
units with usually fewer hidden units to encode a feature vector. They learn the input data function
by recreating the input at the output, which is called encoding/decoding, to learn


FIGURE 3. Illustration of an ANN (left); Different types of ANN topologies (right).

FIGURE 4. Taxonomy of unsupervised neural networks.
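
The autoencoder idea discussed in this section (the same number of input and output units, fewer hidden units, and training by backpropagation and SGD so that the output recreates the input) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the survey: the synthetic 4-dimensional data, the tanh hidden layer with a linear output layer, the learning rate, and the function names `make_data`, `encode`, and `train_autoencoder`.

```python
import math
import random


def make_data(n=40, seed=1):
    """Synthetic 4-D points with only 2 degrees of freedom (illustrative data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a, b = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)
        data.append([a, b, a, b])  # redundant copies, so 2 hidden units can suffice
    return data


def encode(params, x):
    """Map an input vector to its compressed hidden representation."""
    W1, b1, _, _ = params
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]


def train_autoencoder(data, n_hidden=2, epochs=200, lr=0.05, seed=0):
    """Train a one-hidden-layer autoencoder with plain SGD.

    Returns (params, history), where history holds the mean squared
    reconstruction error measured during each epoch.
    """
    rng = random.Random(seed)
    n_in = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b2 = [0.0] * n_in
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            # Forward pass: encode to the bottleneck, then decode.
            h = encode((W1, b1, W2, b2), x)
            y = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]
            err = [yi - xi for yi, xi in zip(y, x)]  # the target is the input itself
            total += sum(e * e for e in err) / n_in
            # Backward pass: propagate the reconstruction error to the hidden layer
            # (computed before the decoder weights are changed).
            dh = [sum(err[i] * W2[i][j] for i in range(n_in)) * (1.0 - h[j] ** 2)
                  for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    W2[i][j] -= lr * err[i] * h[j]
                b2[i] -= lr * err[i]
            for j in range(n_hidden):
                for k in range(n_in):
                    W1[j][k] -= lr * dh[j] * x[k]
                b1[j] -= lr * dh[j]
        history.append(total / len(data))
    return (W1, b1, W2, b2), history
```

Calling `train_autoencoder(make_data())` returns the learned parameters together with the per-epoch reconstruction error, and `encode(params, x)` then yields the 2-dimensional feature vector for a 4-dimensional input. No labels are used at any point; the input serves as its own training target.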

at the time of training NN. In short, a simple autoencoder learns a low-dimensional representation
of the input data by exploiting similar recurring patterns.
Autoencoders have different variants [46] such as variational autoencoders, sparse autoencoders, and
denoising autoencoders. A variational autoencoder is an unsupervised learning technique used for
clustering, dimensionality reduction, and visualization, and for learning complex distributions
[47]. In a sparse autoencoder, a sparse penalty on the latent layer is applied for extracting a
unique statistical feature from unlabeled data. Finally, denoising autoencoders are used to learn
the mapping of a corrupted data point to its original location in the data space in an unsupervised
manner for manifold learning and reconstruction distribution learning.

2) UNSUPERVISED COMPETITIVE LEARNING NN
Unsupervised competitive learning NNs follow a winner-take-all neuron scheme, where each neuron
competes for the right to respond to a subset of the input data. This scheme is used to remove the
redundancies from the unstructured data. Two major techniques of unsupervised competitive learning
NNs are self-organizing maps and adaptive resonance theory NNs.

a: SELF-ORGANIZING/KOHONEN MAPS
Self-Organizing Maps (SOM), also known as Kohonen’s maps [48], [49], are a special class of NNs that
uses the concept of competitive learning, in which output neurons compete amongst themselves to be
activated in a real-valued output, with the result that only a single neuron (or group of neurons),
called the winning neuron, is activated. This is achieved by creating lateral inhibition connections
(negative feedback paths) between neurons [50]. In this arrangement, the network determines the
winning neuron within several iterations; subsequently, it is forced to reorganize itself based on
the input data distribution (hence they are called Self-Organizing Maps). They were initially
inspired by the human brain, which has specialized regions in which different sensory inputs are
represented/processed by topologically ordered computational maps. In SOM, neurons are arranged on
vertices of a lattice (commonly one or two dimensions). The network is forced to represent
higher-dimensional data in a lower-dimensional representation by preserving the topological
properties of input data by using a neighborhood function while transforming the input into a
topological space in which neuron positions in the space are representatives of intrinsic
statistical

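
The three SOM training stages that this section names (competition, cooperation, and adaptation) can be sketched on a one-dimensional lattice. This is a minimal sketch under stated assumptions rather than the procedure of any cited work: the Gaussian neighborhood function, the linearly decaying learning rate and neighborhood width, the use of smallest Euclidean distance as the discriminant, and the names `train_som`, `best_matching_unit`, and `two_clusters` are all illustrative choices.

```python
import math
import random


def best_matching_unit(weights, x):
    """Competition: the unit whose weight vector is closest to x wins.

    Smallest Euclidean distance is used as the discriminant here, which is
    equivalent to the largest value of the negated distance.
    """
    dists = [sum((wk - xk) ** 2 for wk, xk in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def train_som(data, n_units=5, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a 1-D self-organizing map; returns the list of unit weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or n_units / 2.0
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit the samples in random order
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                   # learning rate decays over time
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighborhood shrinks over time
            winner = best_matching_unit(weights, x)   # competition
            for j, w in enumerate(weights):
                # Cooperation: lattice neighbors of the winner are excited too.
                h = math.exp(-((j - winner) ** 2) / (2.0 * sigma ** 2))
                # Adaptation: move the excited units toward the input.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def two_clusters(n=30, seed=42):
    """Illustrative 2-D data: two well-separated groups of points."""
    rng = random.Random(seed)
    a = [[0.1 + rng.uniform(-0.05, 0.05), 0.1 + rng.uniform(-0.05, 0.05)] for _ in range(n)]
    b = [[0.9 + rng.uniform(-0.05, 0.05), 0.9 + rng.uniform(-0.05, 0.05)] for _ in range(n)]
    return a + b
```

After `weights = train_som(two_clusters())`, inputs near the two groups are expected to be won by different lattice units, which is the clustering behavior the text describes. The early iterations, with a wide neighborhood, roughly correspond to the ordering phase, and the later ones, with a narrow neighborhood and small learning rate, to the convergence phase.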

features that tell us about the inherently nonlinear nature of SOMs.
Training a network comprising SOM is essentially a three-stage process after random initialization
of weighted connections. The three stages are as follows [51].
• Competition: Each neuron in the network computes its value using a discriminant function, which
provides the basis of competition among the neurons. The neuron with the largest discriminant value
in the competition group is declared the winner.
• Cooperation: The winner neuron then locates the center of the topological neighborhood of excited
neurons in the previous stage, providing a basis for cooperation among excited neighboring neurons.
• Adaption: The excited neurons in the neighborhood increase/decrease their individual values of the
discriminant function in regard to input data distribution through subtle adjustments such that the
response of the winning neuron is enhanced for similar subsequent input. The adaption stage is
distinguishable into two sub-stages: (1) the ordering or self-organizing phase, in which weight
vectors are reordered according to topological space; and (2) the convergence phase, in which the
map is fine-tuned and declared accurate to provide statistical quantification of the input space.
This is the phase in which the map is declared to be converged and hence trained.
One essential requirement in training a SOM is the redundancy of the input data to learn about the
underlying structure of neuron activation patterns. Moreover, a sufficient quantity of data is
required for creating distinguishable clusters; notwithstanding enough data for a classification
problem, there exists a problem of gray areas between clusters and the creation of infinitely small
clusters where input data has minimal patterns.

b: ADAPTIVE RESONANCE THEORY
Adaptive Resonance Theory (ART) is another category of NN models that is based on the theory of
human cognitive information processing. It can be explained as an algorithm of incremental
clustering which aims at forming multi-dimensional clusters, automatically discriminating and
creating new categories based on input data. Primarily, ART models are classified as unsupervised
learning models; however, there exist ART variants that employ supervised and semi-supervised
learning approaches as well. The main setback of most NN models is that they lose old information
(updating/diminishing weights) as new information arrives; an ideal model should therefore be
flexible enough to accommodate new information without losing the old one, and this is called the
plasticity-stability problem. ART models provide a solution to this problem by self-organizing in
real time and creating a competitive environment for neurons, automatically discriminating/creating
new clusters among neurons to accommodate any new information.
An ART model resonates around (top-down) observer expectations and (bottom-up) sensory information
while keeping their difference within the threshold limits of the vigilance parameter, in which case
the input is considered a member of the expected class of neurons [52]. Learning of an ART model
primarily consists of a comparison field, a recognition field, a vigilance (threshold) parameter,
and a reset module. The comparison field takes an input vector, which is then passed to its best
match in the recognition field; the best match is the current winning neuron. Each neuron in the
recognition field passes a negative output in proportion to the quality of the match, which inhibits
other outputs, therefore exhibiting lateral inhibitions (competitions). Once the winning neuron is
selected after a competition with the best match to the input vector, the reset module compares the
quality of the match to the vigilance threshold. If the winning neuron is within the threshold, it
is selected as the output; otherwise the winning neuron is reset and the process is started again to
find the next best match to the input vector. In case no neuron is capable of passing the threshold
test, a search procedure begins in which the reset module disables recognition neurons one at a time
to find a correct match whose weight can be adjusted to accommodate the new match; therefore ART
models are called self-organizing and can deal with the plasticity/stability dilemma.

3) UNSUPERVISED DEEP NN
In recent years unsupervised deep NN has become the most successful unsupervised structure due to
its application in many benchmarking problems and applications [53]. Three major types of
unsupervised deep NNs are deep belief NNs, deep autoencoders, and convolutional NNs.

a: DEEP BELIEF NN
Deep Belief Neural Network or simply Deep Belief Network (DBN) is a probability-based generative
graph model that is composed of hierarchical layers of stochastic latent variables having binary
valued activations, which are referred to as hidden units or feature detectors. The top layers in
DBNs have undirected, symmetric connections between them forming an associative memory. DBNs
provided a breakthrough in the unsupervised learning paradigm. In the learning stage, a DBN learns
to reconstruct its input, with each layer acting as feature detectors. A DBN can be trained by
greedy layer-wise training starting from the bottom (visible) layer with raw input; subsequent
layers are trained using the output of the previously trained layer as their input [43]. Once the
network is trained in an unsupervised manner and has learned the distribution of the data, it can be
fine-tuned using supervised learning methods, or supervised layers can be concatenated in order to
achieve the desired task (for instance, classification).

b: DEEP AUTOENCODER
Another famous type of DBN is the deep autoencoder, which is composed of two symmetric DBNs, the
first of which is used to encode the input vector, while the second decodes. By the end of the
training of the deep autoencoder, it tends to reconstruct the input vector at the output neurons,
and

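
The ART search cycle described in this section (pick the best-matching recognition neuron, test the match against the vigilance threshold, reset and try the next best match on failure, and create a new category when nothing passes) can be sketched for binary input patterns. This is a simplified, fast-learning ART1-style sketch and not the full model of [52]. The choice function, the match ratio, the intersection-based weight update, the small constant `alpha`, and the function name `art1_cluster` are assumptions drawn from the commonly used textbook formulation.

```python
def art1_cluster(patterns, vigilance=0.7, alpha=0.001):
    """Incrementally cluster binary patterns with an ART1-style vigilance test.

    Returns (labels, prototypes): the category index assigned to each pattern
    and the learned binary prototype of every category.
    """
    prototypes = []
    labels = []
    for x in patterns:
        norm_x = sum(x)

        def choice(j):
            # Recognition field: score category j by how much of its
            # prototype is matched by the input.
            w = prototypes[j]
            overlap = sum(a & b for a, b in zip(x, w))
            return overlap / (alpha + sum(w))

        chosen = None
        # Try categories from best to worst match; a failed vigilance test
        # plays the role of the reset module disabling that neuron.
        for j in sorted(range(len(prototypes)), key=choice, reverse=True):
            overlap = sum(a & b for a, b in zip(x, prototypes[j]))
            if norm_x > 0 and overlap / norm_x >= vigilance:
                # Resonance: refine the prototype without erasing other categories.
                prototypes[j] = [a & b for a, b in zip(x, prototypes[j])]
                chosen = j
                break
        if chosen is None:
            # No existing category is close enough, so a new one is created.
            prototypes.append(list(x))
            chosen = len(prototypes) - 1
        labels.append(chosen)
    return labels, prototypes
```

For example, with `vigilance=0.6` the patterns `[1,1,1,0,0,0]`, `[1,1,0,0,0,0]`, `[0,0,0,1,1,1]`, and `[0,0,0,0,1,1]` fall into two categories, whereas a higher vigilance makes the model split inputs more readily into new categories. Because a new input either refines one prototype or opens a new category, previously learned categories are left untouched, which is how this scheme addresses the plasticity-stability problem.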


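
The building blocks that this section attributes to convolutional NNs (locally connected units with shared weights, followed by rectifying and pooling layers) can be sketched as three small functions. This is only an illustration of the forward computation with a fixed, hand-picked kernel; no learning is performed, and the names `conv2d`, `relu`, `max_pool`, and `block_image` as well as the toy 6x6 images are assumptions made for the example.

```python
def conv2d(image, kernel):
    """Slide one shared kernel over the image (valid cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Rectifying layer: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Pooling layer: keep the strongest response in each size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


def block_image(top, left, size=6):
    """Toy image: zeros everywhere except a 2x2 block of ones at (top, left)."""
    img = [[0] * size for _ in range(size)]
    for r in range(top, top + 2):
        for c in range(left, left + 2):
            img[r][c] = 1
    return img
```

With `kernel = [[1, 1], [1, 1]]`, the pooled maps `max_pool(relu(conv2d(block_image(1, 1), kernel)))` and `max_pool(relu(conv2d(block_image(0, 0), kernel)))` both carry their strongest response in the same top-left cell, even though the block has moved by one pixel. This tolerance to small shifts inside a pooling window is a simple instance of the transformation-invariant feature extraction mentioned in the text.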

therefore the central layer between both DBNs is the actual compressed feature vector.

c: CONVOLUTIONAL NN
Convolutional NNs (CNNs) are feedforward NNs in which neurons are adapted to respond to overlapping
regions in two-dimensional input fields such as visual or audio input. This is commonly achieved by
local sparse connections among successive layers and tied shared weights followed by rectifying and
pooling layers, which results in transformation invariant feature extraction. Another advantage of
CNNs over simple multilayer NNs is that they are comparatively easier to train due to sparsely
connected layers with the same number of hidden units. CNNs represent the most significant type of
architecture for computer vision as they solve two challenges with the conventional NNs: 1) scalable
and computationally tractable algorithms are needed for processing high-dimensional images; and 2)
algorithms should be transformation invariant since objects in an image can occur at an arbitrary
position. However, most CNNs are composed of supervised feature detectors in the lower and middle
hidden layers. In order to extract features in an unsupervised manner, a hybrid of CNN and DBN,
called Convolutional Deep Belief Network (CDBN), is proposed in [54]. Using probabilistic
max-pooling¹ to cover a larger input area and convolution as an inference algorithm makes this model
scalable with higher dimensional input. Learning is processed in an unsupervised manner as proposed
in [44], i.e., greedy layer-wise (lower to higher) training with unlabeled data.
CDBN is a promising scalable generative model for learning translation invariant hierarchical
representation from any high-dimensional unlabeled data in an unsupervised manner, taking advantage
of both worlds, i.e., DBN and CNN. CNNs, being widely employed for computer vision applications, can
be employed in computer networks for optimization of Quality of Experience (QoE) and Quality of
Service (QoS) of multimedia content delivery over networks, which is an open research problem for
next-generation computer networks [55].

4) UNSUPERVISED RECURRENT NN
Recurrent NN (RNN) is the most complex type of NN, and hence the nearest match to an actual human
brain that

[...] perspectives of RNN to be discussed in the scope of this survey, namely, the depth of the
architecture and the training of the network. The depth, in the case of a simple artificial NN, is
the presence of hierarchical nonlinear intermediate layers between the input and output signals. In
the case of an RNN, there are different hypotheses explaining the concept of depth. One hypothesis
suggests that RNNs are inherently deep in nature when expanded with respect to sequential input;
there are a series of nonlinear computations between the input at time t(i) and the output at time
t(i + k).
However, at an individual discrete time step, certain transitions are neither deep nor nonlinear.
There exist input-to-hidden, hidden-to-hidden, and hidden-to-output transitions, which are shallow
in the sense that there are no intermediate nonlinear layers at a discrete time step. In this
regard, different deep architectures are proposed in [56] that introduce intermediate nonlinear
transitional layers in between the input, hidden and output layers. Another novel approach is also
proposed by stacking hidden units to create a hierarchical representation of hidden units, which
mimics the deep nature of standard deep NNs.
Due to the inherently complex nature of RNNs, to the best of our knowledge, there is no widely
adopted approach for training RNNs, and many novel methods (both supervised and unsupervised) have
been introduced to train RNNs. Considering unsupervised learning of RNNs in the scope of this paper,
[57] employ a Long Short-term Memory (LSTM) RNN trained in an unsupervised manner using unsupervised
learning algorithms, namely Binary Information Gain Optimization and Nonparametric Entropy
Optimization, in order to make the network discriminate between a set of temporal sequences and
cluster them into groups. Results have shown a remarkable ability of RNNs for learning temporal
sequences and clustering them based on a variety of features. Two major types of unsupervised
recurrent NN are the Hopfield NN and the Boltzmann machine.

a: HOPFIELD NN
Hopfield NN is a cyclic recurrent NN where each node is connected to others. Hopfield NN provides an
abstraction of circular shift register memory with nonlinear activation functions to form a global
energy function with guaranteed convergence to local minima. Hopfield NNs are used for
processes sequential inputs. It can learn temporal behaviors finding clusters in the data without a supervisor.
of a given training data. RNN employs an internal memory
per neuron to process such sequential inputs in order to b: BOLTZMANN MACHINE
exhibit the effect of the previous event on the next. Compared The Boltzmann machine is a stochastic symmetric recur-
to feed forward NNs, RNN is a stateful network. It may rent NN that is used for search and learning problems.
contain computational cycles among states and uses time Due to binary vector based simple learning algorithm of
as the parameter in the transition function from one unit to Boltzmann machine, very interesting features representing
another. Being complex and recently developed, it is an open the complex unstructured data can be learned [58]. Since
research problem to create domain-specific RNN models and the Boltzmann machine uses multiple hidden layers as fea-
train them with sequential data. Specifically, there are two ture detectors, the learning algorithm becomes very slow.
To avoid slow learning and to achieve faster feature detection
1 Max-pooling is an algorithm of selecting the most responsive receptive instead of Boltzmann machine, a faster version, namely the
field of a given interest region. restricted Boltzmann machine (RBM), is used for practical


FIGURE 5. Clustering process.

problems [59]. The restricted Boltzmann machine learns a probability distribution over its input data, but since it is restricted in its layer-to-layer connectivity, RBM loses its property of recurrence. It is faster than a Boltzmann machine because it uses only one hidden layer as a feature detector layer. RBM is used for dimensionality reduction, clustering and feature learning in computer networks.

5) SIGNIFICANT APPLICATIONS OF HIERARCHICAL LEARNING IN NETWORKS
ANNs/DNNs are the most researched topic when creating intelligent systems in computer vision and natural language processing. Although their application in computer networks is still limited, they are employed in different networking applications such as classification of traffic, anomaly/intrusion detection, detecting Distributed Denial of Service (DDoS) attacks, and resource management in cognitive radios [60]. The motivation for using DNN for learning and predicting in networks is the unsupervised training that detects hidden patterns in large amounts of data, for which it is nearly impossible for a human to handcraft features catering for all scenarios. Moreover, recent research shows that a single model is not enough for the needs of some applications, so a hybrid NN architecture that combines the strengths of different models can provide even better results. Such an approach is used in [61], in which a hybrid model of ART and RNN is employed to learn and predict traffic volume in a computer network in real time. Real-time prediction is essential to adaptive flow control, which is achieved by using hybrid techniques so that ART can learn new input patterns without re-training the entire network and the RNN can predict the time series accurately. Furthermore, DNNs are also being used in resource allocation and QoE/QoS optimizations. Using NN for optimization, efficient resource allocation without affecting the user experience can be crucial when resources are scarce. Authors of [62], [63] propose a simple DBN for optimizing multimedia content delivery over wireless networks by keeping QoE optimal for end users. Table 3 also provides a tabulated description of hierarchical learning in networking applications. However, these are just a few notable examples of deep learning and neural networks in networks; refer to Section III for more applications and detailed discussion on deep learning and neural networks in computer networks.

B. DATA CLUSTERING
Clustering is an unsupervised learning task that aims to find hidden patterns in unlabeled input data in the form of clusters [64]. Simply put, it encompasses the arrangement of data in meaningful natural groupings on the basis of the similarity between different features (as illustrated in Figure 5) to learn about its structure. Clustering involves the organization of data in such a way that there is high intra-cluster and low inter-cluster similarity. The resulting structured data is termed data-concept [65]. Clustering is used in numerous applications from the fields of ML, data mining, network analysis, pattern recognition, and computer vision. The various techniques used for data clustering are described in more detail later in Section II-B. In networking, clustering techniques are widely deployed for applications such as traffic analysis and anomaly detection in all kinds of networks (e.g., wireless sensor networks and mobile ad-hoc networks) [66].
Clustering improves performance in various applications. McGregor et al. [67] propose an efficient packet tracing approach using the Expectation-Maximization (EM) probabilistic clustering algorithm, which groups flows (packets) into a small number of clusters, where the goal is to analyze network traffic using a set of representative clusters.
A brief overview of different types of clustering methods and their relationships can be seen in Figure 6. Clustering can be divided into three main types [68], namely hierarchical clustering, Bayesian clustering, and partitional clustering. Hierarchical clustering creates a hierarchical decomposition of data, whereas Bayesian clustering forms a probabilistic model of the data that decides the fate of a new test point probabilistically. In contrast, partitional clustering constructs multiple partitions and evaluates them on the basis of a certain criterion or characteristic such as the Euclidean distance.
Before delving into the general sub-types of clustering, two distinct clustering techniques need to be discussed, namely density-based clustering and grid-based clustering. In some cases, density-based clustering is classified as a partitional clustering technique; however, we have kept it separate considering its applications in networking. Density-based models target the most densely populated areas of the data space and separate them from areas having low densities, thus forming clusters [69]. [70] use density-based clustering to cluster data streams in real time, which is important in many

TABLE 3. Applications of hierarchical learning/deep learning in networking applications.

FIGURE 6. Clustering taxonomy.
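To make the hierarchical branch of the taxonomy in Figure 6 concrete, the following is a minimal sketch of agglomerative (bottom-up) clustering with single linkage. It is not taken from the surveyed works: the function names `agglomerate` and `cut`, the toy points, and the choice of single linkage are all assumptions made for illustration. The naive search over all cluster pairs is deliberately simple and is more expensive than optimized implementations.

```python
import math
from itertools import combinations

def agglomerate(points):
    """Return the merge history of single-linkage agglomerative clustering.

    Each entry is (members_a, members_b, distance); no number of clusters
    has to be chosen before running the algorithm.
    """
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = distance of the closest pair of members.
        (a, b), d = min(
            (((a, b), min(math.dist(points[i], points[j])
                          for i in clusters[a] for j in clusters[b]))
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
    return merges

def cut(n, merges, max_dist):
    """Flat labels obtained by applying only merges with distance <= max_dist."""
    label = list(range(n))
    for members_a, members_b, d in merges:
        if d > max_dist:  # single-linkage merge distances never decrease
            break
        for i in members_b:
            label[i] = label[members_a[0]]
    return label

# Hypothetical 2-D feature vectors (e.g., two per-flow statistics).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (25, 25)]
merges = agglomerate(points)
labels_fine = cut(len(points), merges, max_dist=2.0)
labels_coarse = cut(len(points), merges, max_dist=15.0)
```

The number of clusters is decided only when the hierarchy is cut: the same merge history yields three groups at a small distance threshold and two groups at a larger one.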

applications (e.g., intrusion detection in networks). Another technique is grid-based clustering, which divides the data space into cells to form a grid-like structure; subsequently, all clustering actions are performed on this grid [71]. [71] also present a novel approach that uses a customized grid-based clustering algorithm to detect anomalies in networks. [72] proposed a novel method for clustering time series data; this scheme is based on a distance measure between temporal features of the time series.
We move on next to describe three major types of data clustering approaches as per the taxonomy shown in Figure 6.

1) HIERARCHICAL CLUSTERING
Hierarchical clustering is a well-known strategy in data mining and statistical analysis in which data is clustered into a hierarchy of clusters using an agglomerative (bottom-up) or a divisive (top-down) approach. Almost all hierarchical clustering algorithms are unsupervised and deterministic. The primary advantage of hierarchical clustering over unsupervised K-means and EM algorithms is that it does not require the number of clusters to be specified beforehand. However, this advantage comes at the cost of computational efficiency. Common hierarchical clustering algorithms have at least


quadratic computational complexity compared to the linear complexity of K-means and EM algorithms. Hierarchical clustering methods have a pitfall: they may fail to accurately classify messy high-dimensional data, as their heuristics may break down due to the structural imperfections of empirical data. Furthermore, the computational complexity of the common agglomerative hierarchical algorithms is NP-hard. SOM, as discussed in Section II-A.2, is a modern approach that can overcome the shortcomings of hierarchical models [73].

2) BAYESIAN CLUSTERING
Bayesian clustering is a probabilistic clustering strategy where the posterior distribution of the data is learned on the basis of a prior probability distribution. Bayesian clustering is divided into two major categories, namely parametric and non-parametric [74]. The major difference between parametric and non-parametric techniques is the dimensionality of the parameter space: if there are finite dimensions in the parameter space, the underlying technique is called Bayesian parametric; otherwise, the underlying technique is called Bayesian non-parametric. A major pitfall with the Bayesian clustering approach is that the choice of the wrong prior probability distributions can distort the projection of the data. [75] performed Bayesian non-parametric clustering of network traffic data to determine the network application type.

3) PARTITIONAL CLUSTERING
Partitional clustering corresponds to a special class of clustering algorithms that decomposes data into a set of disjoint clusters. Given n observations, the clustering algorithm partitions the data into k < n clusters [76]. Partitional clustering is further classified into K-means clustering and mixture models.

a: K-MEANS CLUSTERING
K-means clustering is a simple, yet widely used approach for classification. It takes a statistical vector as an input to deduce classification models or classifiers. K-means clustering distributes n observations into k clusters where each observation belongs to the nearest cluster. The membership of an observation to a cluster is determined using the cluster mean. K-means clustering is used in numerous applications in the domains of network analysis and traffic classification. [77] used K-means clustering in conjunction with supervised ID3 decision tree learning models to detect anomalies in a network. An ID3 decision tree is an iterative supervised decision tree algorithm based on the concept learning system. K-means clustering provided excellent results when used in traffic classification. [78] showed that K-means clustering performs well in traffic classification with an accuracy of 90%.
K-means clustering is also used in the domain of network security and intrusion detection. Reference [79] proposed a K-means algorithm for intrusion detection. Experimental results on a subset of the KDD-99 dataset show that the detection rate stays above 96% while the false alarm rate stays below 4%. Results and analysis of experiments on the K-means algorithm have demonstrated a better ability to search clusters globally.
Another variation of K-means is known as K-medoids, in which, rather than taking the mean of the clusters, the most centrally located data point of a cluster is considered as the reference point of the corresponding cluster [80]. A few applications of K-medoids in anomaly detection can be found in [80], [81].

b: MIXTURE MODELS
Mixture models are powerful probabilistic models for univariate and multivariate data. Mixture models are used to make statistical inferences and deductions about the properties of the sub-populations given only observations on the pooled population. They have also been used to statistically model data in the domains of pattern recognition, computer vision, ML, etc. Finite mixtures, which are a basic type of mixture model, naturally model observations that are produced by a set of alternative random sources. Inferring and deducing different parameters from these sources based on their respective observations leads to clustering of the set of observations. This approach to clustering tackles drawbacks of heuristic-based clustering methods, and hence it has proven to be an efficient method for node classification in large-scale networks and has been shown to yield effective results compared to commonly used techniques. For instance, K-means and hierarchical agglomerative methods rely on supervised design decisions, such as the number of clusters or validity of models [82]. Moreover, combining the EM algorithm with mixture models produces remarkable results in deciphering the structure and topology of the vertices connected through a multi-dimensional network [83]. Reference [84] used a Gaussian mixture model (GMM) to outperform signature-based anomaly detection in network traffic data.

4) SIGNIFICANT APPLICATIONS OF CLUSTERING IN NETWORKS
Clustering can be found in almost all unsupervised learning problems, and there are diverse applications of clustering in the domain of computer networks. Two major networking applications where significant use of clustering can be seen are intrusion detection and Internet traffic classification. A novel way to detect anomalies is proposed in [85]; this approach preprocesses the data using a Genetic Algorithm (GA) combined with a hierarchical clustering approach called Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) to provide an efficient classifier based on Support Vector Machine (SVM). This hierarchical clustering approach stores abstracted data points instead of the whole dataset, thus giving more accurate and quicker classification compared to earlier methods and producing better results in detecting anomalies. Another approach [71] discusses the use of grid-based and density-based clustering for anomaly and intrusion detection using unsupervised learning. Reference [86] used a k-shape clustering scheme for analyzing spatiotemporal heterogeneity in mobile usage.
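The K-means procedure described in Section II-B.3.a (assign each observation to the cluster with the nearest mean, then recompute the means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited works: the function name `kmeans`, the random initialization from data points, and the toy two-feature "flow" vectors are hypothetical.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style K-means: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize means with k distinct observations
    assignment = None
    for _ in range(iters):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:  # memberships stable: converged
            break
        assignment = new_assignment
        # Update step: each mean becomes the average of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Hypothetical per-flow features, e.g. (scaled packet size, scaled inter-arrival time).
flows = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids, labels = kmeans(flows, k=2)
```

Unlike the hierarchical methods, the number of clusters `k` must be supplied up front, which is the design decision the text refers to.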


TABLE 4. Applications of data clustering in networking applications.

Basically, a scalable parallel framework for clustering large datasets with high dimensions is proposed and then improved by incorporating frequency pattern trees. Table 4 also provides a tabulated description of data clustering applications in networks. These are just a few notable examples of clustering approaches in networks: refer to Section III for the detailed discussion on some salient clustering applications in the context of networks.

C. LATENT VARIABLE MODELS
A latent variable model is a statistical model that relates the manifest variables with a set of latent or hidden variables. A latent variable model allows us to express relatively complex distributions in terms of tractable joint distributions over an expanded variable space [95]. In such models, the underlying variables of a process are mapped into a higher-dimensional space through a fixed transformation plus stochastic variation, so that the distribution in the higher dimension is due to a small number of hidden variables acting in combination [96]. These models are used for data visualization, dimensionality reduction, optimization, distribution learning, blind signal separation and factor analysis. Next we will begin our discussion on various latent variable models, namely mixture distribution, factor analysis, blind signal separation, non-negative matrix factorization, Bayesian networks & probabilistic graph models (PGM), hidden Markov model (HMM), and nonlinear dimensionality reduction techniques (which further include generative topographic mapping, multi-dimensional scaling, principal curves, Isomap, locally linear embedding, and t-distributed stochastic neighbor embedding).

1) MIXTURE DISTRIBUTION
Mixture distribution is an important latent variable model that is used for estimating the underlying density function. Mixture distribution provides a general framework for density estimation by using simpler parametric distributions. The expectation maximization (EM) algorithm is used for estimating the mixture distribution model [97], through maximization of the log-likelihood of the mixture distribution model.

2) FACTOR ANALYSIS
Another important type of latent variable model is factor analysis, which is a density estimation model. It has been used quite often in collaborative filtering and dimensionality reduction. It differs from other latent variable models in terms of the allowed variance for different dimensions, as most latent variable models for dimensionality reduction in conventional settings use a fixed-variance Gaussian noise model. In the factor analysis model, the noise has a diagonal covariance rather than an isotropic covariance.

3) BLIND SIGNAL SEPARATION
Blind Signal Separation (BSS), also referred to as Blind Source Separation, is the identification and separation of independent source signals from mixed input signals with little or no information about the mixing process. Figure 7 depicts the basic BSS process in which source signals are extracted from a mixture of signals. It is a fundamental and challenging problem in the domain of signal processing, although the concept is extensively used in all types of multi-dimensional data processing. The most common techniques employed for BSS are principal component analysis (PCA) and independent component analysis (ICA).
a) Principal Component Analysis (PCA) is a statistical procedure that utilizes an orthogonal transformation on the data to convert n possibly correlated variables into a smaller number k of uncorrelated variables named


FIGURE 7. Blind signal separation (BSS): A mixed signal composed of various input signals mixed by some
mixing process is blindly processed (i.e., with no or minimal information about the mixing process) to
show the original signals.
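The BSS process of Figure 7 can be illustrated with a small two-source sketch: the mixtures are first decorrelated and scaled (a PCA-style whitening step), and then rotated to the angle that makes the outputs most non-Gaussian, measured by excess kurtosis (a fourth-moment statistic). This is a brute-force illustration only, not FastICA or the EF-ICA variant cited in the text; the function names, the mixing coefficients, and the two synthetic sources are assumptions chosen for the example.

```python
import math
import random

def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def corr(a, b):
    """Sample correlation coefficient of two signals."""
    a, b = center(a), center(b)
    return sum(u * v for u, v in zip(a, b)) / math.sqrt(
        sum(u * u for u in a) * sum(v * v for v in b))

def rotate(a, b, angle):
    c, s = math.cos(angle), math.sin(angle)
    return ([c * u + s * v for u, v in zip(a, b)],
            [-s * u + c * v for u, v in zip(a, b)])

def whiten(x1, x2):
    """PCA step: rotate onto the principal axes, then scale to unit variance."""
    x1, x2 = center(x1), center(x2)
    n = len(x1)
    var1 = sum(u * u for u in x1) / n
    var2 = sum(v * v for v in x2) / n
    cov = sum(u * v for u, v in zip(x1, x2)) / n
    # Angle that diagonalizes the 2x2 covariance matrix [[var1, cov], [cov, var2]].
    p1, p2 = rotate(x1, x2, 0.5 * math.atan2(2 * cov, var1 - var2))
    s1 = math.sqrt(sum(u * u for u in p1) / n)
    s2 = math.sqrt(sum(v * v for v in p2) / n)
    return [u / s1 for u in p1], [v / s2 for v in p2]

def excess_kurtosis(y):
    """Fourth-moment non-Gaussianity measure of a zero-mean signal."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0

def ica_2d(x1, x2, steps=90):
    """ICA step: search the rotation of the whitened data that maximizes non-Gaussianity."""
    z1, z2 = whiten(x1, x2)

    def score(k):
        y1, y2 = rotate(z1, z2, 0.5 * math.pi * k / steps)
        return abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))

    best = max(range(steps), key=score)
    return rotate(z1, z2, 0.5 * math.pi * best / steps)

# Two synthetic, independent, non-Gaussian sources (hypothetical signals).
rng = random.Random(0)
n = 2000
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # uniform noise-like source
s2 = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # binary symbol stream
# Unknown-to-the-algorithm linear mixing process.
x1 = [0.6 * a + 0.4 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 0.7 * b for a, b in zip(s1, s2)]
y1, y2 = ica_2d(x1, x2)
```

As is usual for BSS, the sources are recovered only up to ordering, sign, and scale, so each output should be compared with the sources through the absolute value of the correlation.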

principal components. Principal components are arranged in descending order of their variability, the first one accounting for the most variability and the last one for the least. Being a primary technique for exploratory data analysis, PCA takes a cloud of data in n dimensions and rotates it such that maximum variability in the data is visible. In doing so, it brings out the strong patterns in the dataset so that these patterns are more recognizable, thereby making the data easier to explore and visualize.
PCA has primarily been used for dimensionality reduction in which input data of n dimensions is reduced to k dimensions without losing critical information in the data. The choice of the number of principal components is a design decision. Much research has been conducted on selecting the number of components, such as cross-validation approximations [98]. A common rule is to choose k such that the ratio of the average squared projection error to the total variation in the data is less than or equal to 1%, by which 99% of the variance is retained in the k principal components. But, depending on the application domain, different designs can increase/decrease the ratio while maximizing the required output. Commonly, many features of a dataset are highly correlated; hence, PCA results in retaining 99% of the variance while significantly reducing the data dimensions.
b) Independent Component Analysis (ICA) is another technique for BSS that focuses on separating multivariate input data into additive components with the underlying assumption that the components are non-Gaussian and statistically independent. The most common example to understand ICA is the cocktail party problem in which there are n people talking simultaneously in a room and one tries to listen to a single voice. ICA separates source signals from the input mixed signal by either minimizing the statistical dependence or maximizing the non-Gaussian property among the components in the input signals while keeping the underlying assumptions valid. Statistically, ICA can be seen as an extension of PCA: while PCA tries to maximize the second moment (variance) of the data, hence relying heavily on Gaussian features, ICA exploits inherently non-Gaussian features of the data and tries to maximize the fourth moment of a linear combination of inputs to extract non-normal source components in the data [99].

4) NON-NEGATIVE MATRIX FACTORIZATION
Non-Negative Matrix Factorization (NMF) is a technique to factorize a large matrix into two or more smaller matrices with no negative values that, when multiplied, approximately reconstruct the original matrix. NMF is a novel method for decomposing multivariate data, making exploratory analysis easy and straightforward. With NMF, hidden patterns and intrinsic features within the data can be identified by decomposing them into smaller chunks under non-negativity constraints, enhancing the interpretability of data for analysis. However, there exist many classes of algorithms [100] for NMF having different generalization properties; for example, two of them are analyzed in [101], one of which minimizes the least square error while the other focuses on the Kullback-Leibler divergence, keeping algorithm convergence intact.

5) HIDDEN MARKOV MODEL
Hidden Markov Models (HMM) are stochastic models of great utility, especially in domains where we wish to analyze temporal or dynamic processes such as speech recognition, primary user (PU) arrival patterns in cognitive radio networks (CRNs), etc. HMMs are highly relevant to CRNs since many environmental parameters in CRNs are not directly observable. An HMM-based approach can analytically model a Markovian stochastic process in which we do not have access to the actual states, which are assumed to be unobserved or hidden; instead, we can observe a state that is stochastically dependent on the hidden state. It is for this reason that an HMM is defined to be a doubly stochastic process.

6) BAYESIAN NETWORKS & PROBABILISTIC GRAPH MODELS (PGM)
In Bayesian learning we try to find the posterior probability distributions for all parameter settings; in this setup, we ensure that we have a posterior probability for every possible parameter setting. It is computationally expensive, but we can use complicated models with a small dataset and still avoid overfitting. Posterior probabilities are calculated by dividing the product of the sampling distribution and the prior distribution by the marginal likelihood; in simple words, posterior


probabilities are calculated using Bayes theorem. The basis of reinforcement learning was also derived by using Bayes theorem [102]. Since Bayesian learning is computationally expensive, a new research trend is approximate Bayesian learning [103]. Authors in [104] have given a comprehensive survey of different approximate Bayesian inference algorithms. With the emergence of Bayesian deep learning frameworks, the deployment of Bayesian-learning-based solutions is increasing rapidly.
Probabilistic graph modeling is a concept associated with Bayesian learning. A model representing the probabilistic relationship between random variables through a graph is known as a probabilistic graph model (PGM). Nodes and edges in the graph represent random variables and their probabilistic dependence, respectively. PGMs are of two types: directed PGM and undirected PGM. Bayes networks also fall in the regime of directed PGM. PGM is used in many important areas such as computer vision, speech processing, and communication systems. Bayesian learning combined with PGM and latent variable models forms a probabilistic framework where deep learning is used as a substrate for making improved learning architectures for recommender systems, topic modeling, and control systems [105].

7) SIGNIFICANT APPLICATIONS OF LATENT VARIABLE MODELS IN NETWORKS
In [106], authors have applied a latent structure model to an email corpus to find interpretable latent structure as well as to evaluate its predictive accuracy on a missing data task. A dynamic latent model for a social network is presented in [107]. Characterization of the end-to-end delay using a Weibull mixture model is discussed in [108]. Mixture models for end host traffic analysis have been explored in [109]. BSS is a set of statistical algorithms that are widely used in different application domains to perform different tasks such as dimensionality reduction, correlating and mapping features, etc. [110] employed PCA for Internet traffic classification in order to separate different types of flows in a network packet stream. Similarly, authors of [111] used a semi-supervised approach, where PCA is used for feature learning and an SVM classifier for intrusion detection in an autonomous network system. Another approach for detecting anomalies and intrusions proposed in [112] uses NMF to factorize different flow features and cluster them accordingly. Furthermore, ICA has been widely used in telecommunication networks to separate mixed and noisy source signals for efficient service. For example, [113] extends a variant of ICA called Efficient Fast ICA (EF-ICA) for detecting and estimating the symbol signals from the mixed CDMA signals received from the source endpoint.
Elsewhere, PCA is used with a probabilistic approach to find the degree of confidence in detecting an anomaly in wireless networks [114]. Furthermore, PCA is also chosen as a method of clustering and designing Wireless Sensor Networks (WSNs) with multiple sink nodes [115]. However, these are just a few notable examples of BSS in networks; refer to Section III for more applications and detailed discussion on BSS techniques in the networking domain.
Bayesian learning has been applied for classifying Internet traffic, where Internet traffic is classified based on the posterior probability distributions. For early traffic identification in a campus network, real discretized conditional probability has been used to construct a Bayesian classifier [116]. Host-level intrusion detection using Bayesian networks is proposed in [117]. Authors in [118] proposed a Bayesian-learning-based feature vector selection for anomaly classification in BGP. A port scan attack prevention scheme using a Bayesian learning approach is discussed in [119]. An Internet threat detection estimation system is presented in [120]. A new approach towards outlier detection using Bayesian belief networks is described in [121]. Application of Bayesian networks in MIMO systems has been explored in [122]. Location estimation using a Bayesian network in LAN is discussed in [123]. Similarly, Bayes theory and PGM are both used in Low-Density Parity Check (LDPC) and Turbo codes, which are the fundamental components of information coding theory. Table 5 also provides a tabulated description of latent variable model applications in networking.

D. DIMENSIONALITY REDUCTION
Representing data in fewer dimensions is another well-established task of unsupervised learning. Real-world data often have high dimensions; in many datasets, these dimensions can run into thousands, even millions, of potentially correlated dimensions [133]. However, it is observed that the intrinsic dimensionality (governing parameters) of the data is less than the total number of dimensions. In order to find the essential pattern of the underlying data by extracting intrinsic dimensions, it is necessary that the real essence is not lost; e.g., it may be the case that a phenomenon is observable only in higher-dimensional data and is suppressed in lower dimensions; such phenomena are said to suffer from the curse of dimensionality [134]. While dimensionality reduction is sometimes used interchangeably with feature selection [135], [136], a subtle difference exists between the two [137]. Feature selection is traditionally performed as a supervised task with a domain expert helping in handcrafting a set of critical features of the data. Such an approach generally can perform well but is not scalable and is prone to judgment bias. Dimensionality reduction, on the other hand, is more generally an unsupervised task, where instead of choosing a subset of features, it creates new features (dimensions) as a function of all features. Said differently, feature selection considers supervised data labels, while dimensionality reduction focuses on the data points and their distributions in an N-dimensional space.
There exist different techniques for reducing data dimensions [138] including projection of higher dimensional points onto lower dimensions, independent representation, and sparse representation, which should be capable of reconstructing the approximate data. Dimensionality reduction is useful for data modeling, compression, and visualization.
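For a linear projection such as PCA, the requirement that the reduced representation be able to reconstruct the approximate data can be made concrete with the retained-variance rule mentioned in Section II-C.3. As a sketch, assume m mean-centered samples x^{(i)}, their reconstructions \hat{x}^{(i)} from the first k principal components, and covariance eigenvalues \lambda_1 \ge \dots \ge \lambda_n; this notation is introduced here for illustration and is not taken from the original text. The 1% criterion then reads:

```latex
\frac{\frac{1}{m}\sum_{i=1}^{m}\bigl\lVert x^{(i)}-\hat{x}^{(i)}\bigr\rVert^{2}}
     {\frac{1}{m}\sum_{i=1}^{m}\bigl\lVert x^{(i)}\bigr\rVert^{2}}
\;=\; 1-\frac{\sum_{j=1}^{k}\lambda_{j}}{\sum_{j=1}^{n}\lambda_{j}}
\;\le\; 0.01
```

The left-hand ratio is the average squared projection error divided by the total variation, and the middle expression shows that it equals the fraction of variance carried by the discarded components, so choosing the smallest k that satisfies the inequality retains at least 99% of the variance.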


TABLE 5. Applications of latent variable models in networking applications.

By creating representative functional dimensions of the data and eliminating redundant ones, it becomes easier to visualize the data and form a learning model. Independent representation tries to disconnect the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent [40]. Sparse representation techniques represent the data vectors as linear combinations of a small number of basis vectors.

It is worth noting here that many of the latent variable models (e.g., PCA, ICA, factor analysis) also function as techniques for dimensionality reduction. In addition to techniques such as PCA and ICA, which infer the latent inherent structure of the data through a linear projection of the data, a number of nonlinear dimensionality reduction techniques have also been developed. This section focuses on these nonlinear techniques to avoid repeating the linear dimensionality reduction techniques that have already been covered in the previous subsection. Linear dimensionality reduction techniques are useful in many settings, but these methods may miss important nonlinear structure in the data due to their subspace assumption, which posits that the high-dimensional data points lie on a linear subspace (for example, on a 2-D or 3-D plane). Such an assumption fails in high dimensions when data points are random but highly correlated with neighbors. In such environments, nonlinear dimensionality reduction through manifold learning techniques, which can be construed as an attempt to generalize linear frameworks like PCA so that nonlinear structure in data can also be recognized, becomes desirable.

Even though some supervised variants also exist, manifold learning is mostly performed in an unsupervised fashion: the nonlinear manifold substructure is learned from the high-dimensional structure of the data itself, without the use of any predetermined classifier or labeled data. Some nonlinear dimensionality reduction (manifold learning) techniques are described below:

1) ISOMAP
Isomap is a nonlinear dimensionality reduction technique that finds the underlying low dimensional geometric information about a dataset. Algorithmic features of PCA and MDS are combined to learn the low dimensional nonlinear manifold structure in the data [139]. Isomap uses geodesic distances, approximated by shortest paths over a neighborhood graph (which can be computed using Dijkstra's algorithm), to calculate the low dimensional representation.

2) GENERATIVE TOPOGRAPHIC MODEL
Generative topographic mapping (GTM) represents the nonlinear latent variable mapping from continuous low dimensional distributions embedded in high dimensional spaces [140]. Data space in GTM is represented as reference vectors, and these vectors are a projection of latent points in data space. It is a probabilistic variant of SOM and works by calculating the Euclidean distance between data points. GTM optimizes the log-likelihood function, and the resulting probability defines the density in data space.
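The geodesic-distance step that Isomap relies on can be illustrated with a minimal sketch. This is not the implementation from [139]; the helper names `knn_graph` and `geodesic_distances`, the toy half-circle data, and the choice k = 2 are all assumptions made for illustration. The sketch only covers building a neighborhood graph and running Dijkstra's algorithm on it; the final embedding step (MDS on the geodesic distances) is omitted.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (Euclidean), symmetrically."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra's algorithm: shortest-path length from source to every node."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy data (hypothetical): 21 points on a half circle, a 1-D manifold in 2-D.
points = [(math.cos(i * math.pi / 20), math.sin(i * math.pi / 20)) for i in range(21)]
graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, source=0)
euclid = math.dist(points[0], points[-1])
```

In this toy setting the graph distance between the two end points follows the arc and is therefore close to the arc length, whereas the straight-line Euclidean distance cuts across the half circle. This difference is what lets a geodesic-based method recover structure that a linear projection would miss.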


TABLE 6. Applications of dimensionality reduction in networking applications.

3) LOCALLY LINEAR EMBEDDING
Locally linear embedding (LLE) [133] is an unsupervised nonlinear dimensionality reduction algorithm. LLE represents data in lower dimensions while preserving the higher dimensional embedding. LLE depicts data in a single global coordinate system of a lower dimensional mapping of the input data. LLE is used for visualizing multi-dimensional manifolds and for feature extraction.

4) PRINCIPAL CURVES
The principal curve is a nonlinear dataset summarizing technique in which non-parametric curves pass through the middle of a multi-dimensional dataset, providing a summary of the dataset [141]. These smooth curves minimize the average squared orthogonal distance between the data points and the curve; this process also resembles maximum likelihood estimation for nonlinear regression in the presence of Gaussian noise [142].

5) NONLINEAR MULTI-DIMENSIONAL SCALING
Nonlinear multi-dimensional scaling (NMDS) [143] is a nonlinear latent variable representation scheme. It works as an alternative scheme to factor analysis. In factor analysis, a multivariate normal distribution is assumed and similarities between different objects are expressed as a correlation matrix. NMDS, in contrast, does not impose such a condition; it is designed to reach the optimal low dimensional configuration in which similarities and dissimilarities among matrices can be observed. NMDS is also used in data visualization and mining tools for depicting multi-dimensional data in 3 dimensions based on the similarities in the distance matrix.

6) T-DISTRIBUTED STOCHASTIC NEIGHBOR EMBEDDING
t-distributed stochastic neighbor embedding (t-SNE) is another nonlinear dimensionality reduction scheme. It is used to represent high dimensional data in 2 or 3 dimensions. t-SNE constructs a probability distribution in the high dimensional space, constructs a similar distribution in lower dimensions, and minimizes the Kullback-Leibler (KL) divergence between the two distributions (which is a useful way to measure the difference between two probability distributions) [144].

Table 6 also provides a tabulated description of dimensionality reduction applications in networking. The applications of nonlinear dimensionality reduction methods are later described in detail in Section III-D.

E. OUTLIER DETECTION
Outlier detection is an important application of unsupervised learning. A sample point that is distant from other samples is called an outlier. An outlier may occur due to noise, measurement error, heavy-tailed distributions, or a mixture of two distributions. There are two popular underlying techniques for unsupervised outlier detection upon which many algorithms are designed, namely the nearest neighbor based technique and the clustering based method.

1) NEAREST NEIGHBOR BASED OUTLIER DETECTION
The nearest neighbor method works by estimating the Euclidean distances or the average distance of every sample from all other samples in the dataset. There are many algorithms based on nearest neighbor techniques, with the most famous extension being the k-nearest neighbor technique, in which only the k nearest neighbors participate in the outlier detection [154]. Local outlier factor is another outlier detection algorithm, which works as an extension of the k-nearest neighbor algorithm. Connectivity-based outlier factors [155], influenced outlierness [156], and local outlier probability models [157] are a few famous examples of nearest neighbor based techniques.
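As a minimal sketch of the k-nearest neighbor idea for outlier detection, each sample can be scored by its average Euclidean distance to its k nearest neighbors, with larger scores suggesting outliers. The function name `knn_outlier_scores`, the toy feature values, and the choice k = 3 are assumptions for illustration only; this is not the specific algorithm of [154], nor an implementation of local outlier factor.

```python
import math


def knn_outlier_scores(samples, k):
    """Score each sample by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(samples):
        dists = sorted(math.dist(p, q) for j, q in enumerate(samples) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical, already-scaled flow features (e.g. packet size, inter-arrival time).
samples = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (0.8, 0.9),
    (6.0, 5.5),  # deliberately placed far away from the rest
]
scores = knn_outlier_scores(samples, k=3)

# Index of the sample with the highest score, i.e. the most likely outlier.
most_anomalous = max(range(len(samples)), key=scores.__getitem__)
```

A threshold on the score (or a fixed number of top-scoring samples) would then decide which samples are reported as outliers; how to choose it depends on the data and is left open here.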


2) CLUSTER BASED OUTLIER DETECTION
Clustering based methods use the conventional K-means clustering technique to find dense locations in the data and then perform density estimation on those clusters. After density estimation, a heuristic is used to classify each formed cluster according to its size. The anomaly score is computed by calculating the distance between every point and its cluster head. Local density cluster based outlier factor [158], clustering based multivariate Gaussian outlier score [159], [160], and histogram based outlier score [161] are the famous cluster based outlier detection models in the literature. SVM and PCA have also been suggested for outlier detection in the literature.

3) SIGNIFICANT APPLICATIONS OF OUTLIER DETECTION IN NETWORKS
Outlier detection algorithms are used in many different applications such as intrusion detection, fraud detection, data leakage prevention, surveillance, energy consumption anomalies, forensic analysis, critical state detection in designs, and electrocardiogram and computed tomography scans for tumor detection. Unsupervised anomaly detection is performed by estimating the distances and densities of the provided non-annotated data [162]. More applications of outlier detection schemes will be discussed in Section III.

F. LESSONS LEARNT
Key lessons drawn from the review of unsupervised learning techniques are summarized below.
1) Hierarchical learning techniques are the most popular schemes in the literature for feature detection and extraction.
2) Learning the joint distribution of a complex distribution over an expanded variable space is a difficult task. Latent variable models have been the recommended and well-established schemes in the literature for this problem. These models are also used for dimensionality reduction and better representation of data.
3) Visualization of unlabeled multidimensional data is another unsupervised task. In this research, we have explored dimensionality reduction as an underlying scheme for developing better multidimensional data visualization tools.

III. APPLICATIONS OF UNSUPERVISED LEARNING IN NETWORKING
In this section, we will introduce some significant applications of the unsupervised learning techniques that have been discussed in Section II in the context of computer networks. We highlight the broad spectrum of applications in networking and emphasize the importance of ML-based techniques, rather than classical hard-coded statistical methods, for achieving more efficiency, adaptability, and performance enhancement.

A. INTERNET TRAFFIC CLASSIFICATION
Internet traffic classification is of prime importance in networking as it provides a way to understand, develop and measure the Internet. Internet traffic classification is an important component for service providers to understand the characteristics of the service such as quality of service, quality of experience, user behavior, network security and many other key factors related to the overall structure of a network [163]. In this subsection, we will survey the unsupervised learning applications in network traffic classification.

As networks evolve at a rapid pace, malicious intruders are also evolving their strategies. Numerous novel hacking and intrusion techniques are regularly being introduced, causing severe financial jolts to companies and headaches to their administrators. Tackling these unknown intrusions through accurate traffic classification on the network edge, therefore, becomes a critical challenge and an important component of the network security domain. Initially, when networks used to be small, a simple port-based classification technique was used, which tried to identify the application associated with a packet based on its port number. However, this approach is now obsolete because recent malicious software uses a dynamic port-negotiation mechanism to bypass firewalls and security applications. A number of contrasting Internet traffic classification techniques have been proposed since then, and some important ones are discussed next.

Most of the modern traffic classification methods use different ML and clustering techniques to produce accurate clusters of packets depending on their applications, thus producing efficient packet classification [10]. The main purpose of classifying a network's traffic is to recognize the destination application of the corresponding packet and to control the flow of the traffic when needed, such as prioritizing one flow over others. Another important aspect of traffic classification is to detect intrusions and malicious attacks or screen out forbidden applications (packets).

The first step in classifying Internet traffic is selecting accurate features, which is an extremely important, yet complex task. Accurate feature selection helps ML algorithms to avoid problems like class imbalance, low efficiency, and low classification rate. There are three major feature selection methods in Internet traffic classification: the filter method, the wrapper based method, and the embedded method. These methods are based on different ML and genetic learning algorithms [164]. Two major concerns in feature selection for Internet traffic classification are the large size of data and imbalanced traffic classes. To deal with these issues and to ensure accurate feature selection, a min-max ensemble feature selection scheme is proposed in [165]. A new information-theoretic approach for feature selection for skewed datasets is described in [166]. This algorithm has resolved the multi-class imbalance issue but it does not resolve the issues of feature selection. In 2017, an unsupervised autoencoder based scheme outperformed previous feature learning schemes; autoencoders were used as a generative model and were trained in such a way that the bottleneck layer learned a latent representation of the feature set; these features were then used for malware classification


and anomaly detection to produce results that improved the state of the art in feature selection [30].

Much work has been done on classifying traffic based on supervised ML techniques. Initially, in 2004, the concept of clustering bi-directional flows of packets came out with the use of the EM probabilistic clustering algorithm, which clusters the flows depending on various attributes such as packet size statistics, inter-arrival statistics, byte counts, and connection duration, etc. [67]. Furthermore, clustering is combined with the above model [172]; this strategy uses Naïve Bayes clustering to classify traffic in an automated fashion. Recently, unsupervised ML techniques have also been introduced in the domain of network security for classifying traffic. Major developments include a hybrid model to classify traffic in a more unsupervised manner [173], which uses both labeled and unlabeled data to train the classifier, making it more durable and efficient. However, later on, completely unsupervised methods for traffic classification have been proposed, and much work is still going on in this area. Initially, a completely unsupervised approach for traffic classification was employed using the K-means clustering algorithm combined with log transformation to classify data into corresponding clusters. Then, [78] highlighted that using K-means and this method for traffic classification can improve accuracy by 10% to achieve an overall 90% accuracy.

Another improved and faster approach was proposed in 2006 [174], which examines the size of the first five packets and determines the application correctly using unsupervised learning techniques. This approach has been shown to produce better results than the state-of-the-art traffic classifier, and has also removed its drawbacks (such as dealing with outliers or unknown packets, etc.). Another similar automated traffic classifier and application identifier can be seen in [175], which uses the auto-class unsupervised Bayesian classifier, which automatically learns the inherent natural classes in a dataset.

In 2013, another novel strategy for traffic classification known as network traffic classification using correlation was proposed [167], which uses non-parametric NN combined with statistical measurement of correlation within data to efficiently classify traffic. The presented approach addressed the three major drawbacks of supervised and unsupervised learning classification models: firstly, they are inappropriate for sparse complex networks as labeling of training data takes too much computation and time; secondly, many supervised schemes such as SVM are not robust to training data size; and lastly, and most importantly, all supervised and unsupervised algorithms perform poorly if there are few training samples. Thus, classifying the traffic using correlations appears to be more efficient and adaptive. [176] compared four ANN approaches for computer network traffic, modeled the Internet traffic as a time series, and used mathematical methods to predict the time series. A greedy layer-wise training for unsupervised stacked autoencoder produced excellent classification results, but at the cost of significant system complexity. A genetic algorithm combined with a constraint clustering process is used for Internet traffic data characterization [177]. In another work, a two-phased ML approach for Internet traffic classification using K-means and a C5.0 decision tree is presented in [178], where the average accuracy of classification was 92.37%.

A new approach for Internet traffic classification was introduced in 2017 by [88], in which unidirectional and bidirectional information is extracted from the collected traffic, and K-means clustering is performed on the basis of statistical properties of the extracted flows. A supervised classifier then classifies these clusters. Another unsupervised learning based algorithm for Internet traffic detection is described in [179], where a restricted Boltzmann machine based SVM is proposed for traffic detection; this paper models the detection as a classification problem. Results were compared with ANN and decision tree algorithms on the basis of precision and recall. Application of deep learning algorithms in Internet traffic classification has been discussed in [16], with this work also outlining the open research challenges in applying deep learning for Internet traffic classification. These problems are related to training the models for big data (since Internet data for deep learning falls in the big data regime), optimization issues of the designed models given the uncertainty in Internet traffic, and scalability of deep learning architectures in Internet traffic classification. To cope with the challenges of developing a flexible high-performance platform that can capture data from a high-speed network operating at more than 60 Gbps, [180] have introduced a platform for high-speed packet to tuple sequence conversion which can significantly advance the state of the art in real-time network traffic classification. In another work, [181] used stacked autoencoders for Internet traffic classification and produced more than 90% accurate results for the two classes in the KDD 99 dataset.

A deep belief network combined with a Gaussian model, employed for Internet traffic prediction in a wireless mesh backbone network, has been shown to outperform the previous maximum likelihood estimation technique for traffic prediction [182]. Given the uncertainty of the WLAN channel, traffic classification is very tricky; [169] proposed a new variant of the Gaussian mixture model that incorporates a universal background model and used it for the first time to classify WLAN traffic. A brief overview of the different Internet traffic classification systems, classified on the basis of the unsupervised techniques and tasks discussed earlier, is presented in Table 7.

B. ANOMALY/INTRUSION DETECTION
The increasing use of networks in every domain has increased the risk of network intrusions, which makes user privacy and the security of critical data vulnerable to attacks. According to the annual computer crime and security survey 2005 [199], conducted by the combined teams of CSI (Computer Security Institute) and FBI (Federal Bureau of Investigation), total financial losses faced by companies due to security attacks and network intrusions were estimated as US $130 million.


TABLE 7. Internet traffic classification with respect to unsupervised learning techniques and tasks.

Moreover, according to the Symantec Internet Security Threat Report [200], approximately 5000 new vulnerabilities were identified in the year 2015. In addition, more than 400 million new variants of malware programs and 9 major breaches were detected, exposing 10 million identities. Therefore, insecurity in today's networking environment has given rise to the ever-evolving domain of network security and intrusion/anomaly detection [200].

In general, Intrusion Detection Systems (IDS) recognize or identify any act of security breach within a computer or a network; specifically, all requests which could compromise the confidentiality and availability of data or resources of a system or a particular network. Generally, intrusion detection systems can be categorized into three types: (1) signature-based intrusion detection systems; (2) anomaly detection systems; and (3) compound/hybrid detection systems, which include selective attributes of both preceding systems.

Signature detection, also known as misuse detection, is a technique that was initially used for tracing and identifying misuses of a user's important data, computer resources, and intrusions in the network based on the previously collected or stored signatures of intrusion attempts. The most important benefit of a signature-based system is that a computer administrator can exactly identify the type of attack a computer is currently experiencing based on the sequence of the packets defined by stored signatures. However, it is nearly impossible to maintain a signature database of all evolving possible attacks; this pitfall of the signature-based technique has given rise to anomaly detection systems.

An Anomaly Detection System (ADS) is a modern intrusion and anomaly detection system. Initially, it creates a baseline image of a system profile, its network and user program activity. Then, on the basis of this baseline image, ADS classifies any activity deviating from this behavior as an intrusion. A few benefits of this technique are: firstly, it is capable of detecting insider attacks such as using system resources through another user profile; secondly, each ADS is based on a customized user profile, which makes it very difficult for attackers to ascertain which types of attacks would not set off an alarm; and lastly, it detects unknown behavior in a computer system rather than detecting intrusions, and is thus capable of detecting any unknown sophisticated attack that differs from the users' usual behavior. However, these benefits come with a trade-off: the process of training a system on a user's 'normal' profile and maintaining those profiles is a time consuming and challenging task. If an inappropriate user profile is created, it can result in poor performance. Since ADS detects any behavior that does not align with a user's normal profile, its false alarm rate can be high. Lastly, another pitfall of ADS is that a malicious user can train ADS gradually to accept inappropriate traffic as normal.

As anomaly and intrusion detection have been a popular research area since the origin of networking and the Internet, numerous supervised as well as unsupervised [201] learning techniques have been applied to efficiently detect intrusions and malicious activities. However, the latest research focuses on the application of unsupervised learning techniques in this area due to the challenge and promise of using big data for optimizing networks.

Initial work focuses on the application of basic unsupervised clustering algorithms for detecting intrusions and anomalies. In 2005, an unsupervised approach was proposed based on density and grid-based clustering to accurately classify a high-dimensional dataset into a set of clusters; those points which do not fall in any cluster are marked as abnormal [71]. This approach produced good results but the false positive rate was very high. In follow-up work, another improved approach that used fuzzy rough C-means clustering was introduced [85], [195]. K-means clustering is another famous approach used for detecting anomalies


which was later proposed in 2009 [79] and which showed great accuracy and outperformed existing unsupervised methods. However, later in 2012, an improved method which used K-means clustering combined with the C4.5 decision tree algorithm was proposed [192] to produce more efficient results than prior approaches. [202] combines cluster centers and nearest neighbors for effective feature representation, which ensures better intrusion detection; however, a limitation of this approach is that it is not able to detect user to resource and remote to local attacks. Another scheme using an unsupervised learning approach for anomaly detection is presented in [203]. The presented scheme combines subspace clustering and correlation analysis to detect anomalies and provide protection against unknown anomalies; this experiment used WIDE backbone networks data [204] spanning over six years and produced better results than previous K-means based techniques. Work presented in [205] shows that for different intrusion schemes, there is a small set of measurements required to differentiate between normal and anomalous traffic; the authors used two co-clustering schemes to perform clustering and to determine which measurement subset contributed the most towards accurate detection.

Another famous approach for increasing detection accuracy is ensemble learning; work presented in [206] employed many hybrid incremental ML approaches with gradient boosting and ensemble learning to achieve better detection performance. Authors in [207] surveyed anomaly detection research from 2009 to 2014 and found a common algorithmic similarity for anomaly detection in Internet traffic. Most of the algorithms studied have the following similarities: 1) removal of redundant information in the training phase to ensure better learning performance; 2) feature selection, usually performed using unsupervised techniques, which increases the accuracy of detection; and 3) use of ensemble classifiers or hybrid classifiers rather than baseline algorithms to get better results. Authors in [208] have developed an artificial immune system based intrusion detection system; they have used density-based spatial clustering of applications with noise to develop an immune system against network intrusions.

The application of unsupervised intrusion detection in cloud networks is presented in [209], where the authors have proposed a fuzzy clustering ANN to detect the less frequent attacks and improve the detection stability in cloud networks. Another application of unsupervised intrusion detection systems for clouds is surveyed in [210], where a fuzzy logic based intrusion detection system using supervised and unsupervised ANN is proposed for intrusion detection; this approach is used for DoS and DDoS attacks where the scale of the attack is very large. Network intrusion anomaly detection systems (NIDS) based on K-means clustering are surveyed in [211]; this survey is unique in that it examines the distance and similarity measures used in intrusion detection, a perspective that had not been studied before 2015. Unsupervised learning based applications of anomaly detection schemes for wireless personal area networks, wireless sensor networks, cyber-physical systems, and WLANs are surveyed in [212].

Another paper [213] reviewing anomaly detection has presented the application of unsupervised SVM and clustering based approaches in network intrusion detection systems. An unsupervised discretization algorithm is used in a Bayesian network classifier for intrusion detection, which is based on Bayesian model averaging [214]; the authors show that the proposed algorithm performs better than the Naïve Bayes classifier in terms of accuracy on the NSL-KDD intrusion detection dataset. Border gateway protocol (BGP), the core Internet inter-autonomous systems (inter-AS) routing protocol, is also prone to intrusions and anomalies. To detect these BGP anomalies, many supervised and unsupervised ML solutions (such as hidden Markov models and principal component analysis) have been proposed in the literature [215]. Another problem for anomaly detection is low volume attacks, which have become a big challenge for network traffic anomaly detection. While long-range dependencies (LRD) are used to identify these low volume attacks, LRD usually works on aggregated traffic volume; but since the volume of traffic is low, the attacks can pass undetected. To accurately identify low volume abnormalities, [216] proposed the examination of LRD behavior of the control plane and data plane separately to identify low volume attacks.

Other than clustering, another widely used unsupervised technique for detecting malicious and abnormal behavior in networks is SOMs. The specialty of SOMs is that they can automatically organize a variety of inputs and deduce patterns among them, and subsequently determine whether a new input fits the deduced pattern or not, thus detecting abnormal inputs [184], [185]. SOMs have also been used in host-based intrusion detection systems in which intruders and abusers are identified at a host system through incoming data traffic [188]; later on, a more robust and efficient technique was proposed to analyze data patterns in TCP traffic [186]. Furthermore, complex NNs have also been applied to solve the same problem and remarkable results have been produced. A few examples include the application of ART combined with SOM [189]. The use of PCA can also be seen in detecting intrusions [197]. NMF has also been used for detecting intruders and abusers [112], and lastly, dimensionality reduction techniques have also been applied to eradicate intrusions and anomalies in the system [198]. For more applications, refer to Table 8, which classifies different network anomaly and intrusion detection systems on the basis of the unsupervised learning techniques discussed earlier.

C. NETWORK OPERATIONS, OPTIMIZATIONS, AND ANALYTICS
Network management comprises all the operations included in initializing, monitoring and managing a computer network based on its network functions, which are the primary requirements of the network operations. The general purpose of network management and monitoring systems is to ensure that basic network functions are fulfilled, and


TABLE 8. Anomaly & network intrusion detection systems (A-NIDS) with respect to unsupervised learning techniques.

if there is any malfunctioning in the network, it should be reported and addressed accordingly. Following is a summary of different network optimization tasks achieved through unsupervised learning models.

1) QOS/QOE OPTIMIZATION
QoS and QoE are measures of service performance and end-user experience, respectively. QoS mainly deals with the performance as seen by the user being measured quantitatively, while QoE is a qualitative measure of subjective metrics experienced by the user. QoS/QoE for Internet services (especially multimedia content delivery services) is crucial in order to maximize the user experience. With the dynamic and bursty nature of Internet traffic, computer networks should be able to adapt to these changes without compromising end-user experiences. As QoE is quite subjective, it heavily relies on the underlying QoS, which is affected by different network parameters. References [232] and [233] suggested different measurable factors to determine the overall approximation of QoS such as error rates, bit rate, throughput, transmission delay, availability, jitter, etc. Furthermore, these factors are used to correlate QoS with QoE from the perspective of video streaming, where QoE is essential to end-users.

The dynamic nature of the Internet dictates network design for different applications to maximize QoS/QoE, since there is no predefined adaptive algorithm that can be used to fulfill all the necessary requirements of a prospective application. Due to this fact, ML approaches are employed in order to adapt to the real-time network conditions and take measures to stabilize/maximize the user experience. Reference [234] employed a hybrid architecture combining unsupervised feature learning with supervised classification for QoE-based video admission control and resource management. Unsupervised feature learning in this system is carried out by using a fully connected NN comprising RBMs, which capture descriptive features of video that are later classified by using a supervised

classifier. Similarly, [235] presents an algorithm to estimate the Mean Opinion Score, a metric for measuring QoE, for VoIP services by using SOM to map quality metrics to features.

Moreover, research has shown that QoE-driven content optimization leads to the optimal utilization of the network. Reference [236] showed that 43% of the bit overhead on average can be reduced per image delivered on the web. This is achieved by using the quality metric VoQS (Variation of Quality Signature), which can arbitrarily compare two images in terms of web delivery performance. By applying this metric for unsupervised clustering of a large image dataset, multiple coherent groups are formed in a device-targeted and content-dependent manner. In another study [237], deep learning is used to assess the QoE of 3D images that have yet to show good results compared with the other deterministic algorithms. The outcome is a Reduced Reference QoE assessment process for automatic image assessment, and it has significant potential to be extended to work on 3D video assessment.

In [238], a unique technique of the model-based RL approach is applied to improve bandwidth availability, and hence throughput performance, of a network. The MRL model is embedded in a node that creates a model of the operating environment and uses it to generate virtual states and rewards for the virtual actions taken. As the agent does not need to wait for the real states and rewards from the operating environment, it can explore various kinds of actions on the virtual operating environment within a short period of time, which helps to expedite the learning process, and hence the convergence rate to the optimal action. In [239], a MARL approach is applied in which nodes exchange Q-values among themselves and select their respective next-hop nodes with the best possible channel conditions while forwarding packets towards the destination. This helps to improve throughput performance as nodes in a network ensure that packets are successfully sent to the destination in a collaborative manner.

2) TCP OPTIMIZATION
Transmission Control Protocol (TCP) is the core end-to-end protocol in the TCP/IP stack that provides reliable, ordered and error-free delivery of messages between two communicating hosts. Because TCP provides reliable and in-order delivery, congestion control is one of the major concerns of this protocol, which is commonly dealt with using the algorithms defined in RFC 5681. However, classical congestion control algorithms are sub-optimal in hybrid wired/wireless networks as they react to packet loss in the same manner in all network situations. In order to overcome this shortcoming of classical TCP congestion control algorithms, an ML-based approach is proposed in [240], which employs a supervised classifier based on learned features for classifying a packet loss as due to congestion or link errors. Other approaches to this problem currently employed in literature include using RL that uses a fuzzy logic based reward evaluator based on game theory [241]. Another promising approach, named Remy [242], uses a modified model of Markov decision process based on three factors: 1) prior knowledge about the network; 2) a traffic model based on user needs (i.e., throughput and delay); and 3) an objective function that is to be maximized. By this learning approach, a customized best-suited congestion control scheme is produced specifically for that part of the network, adapted to its unique requirements. However, classifying packet losses using unsupervised learning methods is still an open research problem and there is a need for a real-time adaptive congestion control mechanism for multi-modal hybrid networks.

For more applications, refer to Table 9, which classifies various network optimization and operation works on the basis of their network type and the unsupervised learning technique used.

D. DIMENSIONALITY REDUCTION & VISUALIZATION
Network data usually consists of multiple dimensions. To apply machine learning techniques effectively, the number of variables needs to be reduced. Dimensionality reduction schemes have a number of significant potential applications in networks. In particular, dimensionality reduction can be used to facilitate network operations (e.g., for anomaly/intrusion detection, reliability analysis, or for fault prediction) and network management (e.g., through visualization of high-dimensional networking data). A tabulated summary of various research works using dimensionality reduction techniques for various kinds of networking applications is provided in Table 10.

Dimensionality reduction techniques have been used to improve the effectiveness of anomaly/intrusion detection systems. Reference [255] proposed a DDoS detection system in SDN where dimensionality reduction is used for feature extraction and reduction in an unsupervised manner using stacked sparse autoencoders. Reference [256] proposed a flow-based anomaly intrusion detection using a replicator neural network. The proposed network is based on an encoder and decoder where the hidden layer between encoder and decoder performs the dimensionality reduction in an unsupervised manner; this process also corresponds to PCA. Similarly, [257] proposed another anomaly detection procedure where dimensionality reduction for feature extraction is performed using multi-scale PCA and then using wavelet analysis, so that the anomalous traffic is separated from the flow. Dimensionality reduction using robust PCA based on the minimum covariance determinant estimator for anomaly detection is presented in [258]. Reference [259] applied PCA for dimensionality reduction in a network intrusion detection application. To improve the performance of intrusion detection schemes, another algorithm based on dimensionality reduction for new feature learning using PCA is presented in [260], [261]. Reference [262] reviewed the dimensionality reduction schemes for intrusion detection in multimedia traffic and proposed an unsupervised feature selection

TABLE 9. Unsupervised learning techniques employed for network operations, optimizations and analytics.

scheme based on the dimensionality-reduced multimedia data.

Dimensionality reduction using autoencoders plays a vital role in fault prediction and reliability analysis of cellular networks; this work also recommends deep belief networks and autoencoders as logical fault prediction techniques for self-organizing networks [263]. Most Internet applications use encrypted traffic for communication. Previously, deep packet inspection (DPI) was considered a standard way of classifying network traffic, but with the varying nature of network applications and the randomization of port numbers and payload sizes, DPI has lost its significance. Authors in [264] have proposed a hybrid scheme for network traffic classification. The proposed scheme uses extreme machine learning, genetic algorithms and dimensionality reduction for feature selection and traffic classification. Reference [265] applied a fuzzy set theoretic approach for dimensionality reduction along with the fuzzy C-means clustering algorithm for the quality of web usage. In another work, [266] used Shrinking Sparse AutoEncoders (SSAE) for representing high-dimensional data and utilized SSAE in compressive sensing settings.

Visualization of high-dimensional data in a lower-dimensional representation is another application of dimensionality reduction. There are many relevant techniques such as PCA and t-SNE that can be used to extract the underlying structure of high-dimensional data, which can then be visualized to aid human insight seeking and decision making [144]. A number of researchers have proposed to utilize dimensionality reduction techniques to aid visualization of networking data. Reference [252] proposed a manifold learning based visualization tool for network traffic visualization and anomaly detection. Reference [267] proposed a PCA-based solution for the detection and visualization of networking attacks, in which

TABLE 10. Dimensionality reduction techniques employed for networking applications.

PCA is used for the dimensionality reduction of the feature vector extracted from the KDD network traffic dataset. Reference [268] used t-SNE for depicting malware fingerprints in their proposed network intrusion detection system. Reference [269] proposed a rectangular dualization scheme for visualizing the underlying network topology. Reference [270] used dimensionality reduction and t-SNE for clustering and visualization of botnet traffic. Finally, a lightweight platform for home Internet monitoring is presented in [271], where PCA and t-SNE are used for dimensionality reduction and visualization of the network traffic. A number of tools are readily available (e.g., Divvy [272] and Weka [273]) that implement dimensionality reduction and other unsupervised ML techniques (such as PCA and manifold learning) and allow exploratory data analysis and visualization of high-dimensional data.

Dimensionality reduction techniques and tools have been utilized in all kinds of networks, and we present some recent examples related to self-organizing networks (SONs) and software-defined radios (SDRs). Reference [274] proposed a semi-supervised learning scheme for anomaly detection in SON based on dimensionality reduction and a fuzzy classification technique. Reference [275] used minor component analysis (MCA) for dimensionality reduction as a preprocessing step for user-level statistical data in LTE-A networks to detect cell outage. Reference [247] used multi-dimensional scaling (MDS), a dimensionality reduction scheme, as part of the preprocessing step for cell outage detection in SON. Another data-driven approach by [276] also uses MDS for getting a low dimensional embedding of target key point indicator vector as a preprocessing step to automatically detect cell outage in SON. Reference [277] used PCA for dimensionality reduction of drive test samples to detect cell outages autonomously in SON. Conventional routing schemes are not sufficient for the fifth generation of communication systems. Reference [278] proposed a supervised deep learning based routing scheme for heterogeneous network traffic control. Although the supervised approach performed well, gathering a lot of heterogeneous traffic with labels and then processing it with a plain ANN is computationally intensive and prone to errors due to the imbalanced nature of the input data and the potential for overfitting. In 2017, [279] presented a deep learning based approach for routing and cost-effective packet processing. The proposed model uses a deep belief architecture and benefits from the dimensionality reduction property of the restricted Boltzmann machine. The proposed work also provides a novel Graphics Processing Unit (GPU) based router architecture. The detailed analysis shows that deep learning based SDR and routing techniques can meet the changing network requirements and massive network traffic growth. The routing scheme proposed in [279] outperforms the conventional open shortest path first (OSPF) routing technique in terms of throughput and average delay per hop.

E. EMERGING NETWORKING APPLICATIONS OF UNSUPERVISED LEARNING
Next generation network architectures such as Software-defined Networks (SDN), Self Organizing Networks (SON), and the Internet of Things (IoT) are expected to be the basis of future intelligent, adaptive, and dynamic networks [280]. ML techniques will be at the center of this revolution, providing the aforementioned properties. This subsection covers the recent applications of unsupervised ML techniques in SDNs, SONs, and IoTs.

1) SOFTWARE DEFINED NETWORKS
SDN is a disruptive new networking architecture that simplifies network operating and managing tasks and provides infrastructural support for novel innovations by making the network programmable [281]. In simple terms, the idea of programmable networks is to simply decouple the data forwarding plane and the control/decision plane, which are rather tightly coupled in the current infrastructure. The use of SDN can also be seen in managing and optimizing networks, as network operators go through a lot of hassle to implement high-level security policies in terms of distributed low-level system configurations. SDN resolves this issue by decoupling the planes and giving network operators better control and visibility over the network, enabling them to make frequent changes to network state and providing support for high-level specification languages for network control [282]. SDN is applicable in a wide variety of areas ranging from enterprise networks, data centers, infrastructure based wireless access networks, and optical networks to home and small businesses, each providing many future research opportunities [281].

Unsupervised ML techniques are seeing a surging interest in the SDN community, as can be seen by a spate of recent work. A popular application of unsupervised ML techniques in SDNs relates to intrusion detection and mitigation of security attacks [283]. Another approach for detecting anomalies in a cloud environment using an unsupervised learning model has been proposed by [284]; it uses SOM to capture emergent system behavior and predict unknown and novel anomalies without any prior training or configuration. A DDoS detection system for SDN is presented in [255], where stacked autoencoders are used to detect DDoS attacks. A density peak based clustering algorithm for DDoS attacks is proposed as a new method to review the potential of using SDN to develop an efficient anomaly detection method [285]. Reference [286] recently presented an intelligent threat aware response system for SDN using reinforcement learning; this work also recommends using unsupervised feature learning to improve the threat detection process. Another framework for anomaly detection, classification, and mitigation for SDN is presented in [287], where unsupervised learning is used for traffic feature analysis. Reference [288] presented a forensic framework for SDN and recommended K-means clustering for anomaly detection in SDN. Another work [289] discusses the potential opportunities for using unsupervised learning for traffic classification in SDN. Moreover, deep learning and distributed processing can also be applied to such models in order to better adapt to evolving networks and contribute to the future of SDN infrastructure as a service.

2) SELF ORGANIZING NETWORKS
SON is another new and popular research regime in networking. SON is inspired by biological systems that work in a self-organizing manner and achieve their tasks by learning from the surrounding environment. As the number of connected network devices is growing exponentially, and the communication cell size has reduced to femtocells, the property of self-organization is becoming increasingly desirable [290]. The feasibility of SON application in the fifth generation (5G) of wireless communication is studied in [291], and the study shows that without (supervised as well as unsupervised) ML support, SON is not possible. Application of ML techniques in SON has become a very important research area as it involves learning from the surroundings for intelligent decision-making and reliable communication [2].

Application of different ML-based SON for heterogeneous networks is considered in [292]; this paper also describes the unsupervised ANN and hidden Markov model techniques employed for better learning from the surroundings and adapting accordingly. PCA and clustering are the two most used unsupervised learning schemes utilized for parameter optimization and feature learning in SON [290]. These ML schemes are used in self-configuration, self-healing, and self-optimization schemes. Game theory is another unsupervised learning approach used for designing self-optimization and greedy self-configuration design of SON systems [293]. Authors in [294] proposed an unsupervised ANN for link quality estimation of SON which outperformed simple moving average and exponentially weighted moving averages.

3) INTERNET OF THINGS
IoT is an emerging paradigm with a growing academic and industry interest. IoT is an abstraction of intelligent, physical and virtual devices with unique identities, connected together to form a cyber-physical framework. These devices collect, analyze and transmit data to public or private cloud for intelligent [295]. IoT is a new networking paradigm and it is expected to be deployed in health care, smart cities, home automation, agriculture, and industry. With such a vast range of applications, IoT needs ML to collect and analyze data to make intelligent decisions. The key challenge that IoT must deal with is the extremely large scale (billions of devices) of future IoT deployments [296]. Designing, analyzing and predicting are the three major tasks and all involve ML; a few examples of unsupervised ML are shared next. Reference [297] recommends using unsupervised ML techniques for feature extraction and supervised learning for classification and predictions. Given the scale of the IoT, a large amount of data is expected in the network, which therefore requires a load balancing method; a load balancing algorithm based on a restricted Boltzmann machine is proposed in [298]. An online clustering scheme for dynamic IoT data streams is described in [299]. Another work describing an ML application in IoT recommends a combination of PCA and regression for IoT to get better prediction [300]. Usage of clustering techniques in embedded systems for IoT applications is presented in [301]. An application using denoising autoencoders for acoustic modeling in IoT is presented in [302].
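
The idea of clustering an IoT data stream online, as mentioned above, can be illustrated with a short sketch. The following is a minimal example of sequential (online) k-means, assuming well-separated clusters and a stream that is read only once. It is not the scheme of [299]; the class name `OnlineKMeans`, the helper `synthetic_stream`, and the two made-up sensor populations are hypothetical and chosen only for illustration.

```python
import math
import random


def nearest(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(centroids[i], point))


class OnlineKMeans:
    """Sequential k-means: a single pass over the stream, O(k) memory."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # running mean of each cluster
        self.counts = []     # number of samples absorbed by each cluster

    def update(self, point):
        """Assign one sample to a cluster and move that centroid toward it."""
        point = list(point)
        if len(self.centroids) < self.k:
            # Simplifying assumption: seed clusters with the first k samples.
            self.centroids.append(point)
            self.counts.append(1)
            return len(self.centroids) - 1
        i = nearest(self.centroids, point)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]  # step size that yields a running mean
        self.centroids[i] = [c + rate * (x - c)
                             for c, x in zip(self.centroids[i], point)]
        return i


def synthetic_stream(n_per_source=200, seed=0):
    """Yield alternating readings from two made-up sensor populations."""
    rng = random.Random(seed)
    # Hypothetical (temperature, humidity) means, for illustration only.
    sources = [(20.0, 30.0), (35.0, 80.0)]
    for _ in range(n_per_source):
        for mx, my in sources:
            yield (rng.gauss(mx, 1.5), rng.gauss(my, 1.5))


model = OnlineKMeans(k=2)
for reading in synthetic_stream():
    model.update(reading)
print([[round(v, 1) for v in c] for c in model.centroids])
```

Each update needs only k distance computations and keeps k centroids and counters in memory, which is what makes this style of clustering attractive for resource-constrained devices. Seeding with the first k samples is a simplification: a poorly ordered stream can place several seeds in the same cluster, so a practical system would likely need a more careful initialization.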

F. LESSONS LEARNT
Key lessons drawn from the review of unsupervised learning in networking applications are summarized below:
1) A recommended and well-studied method for unsupervised Internet traffic classification in literature is data clustering combined with latent representation learning on the traffic feature set by using autoencoders. Min-max ensemble learning will help to increase the efficiency of unsupervised learning if required.
2) Semi-supervised learning is also an appropriate method for Internet traffic classification, given that some labeled traffic data and channel characteristics are available for initial model training.
3) Application of generative models and transfer learning for Internet traffic classification has not been explored properly in literature and can be a potential research direction.
4) The overwhelming growth in network traffic and the expected surge in traffic with the evolution of 5G and IoT also elevate the level of threats and anomalies in network traffic. To deal with these anomalies in Internet traffic, data clustering, PCA, SOM, and ART are well explored unsupervised learning techniques in the literature. Self-taught learning has also been explored as a potential solution for anomaly detection and remains a possible direction for future research in anomaly detection in network traffic.
5) The current state of the art in dimensionality reduction in network traffic is based on PCA and multidimensional scaling. Autoencoders, t-SNE, and manifold learning are potential areas of research in terms of dimensionality reduction and visualization.

IV. FUTURE WORK: SOME RESEARCH CHALLENGES AND OPPORTUNITIES
This section provides a discussion on some open directions for future work and the relevant opportunities in applying unsupervised ML in the field of networking.

A. SIMPLIFIED NETWORK MANAGEMENT
While new network architectures such as SDN have been proposed in recent years to simplify network management, network operators are still expected to know too much, and to correlate what they know about how their network is designed with the current network's condition through their monitoring sources. Operators who manage these requirements by wrestling with complexity manually will definitely welcome any respite that they can get from (semi-)automated unsupervised machine learning. As highlighted in [303], for ML to become pervasive in networking, the "semantic gap" (which refers to the key challenge of transferring ML results into actionable insights and reports for the network operator) must be overcome. This can facilitate a shift from a reactive interaction style for network management, where the network manager is expected to check maps and graphs when things go wrong, to a proactive one, where automated reports and notifications are created for different services and network regions. Ideally, this would be abstract yet informative, such as Google Maps Directions, e.g., "there is heavier traffic than usual on your route", as well as suggestions about possible actions. This could be coupled with an automated correlation of different reports coming from different parts of the network. This will require a move beyond mere notifications and visualizations to more substantial synthesis through which potential sources of problems can be identified. Another example relates to making measurements more user-oriented. Most users would be more interested in QoE instead of QoS, i.e., how the current condition of the network affects their applications and services rather than just raw QoS metrics. The development of measurement objectives should be from a business-eyeball perspective, and not only through presenting statistics gathered through various tools and protocols such as traceroute, ping, BGP, etc., with the burden of putting the various pieces of knowledge together being on the user.

B. SEMI-SUPERVISED LEARNING FOR COMPUTER NETWORKS
Semi-supervised learning lies between supervised and unsupervised learning. The idea behind semi-supervised learning is to improve the learning ability by using unlabeled data in conjunction with a small set of labeled examples. In computer networks, semi-supervised learning is partially used in anomaly detection and traffic classification and has great potential to be used with deep unsupervised learning architectures like generative adversarial networks for improving the state of the art in anomaly detection and traffic classification. Similarly, user behavior learning for cybersecurity can also be tackled in a semi-supervised fashion. A semi-supervised learning based anomaly detection approach is presented in [304]. The presented approach used large amounts of unlabeled samples together with labeled samples to build a better intrusion detection classifier. In particular, a single hidden layer feed-forward NN is trained to output a fuzzy membership vector. The results show that using unlabeled samples helps significantly improve the classifier's performance. In another work, [305] proposed semi-supervised learning with 97% accuracy to filter out non-malicious data in millions of queries that Domain Name Service (DNS) servers receive.

C. TRANSFER LEARNING IN COMPUTER NETWORKS
Transfer learning is an emerging ML technique in which knowledge learned from one problem is applied to a different but related problem [306]. Although it is often thought that for ML algorithms, the training and future data must be in the same feature space and must have the same distribution, this is not necessarily the case in many real-world applications. In such cases, it is desirable to have transfer learning or knowledge transfer between the different task domains. Transfer learning has been successfully applied in computer vision and NLP applications but its implementation for networking has not been witnessed, even though in principle this can be useful in networking as well due to the similar
nature of Internet traffic and enterprise network traffic in many respects. Reference [307] used a transfer learning based caching procedure for wireless networks providing backhaul offloading in 5G networks.

D. FEDERATED LEARNING IN COMPUTER NETWORKS
Federated learning is a collaborative ML technique which does not make use of centralized training data and works by distributing the processing on different machines. Federated learning is considered to be the next big thing in cloud networks as it ensures the privacy of user data and requires less computation on the cloud, reducing cost and energy [308]. A system and method for network address management in the federated cloud are presented in [309], and the application of federated IoT and cloud computing for health care is presented in [310]. An end-to-end security architecture for federated cloud and IoT is presented in [311].

E. GENERATIVE ADVERSARIAL NETWORKS (GANS) IN COMPUTER NETWORKS
Adversarial networks, based on generative adversarial network (GAN) training originally proposed by Goodfellow and colleagues at the University of Montreal [312], have recently emerged as a new technique using which machines can be trained to predict outcomes by only observing the world (without necessarily being provided labeled data). An adversarial network has two NN models: a generator, which is responsible for generating some type of data from some random input, and a discriminator, which has the task of distinguishing between input from the generator and a real data set. The two NNs optimize themselves together, resulting in a more realistic generation of data by the generator, and a better sense of what is plausible in the real world for the discriminator. Reference [313] proposed a GAN for generating malware examples to attack a malware classifier and then proposes a defense against it. Another adversarial perturbation attack on a malware classifier is proposed in [314]. The use of GANs for ML in networking can improve the performance of ML-based networking applications such as anomaly detection, in which malicious users have an incentive to adversarially craft new attacks to avoid detection by network managers.

V. PITFALLS AND CAVEATS OF USING UNSUPERVISED ML IN NETWORKING
With the benefits and intriguing results of unsupervised learning, there also exist many shortcomings that are not addressed widely in the literature. Some potential pitfalls and caveats related to unsupervised learning are discussed next.

A. INAPPROPRIATE TECHNIQUE SELECTION
To start with, the first potential pitfall could be the selection of technique. Different unsupervised learning and predicting techniques may have excellent results on some applications while performing poorly on others; it is important to choose the best technique for the task at hand. Another reason could be a poor selection of the features or parameters on the basis of which predictions are made; thus, parameter optimization is also important for unsupervised algorithms.

B. LACK OF INTERPRETABILITY OF SOME UNSUPERVISED ML ALGORITHMS
Some unsupervised algorithms such as deep NNs operate as a black box, which makes it difficult to explain and interpret the working of such models. This makes the use of such techniques unsuitable for applications in which interpretability is important. As pointed out in [303], understandability of the semantics of the decisions made by ML is especially important for the operational success of ML in large-scale operational networks and its acceptance by operators, network managers, and users. But prediction accuracy and simplicity are often in conflict [315]. As an example, the greater accuracy of NNs accrues from their complex nature, in which input variables are combined in a nonlinear fashion to build a complicated hard-to-explain model; with NNs it may not be possible to get interpretability as well, since they make a tradeoff in which they sacrifice interpretability to achieve high accuracy. There are various ongoing research efforts that are focused on making techniques such as NNs less opaque [316]. Apart from the focus on NNs, there is a general interest in making AI and ML more explainable and interpretable; e.g., the Defense Advanced Research Projects Agency (DARPA) explainable AI project (https://www.darpa.mil/program/explainable-artificial-intelligence) is aiming to develop explainable AI models (leveraging various design options spanning the performance-vs-explainability trade-off space) that can explain the rationale of their decision-making so that users are able to appropriately trust these models, particularly for new envisioned control applications in which optimization decisions are made autonomously by algorithms.

C. LACK OF OPERATIONAL SUCCESS OF ML IN NETWORKING
In literature, researchers have noted that despite substantial academic research, and practical applications of unsupervised learning in other fields, there is a dearth of practical applications of ML solutions in operational networks, particularly for applications such as network intrusion detection [303], which are challenging problems for a number of reasons including 1) the very high cost of errors; 2) the lack of training data; 3) the semantic gap between results and their operational interpretation; 4) enormous variability in input data; and finally, 5) fundamental difficulties in conducting sound performance evaluations. Even for other applications, the success of ML and its wide adoption in practical systems at scale lags the success of ML solutions in many other domains.

D. IGNORING SIMPLE NON-MACHINE-LEARNING BASED TOOLS
One should also keep in mind a common pitfall that academic researchers may suffer from, which is not realizing that

FIGURE 8. Intuitively, we expect the ML model’s performance to improve with more data but to
deteriorate in performance if the model becomes overly complex for the data. Figure adapted from [317].

network operators may have simpler non-machine learning based solutions that may work as well as naïve ML-based solutions in practical settings. Failure to examine the ground realities of operational networks will undermine the effectiveness of ML-based solutions. We should expect ML-based solutions to augment and supplement rather than replace other non-machine-learning based solutions, at least for the foreseeable future.

E. OVERFITTING
Another potential issue with unsupervised models is overfitting; it corresponds to a model representing the noise or random error rather than learning the actual pattern in data. While commonly associated with supervised ML, the problem of overfitting lurks whenever we learn from data and thus is applicable to unsupervised ML as well. As illustrated in Figure 8, ideally speaking, we expect ML algorithms to provide improved performance with more data; but with increasing model complexity, performance starts to deteriorate after a certain point, although it is possible to get poorer results empirically with increasing data when working with unoptimized out-of-the-box ML algorithms [317]. According to the Occam's razor principle, the model complexity should be commensurate with the amount of data available, and with overly complex models, the ability to predict and generalize diminishes. Two major reasons for overfitting could be the overly large size of the learning model and fewer sample data used for training purposes. Generally, data is divided into two portions (actual data and stochastic noise); due to the unavailability of labels or related information, an unsupervised learning model can overfit the data, which causes issues in

mar the performance of ML algorithms. A potential problem is that the dataset may be imbalanced if the sample size from one class is very much smaller or larger than the other classes [319]. In such imbalanced datasets, the algorithm must be careful not to ignore the rare class by assuming it to be noise. Although imbalanced datasets are more of a nuisance for supervised learning techniques, they may also pose problems for unsupervised and semi-supervised learning techniques.

G. INACCURATE MODEL BUILDING
It is difficult to build accurate and generic models since each model is optimized for certain kinds of applications. Unsupervised ML models should be applied after carefully studying the application and the suitability of the algorithm in such settings [320]. For example, we highlight certain issues related to the unsupervised task of clustering: 1) random initialization in K-means is not recommended; 2) the number of clusters is not known before the clustering operation as we do not have labels; 3) in the case of hierarchical clustering, we do not know when to stop and this can cause an increase in the time complexity of the process; and 4) evaluating the clustering result is very tricky since the ground truth is mostly unknown.

H. MACHINE LEARNING IN ADVERSARIAL ENVIRONMENTS
Many networking problems, such as anomaly detection, are adversarial problems in which the malicious intruder is continually trying to outwit the network administrators (and the tools used by the network administrators). In such settings,
testing and deployment phase. Cross-validation, regulariza- machine learning that learns from historical data may not
tion, and Chi-squared testing are highly recommended for perform due to clever crafting of attacks specifically for
designing or tweaking an unsupervised learning algorithm to circumventing any schemes based on previous data.
avoid overfitting [318]. Due to these challenges, pitfalls, and weaknesses, due
care must be exercised while using unsupervised and semi-
F. DATA QUALITY ISSUES supervised ML. These pitfalls can be avoided in part by using
It should be noted that all ML is data dependent, and the per- various best practices [321], such as end-to-end learning
formance of ML algorithms is affected largely by the nature, pipeline testing, visualization of the learning algorithm, regu-
volume, quality, and representation of data. In the case of larization, proper feature engineering, dropout, sanity checks
unsupervised ML data quality issues must be carefully con- through human inspection—whichever is appropriate for the
sidered since any problem with the data quality will seriously problem’s context.
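As a minimal sketch of the cross-validation advice in the overfitting discussion above, the following Python example selects the complexity of an unsupervised density model using held-out data. Everything in it is an illustrative assumption rather than a method taken from the survey: the synthetic two-mode feature, the histogram density estimator, the smoothing constant `alpha`, and the helper names (`sample`, `fit_histogram`, `select_bins`) are all hypothetical.

```python
import math
import random

LO, HI = 0.0, 10.0  # assumed support of a hypothetical traffic feature


def sample(rng, n):
    """Draw n values from a two-component Gaussian mixture, clipped to [LO, HI]."""
    out = []
    for _ in range(n):
        mu = 3.0 if rng.random() < 0.5 else 7.0
        out.append(min(max(rng.gauss(mu, 0.5), LO), HI))
    return out


def bin_index(x, bins):
    """Map a value in [LO, HI] to its histogram bin."""
    return min(int((x - LO) / (HI - LO) * bins), bins - 1)


def fit_histogram(train, bins, alpha=0.1):
    """Histogram density estimate; alpha is additive smoothing (a regularizer)."""
    counts = [0] * bins
    for x in train:
        counts[bin_index(x, bins)] += 1
    width = (HI - LO) / bins
    total = len(train) + alpha * bins
    return [(c + alpha) / (total * width) for c in counts]


def mean_log_likelihood(data, density):
    """Average log-density of the data under a fitted histogram."""
    bins = len(density)
    return sum(math.log(density[bin_index(x, bins)]) for x in data) / len(data)


def select_bins(train, held_out, candidates):
    """Score each bin count on training and held-out data; pick the best held-out one."""
    results = {}
    for bins in candidates:
        density = fit_histogram(train, bins)
        results[bins] = (mean_log_likelihood(train, density),
                         mean_log_likelihood(held_out, density))
    best = max(candidates, key=lambda b: results[b][1])
    return best, results


if __name__ == "__main__":
    rng = random.Random(0)
    train, held_out = sample(rng, 200), sample(rng, 200)
    best, results = select_bins(train, held_out, [1, 4, 16, 64, 512])
    for bins, (train_ll, held_out_ll) in results.items():
        print(f"bins={bins:4d}  train={train_ll:.3f}  held-out={held_out_ll:.3f}")
    print("selected bins:", best)
```

In a sketch like this, the training log-likelihood tends to keep improving as bins are added, while the held-out log-likelihood peaks at a moderate bin count; the gap between the two is the overfitting behaviour the text warns about. A full k-fold cross-validation would repeat the split several times and average the held-out scores.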

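The clustering issues listed under inaccurate model building (initialization, an unknown number of clusters, and evaluation without ground truth) can be illustrated with a small pure-Python sketch. It swaps in two common remedies that the survey itself does not prescribe: k-means++ seeding in place of uniformly random initialization, and the silhouette coefficient as a label-free score for comparing candidate values of k. The synthetic blobs and all function names are hypothetical.

```python
import math
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: favour centroids far from those already chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist2(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # every point coincides with a centroid
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, rng, n_init=5, max_iter=50):
    """Lloyd's algorithm with restarts; returns labels of the lowest-inertia run."""
    best_inertia, best_labels = math.inf, None
    for _ in range(n_init):
        centroids = kmeans_pp_init(points, k, rng)
        for _ in range(max_iter):
            labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                      for p in points]
            updated = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                               if members else centroids[j])
            if updated == centroids:
                break
            centroids = updated
        inertia = sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, labels
    return best_labels


def silhouette(points, labels):
    """Mean silhouette coefficient; needs no ground-truth labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def make_blobs(rng, per_cluster=30):
    """Three well-separated synthetic 2-D blobs (stand-ins for flow features)."""
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
            for cx, cy in centers for _ in range(per_cluster)]


def choose_k(points, k_values, seed=0):
    """Cluster for each candidate k and return the k with the best silhouette."""
    rng = random.Random(seed)
    scores = {k: silhouette(points, kmeans(points, k, rng)) for k in k_values}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best_k, scores = choose_k(make_blobs(random.Random(1)), [2, 3, 4, 5])
    print("silhouette by k:", {k: round(s, 3) for k, s in scores.items()})
    print("selected k:", best_k)
```

The silhouette coefficient is only one of several internal validity indices, and it favours compact, well-separated clusters; on real network traces with overlapping or elongated clusters, it should be treated as a sanity check rather than a substitute for domain inspection.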

VI. CONCLUSIONS
We have provided a comprehensive survey of machine learning tasks, the latest unsupervised learning techniques, and trends, along with a detailed discussion of the applications of these techniques in networking-related tasks. Despite the recent wave of success of unsupervised learning, there is a scarcity of unsupervised learning literature for computer networking applications, which this survey aims to address. The few previously published survey papers differ from our work in their focus, scope, and breadth; we have written this paper in a manner that carefully synthesizes the insights from these survey papers while also providing contemporary coverage of recent advances. Due to the versatility and evolving nature of computer networks, it was impossible to cover each and every application; however, an attempt has been made to cover all the major networking applications of unsupervised learning and the relevant techniques. We have also concisely presented future work and open research areas in networking that relate to unsupervised learning, coupled with a brief discussion of significant pitfalls and challenges in using unsupervised machine learning in networks.

ACKNOWLEDGEMENT
The statements made herein are solely the responsibility of the authors.

REFERENCES
[1] R. W. Thomas et al., “Cognitive networks,” in Cognitive Radio, Software Defined Radio, and Adaptive Wireless Systems. Dordrecht, The Netherlands: Springer, 2007, pp. 17–41.
[2] S. Latif, F. Pervez, M. Usama, and J. Qadir. (2017). “Artificial intelligence as an enabler for cognitive self-organizing future networks.” [Online]. Available: https://arxiv.org/abs/1702.02823
[3] J. Qadir, K.-L. A. Yau, M. A. Imran, Q. Ni, and A. V. Vasilakos, “IEEE access special section editorial: Artificial intelligence enabled networking,” IEEE Access, vol. 3, pp. 3079–3082, 2015.
[4] S. Suthaharan, “Big data classification: Problems and challenges in network intrusion prediction with machine learning,” ACM SIGMETRICS Perform. Eval. Rev., vol. 41, no. 4, pp. 70–73, Apr. 2014.
[5] S. Shenker, M. Casado, T. Koponen, and N. McKeown, “The future of networking, and the past of protocols,” Open Netw. Summit, vol. 20, pp. 1–30, Oct. 2011.
[6] A. Malik, J. Qadir, B. Ahmad, K.-L. A. Yau, and U. Ullah, “QoS in IEEE 802.11-based wireless networks: A contemporary review,” J. Netw. Comput. Appl., vol. 55, pp. 24–46, Sep. 2015.
[7] N. Feamster and J. Rexford. (2017). “Why (and how) networks should run themselves.” [Online]. Available: https://arxiv.org/abs/1710.11583
[8] J. Jiang et al., “Unleashing the potential of data-driven networking,” in Proc. Int. Conf. Commun. Syst. Netw. Cham, Switzerland: Springer, 2017.
[9] A. Patcha and J.-M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Comput. Netw., vol. 51, no. 12, pp. 3448–3470, Aug. 2007.
[10] T. T. T. Nguyen and G. Armitage, “A survey of techniques for Internet traffic classification using machine learning,” IEEE Commun. Surveys Tuts., vol. 10, no. 4, pp. 56–76, 4th Quart., 2008.
[11] M. Bkassiny, Y. Li, and S. K. Jayaweera, “A survey on machine-learning techniques in cognitive radios,” IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1136–1159, Jul. 2013.
[12] M. A. Alsheikh, S. Lin, D. Niyato, and H. P. Tan, “Machine learning in wireless sensor networks: Algorithms, strategies, and applications,” IEEE Commun. Surveys Tuts., vol. 16, no. 4, pp. 1996–2018, 4th Quart., 2014.
[13] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2016.
[14] P. V. Klaine, M. A. Imran, O. Onireti, and R. D. Souza, “A survey of machine learning techniques applied to self-organizing cellular networks,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2392–2431, 4th Quart., 2017.
[15] A. Meshram and C. Haas, “Anomaly detection in industrial networks using machine learning: A roadmap,” in Machine Learning for Cyber Physical Systems. Berlin, Germany: Springer, 2017, pp. 65–72.
[16] Z. M. Fadlullah et al., “State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s intelligent network traffic control systems,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2432–2455, 4th Quart., 2017.
[17] E. Hodo, X. Bellekens, A. Hamilton, C. Tachtatzis, and R. Atkinson. (2017). “Shallow and deep networks intrusion detection system: A taxonomy and survey.” [Online]. Available: https://arxiv.org/abs/1701.02145
[18] M. A. Al-Garadi, A. Mohamed, A. Al-Ali, X. Du, and M. Guizani. (2018). “A survey of machine and deep learning methods for Internet of Things (IoT) security.” [Online]. Available: https://arxiv.org/abs/1807.11023
[19] M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth, “Machine learning for Internet of Things data analysis: A survey,” Digit. Commun. Netw., vol. 4, no. 3, pp. 161–175, 2018.
[20] R. Boutaba et al., “A comprehensive survey on machine learning for networking: Evolution, applications and research opportunities,” J. Internet Services Appl., vol. 9, p. 16, Jun. 2018.
[21] L. Cui, S. Yang, F. Chen, Z. Ming, N. Lu, and J. Qin, “A survey on application of machine learning for Internet of Things,” Int. J. Mach. Learn. Cybern., vol. 9, no. 8, pp. 1399–1417, 2018.
[22] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
[23] L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Trans. Signal Inf. Process., vol. 3, pp. 1–29, May 2014.
[24] J. Qadir, “Artificial intelligence based cognitive routing for cognitive radio networks,” Artif. Intell. Rev., vol. 45, no. 1, pp. 25–96, Jan. 2016.
[25] N. Ahad, J. Qadir, and N. Ahsan, “Neural networks in wireless networks: Techniques, applications and guidelines,” J. Netw. Comput. Appl., vol. 68, pp. 1–27, Jun. 2016.
[26] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, Eds., Feature Extraction: Foundations and Applications, vol. 207. Heidelberg, Germany: Springer, 2008.
[27] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 215–223.
[28] M. J. S. M. S. Mohammad Lotfollahi and R. Shirali. (2017). “Deep packet: A novel approach for encrypted traffic classification using deep learning.” [Online]. Available: https://arxiv.org/abs/1709.02656
[29] W. Wang, M. Zhu, X. Zeng, X. Ye, and Y. Sheng, “Malware traffic classification using convolutional neural network for representation learning,” in Proc. Int. Conf. Inf. Netw. (ICOIN), Jan. 2017, pp. 712–717.
[30] M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, “Autoencoder-based feature learning for cyber security applications,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3854–3861.
[31] R. C. Aygun and A. G. Yavuz, “Network anomaly detection with stochastically improved autoencoder based models,” in Proc. IEEE 4th Int. Conf. Cyber Secur. Cloud Comput. (CSCloud), Jun. 2017, pp. 193–198.
[32] M. K. Putchala, “Deep learning approach for intrusion detection system (IDS) in the Internet of Things (IoT) network using gated recurrent neural networks (GRU),” Ph.D. dissertation, Wright State Univ., Dayton, OH, USA, 2017.
[33] A. Tuor, S. Kaplan, B. Hutchinson, N. Nichols, and S. Robinson, “Deep learning for unsupervised insider threat detection in structured cybersecurity data streams,” in Proc. Workshops 31st AAAI Conf. Artif. Intell., 2017, pp. 1–8.
[34] E. Aguiar, A. Riker, M. Mu, and S. Zeadally, “Real-time QoE prediction for multimedia applications in wireless mesh networks,” in Proc. IEEE Consum. Commun. Netw. Conf. (CCNC), Jan. 2012, pp. 592–596.
[35] K. Piamrat, A. Ksentini, C. Viho, and J.-M. Bonnin, “QoE-aware admission control for multimedia applications in IEEE 802.11 wireless networks,” in Proc. IEEE 68th Veh. Technol. Conf. (VTC-Fall), Sep. 2008, pp. 1–5.
[36] K. Karra, S. Kuzdeba, and J. Petersen, “Modulation recognition using hierarchical deep neural networks,” in Proc. IEEE Int. Symp. Dyn. Spectr. Access Netw. (DySPAN), Mar. 2017, pp. 1–3.
[37] M. D. Ming Zhang and L. Guo, “Convolutional neural networks for automatic cognitive radio waveform recognition,” IEEE Access, vol. 5, pp. 11074–11082, 2017.


[38] J. Moysen and L. Giupponi. (2017). “From 4G to 5G: Self-organized network management meets machine learning.” [Online]. Available: https://arxiv.org/abs/1707.09300
[39] X. Xie, D. Wu, S. Liu, and R. Li. (2017). “IoT data analytics using deep learning.” [Online]. Available: https://arxiv.org/abs/1708.03854
[40] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[41] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[42] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[43] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[44] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in Neural Information Processing Systems, vol. 19. Cambridge, MA, USA: MIT Press, 2007, p. 153.
[45] C. Poultney, S. Chopra, and Y. L. Cun, “Efficient learning of sparse representations with an energy-based model,” in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 1137–1144.
[46] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng, “On optimization methods for deep learning,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 265–272.
[47] C. Doersch. (2016). “Tutorial on variational autoencoders.” [Online]. Available: https://arxiv.org/abs/1606.05908
[48] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[49] T. Kohonen, “The self-organizing map,” Neurocomputing, vol. 21, nos. 1–3, pp. 1–6, 1998.
[50] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6, p. 386, 1958.
[51] S. S. Haykin, Neural Networks and Learning Machines, vol. 3. Upper Saddle River, NJ, USA: Pearson, 2009.
[52] G. A. Carpenter and S. Grossberg, Adaptive Resonance Theory. Boston, MA, USA: Springer, 2010, pp. 22–35.
[53] J. Karhunen, T. Raiko, and K. Cho, “Unsupervised deep learning: A short review,” in Advances in Independent Component Analysis and Learning Machines. New York, NY, USA: Academic, 2015, pp. 125–142.
[54] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 609–616.
[55] S. Baraković and L. Skorin-Kapov, “Survey and challenges of QoE management issues in wireless networks,” J. Comput. Netw. Commun., vol. 2013, Dec. 2013, Art. no. 165146.
[56] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. (2013). “How to construct deep recurrent neural networks.” [Online]. Available: https://arxiv.org/abs/1312.6026
[57] M. Klapper-Rybicka, N. N. Schraudolph, and J. Schmidhuber, “Unsupervised learning in LSTM recurrent neural networks,” in Proc. Int. Conf. Artif. Neural Netw. Berlin, Germany: Springer, 2001, pp. 684–691.
[58] G. E. Hinton, “Boltzmann machine,” Encyclopedia Mach. Learn., vol. 2, no. 5, p. 1668, 2007.
[59] R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), Clearwater Beach, FL, USA, 2009, pp. 448–455.
[60] K. Tsagkaris, A. Katidiotis, and P. Demestichas, “Neural network-based learning schemes for cognitive radio systems,” Comput. Commun., vol. 31, no. 14, pp. 3394–3404, 2008.
[61] F. Henrique, T. Vieira, and L. L. Lee, “A neural architecture based on the adaptive resonant theory and recurrent neural networks,” in Proc. IJCSA, vol. 4, no. 3, 2007, pp. 45–56.
[62] D. Munaretto, D. Zucchetto, A. Zanella, and M. Zorzi, “Data-driven QoE optimization techniques for multi-user wireless networks,” in Proc. Int. Conf. Comput., Netw. Commun. (ICNC), Feb. 2015, pp. 653–657.
[63] L. Badia, D. Munaretto, A. Testolin, A. Zanella, and M. Zorzi, “Cognition-based networks: Applying cognitive science to multimedia wireless networking,” in Proc. IEEE 15th Int. Symp. World Wireless, Mobile Multimedia Netw. (WoWMoM), Jun. 2014, pp. 1–6.
[64] N. Grira, M. Crucianu, and N. Boujemaa, “Unsupervised and semi-supervised clustering: A brief survey,” Rev. Mach. Learn. Techn. Process. Multimedia Content, vol. 1, pp. 9–16, Jul. 2004.
[65] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping Multidimensional Data. Berlin, Germany: Springer, 2006, pp. 25–71.
[66] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network anomaly detection: Methods, systems and tools,” IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 303–336, 1st Quart., 2014.
[67] A. McGregor et al., “Flow clustering using machine learning techniques,” in Proc. Int. Workshop Passive Act. Netw. Meas. Berlin, Germany: Springer, 2004, pp. 205–214.
[68] R. Xu and D. Wunsch, II, “Survey of clustering algorithms,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[69] M. Rehman and S. A. Mehdi, “Comparison of density-based clustering algorithms,” Lahore College Women Univ., Lahore, Pakistan, Tech. Rep., 2005.
[70] Y. Chen and L. Tu, “Density-based clustering for real-time stream data,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 133–142.
[71] K. Leung and C. Leckie, “Unsupervised anomaly detection in network intrusion detection using clusters,” in Proc. 28th Australas. Conf. Comput. Sci., vol. 38, 2005, pp. 333–342.
[72] J. Paparrizos and L. Gravano, “k-shape: Efficient and accurate clustering of time series,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2015, pp. 1855–1870.
[73] P. Mangiameli, S. K. Chen, and D. West, “A comparison of SOM neural network and hierarchical clustering methods,” Eur. J. Oper. Res., vol. 93, no. 2, pp. 402–417, 1996.
[74] P. Orbanz and Y. W. Teh, “Bayesian nonparametric models,” in Encyclopedia of Machine Learning. Boston, MA, USA: Springer, 2010, pp. 81–89.
[75] B. Kurt, A. T. Cemgil, M. Mungan, and E. Saygun, “Bayesian nonparametric clustering of network traffic data,” Tech. Rep.
[76] X. Jin and J. Han, “Partitional clustering,” in Encyclopedia of Machine Learning. Boston, MA, USA: Springer, 2010, p. 766.
[77] S. R. Gaddam, V. V. Phoha, and K. S. Balagani, “K-means+ID3: A novel method for supervised anomaly detection by cascading k-means clustering and ID3 decision tree learning methods,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 3, pp. 345–354, Mar. 2007.
[78] L. Yingqiu, L. Wei, and L. Yunchun, “Network traffic classification using K-Means clustering,” in Proc. 2nd Int. Multi-Symp. Comput. Comput. Sci. (IMSCCS), Aug. 2007, pp. 360–365.
[79] M. Jianliang, S. Haikun, and B. Ling, “The application on intrusion detection based on k-means cluster algorithm,” in Proc. Int. Forum Inf. Technol. Appl. (IFITA), vol. 1, 2009, pp. 150–152.
[80] R. Chitrakar and H. Chuanhe, “Anomaly detection using support vector machine classification with k-medoids clustering,” in Proc. 3rd Asian Himalayas Int. Conf. Internet (AH-ICI), Nov. 2012, pp. 1–5.
[81] R. Chitrakar and H. Chuanhe, “Anomaly based intrusion detection using hybrid learning approach of combining k-medoids clustering and naive Bayes classification,” in Proc. 8th Int. Conf. Wireless Commun., Netw. Mobile Comput. (WiCOM), Sep. 2012, pp. 1–5.
[82] M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 381–396, Mar. 2002.
[83] M. E. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proc. Nat. Acad. Sci. USA, vol. 104, no. 23, pp. 9564–9569, 2007.
[84] M. Bahrololum and M. Khaleghi, “Anomaly intrusion detection system using Gaussian mixture model,” in 3rd Int. Conf. Converg. Hybrid Inf. Technol. (ICCIT), vol. 1, Nov. 2008, pp. 1162–1167.
[85] W. Chimphlee, A. H. Abdullah, M. N. M. Sap, S. Srinoy, and S. Chimphlee, “Anomaly-based intrusion detection using fuzzy rough clustering,” in Proc. Int. Conf. Hybrid Inf. Technol. (ICHIT), vol. 1, Nov. 2006, pp. 329–334.
[86] C. Marquez, M. Gramaglia, M. Fiore, A. Banchs, C. Ziemlicki, and Z. Smoreda, “Not all apps are created equal: Analysis of spatiotemporal heterogeneity in nationwide mobile service usage,” in Proc. 13th Int. Conf. Emerg. Netw. Exp. Technol., 2017, pp. 180–186.
[87] M. Adda, K. Qader, and M. Al-Kasassbeh, “Comparative analysis of clustering techniques in network traffic faults classification,” Int. J. Innov. Res. Comput. Commun. Eng., vol. 5, no. 4, pp. 6551–6563, 2017.
[88] A. Vlăduţu, D. Comăneci, and C. Dobre, “Internet traffic classification based on flows’ statistical properties with machine learning,” Int. J. Netw. Manage., vol. 27, no. 3, p. e1929, 2017.
[89] J. Liu, Y. Fu, J. Ming, Y. Ren, L. Sun, and H. Xiong, “Effective and real-time in-app activity analysis in encrypted Internet traffic streams,” in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2017, pp. 335–344.


[90] M. S. Parwez, D. Rawat, and M. Garuba, “Big data analytics for user-activity analysis and user-anomaly detection in mobile wireless network,” IEEE Trans. Ind. Informat., vol. 13, no. 4, pp. 2058–2065, Aug. 2017.
[91] T. Lorido-Botran, S. Huerta, L. Tomás, J. Tordsson, and B. Sanz, “An unsupervised approach to online noisy-neighbor detection in cloud data centers,” Expert Syst. Appl., vol. 89, pp. 188–204, Dec. 2017.
[92] G. Frishman, Y. Ben-Itzhak, and O. Margalit, “Cluster-based load balancing for better network security,” in Proc. Workshop Big Data Anal. Mach. Learn. Data Commun. Netw., 2017, pp. 7–12.
[93] G. R. Kumar, N. Mangathayaru, and G. Narsimha, “A feature clustering based dimensionality reduction for intrusion detection (FCBDR),” Int. J. Comput. Sci. Inf. Syst., vol. 12, no. 1, pp. 26–44, 2017.
[94] T. Wiradinata and A. S. Paramita, “Clustering and feature selection technique for improving Internet traffic classification using K-NN,” J. Adv. Comput. Netw., Singapore, vol. 4, no. 1, Mar. 2016.
[95] C. M. Bishop, “Latent variable models,” in Learning in Graphical Models. Dordrecht, The Netherlands: Springer, 1998, pp. 371–403.
[96] A. Skrondal and S. Rabe-Hesketh, “Latent variable modelling: A survey,” Scandin. J. Statist., vol. 34, no. 4, pp. 712–745, 2007.
[97] C. M. Bishop, Neural Networks for Pattern Recognition. London, U.K.: Oxford Univ. Press, 1995.
[98] J. Josse and F. Husson, “Selecting the number of components in principal component analysis using cross-validation approximations,” Comput. Statist. Data Anal., vol. 56, no. 6, pp. 1869–1879, 2012.
[99] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Netw., vol. 13, no. 4, pp. 411–430, 2000.
[100] Y.-X. Wang and Y.-J. Zhang, “Nonnegative matrix factorization: A comprehensive review,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 6, pp. 1336–1353, Jun. 2013.
[101] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 556–562.
[102] M. O. Duff, “Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes,” Ph.D. dissertation, Univ. Massachusetts Amherst, Amherst, MA, USA, 2002.
[103] M. J. Beal, Variational Algorithms for Approximate Bayesian Inference. London, U.K.: Univ. of London, 2003.
[104] T. P. Minka, “A family of algorithms for approximate Bayesian inference,” Ph.D. dissertation, Massachusetts Inst. Technol., Cambridge, MA, USA, 2001.
[105] H. Wang and D.-Y. Yeung. (2016). “Towards Bayesian deep learning: A survey.” [Online]. Available: https://arxiv.org/abs/1604.01662
[106] C. DuBois, J. R. Foulds, and P. Smyth, “Latent set models for two-mode network data,” in Proc. ICWSM, 2011, pp. 1–8.
[107] J. R. Foulds, C. DuBois, A. U. Asuncion, C. T. Butts, and P. Smyth, “A dynamic relational infinite feature model for longitudinal social networks,” in Proc. AISTATS, vol. 11, 2011, pp. 287–295.
[108] J.-A. Hernández and I. W. Phillips, “Weibull mixture model to characterise end-to-end Internet delay at coarse time-scales,” IEE Proc.-Commun., vol. 153, no. 2, pp. 295–304, 2006.
[109] J. M. Agosta, J. Chandrashekar, M. Crovella, N. Taft, and D. Ting, “Mixture models of endhost network traffic,” in Proc. IEEE INFOCOM, Apr. 2013, pp. 225–229.
[110] R. Yan and R. Liu, “Principal component analysis based network traffic classification,” J. Comput., vol. 9, no. 5, pp. 1234–1240, 2014.
[111] X. Xu and X. Wang, “An adaptive network intrusion detection method based on PCA and support vector machines,” in Proc. Int. Conf. Adv. Data Mining Appl. Berlin, Germany: Springer, 2005, pp. 696–703.
[112] X. Guan, W. Wang, and X. Zhang, “Fast intrusion detection based on a non-negative matrix factorization model,” J. Netw. Comput. Appl., vol. 32, no. 1, pp. 31–44, 2009.
[113] Z. Albataineh and F. Salem, “New blind multiuser detection in DS-CDMA based on extension of efficient fast independent component analysis (EF-ICA),” in Proc. 4th Int. Conf. Intell. Syst., Modelling Simulation, Jan. 2013, pp. 543–548.
[114] N. Ahmed, S. S. Kanhere, and S. Jha, “Probabilistic coverage in wireless sensor networks,” in Proc. 30th Anniversary IEEE Conf. Local Comput. Netw., Nov. 2005, pp. 1–8.
[115] V. Chatzigiannakis, S. Papavassiliou, M. Grammatikou, and B. Maglaris, “Hierarchical anomaly detection in distributed large-scale sensor networks,” in Proc. 11th IEEE Symp. Comput. Commun. (ISCC), Jun. 2006, pp. 761–767.
[116] R. Gu, H. Wang, and Y. Ji, “Early traffic identification using Bayesian networks,” in Proc. 2nd IEEE Int. Conf. Netw. Infrastruct. Digit. Content, Sep. 2010, pp. 564–568.
[117] J. Xu and C. R. Shelton, “Continuous time Bayesian networks for host level network intrusion detection,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Berlin, Germany: Springer, 2008, pp. 613–627.
[118] N. Al-Rousan, S. Haeri, and L. Trajković, “Feature selection for classification of BGP anomalies using Bayesian models,” in Proc. Int. Conf. Mach. Learn. (ICMLC), vol. 1, Jul. 2012, pp. 140–147.
[119] D.-P. Liu, M.-W. Zhang, and T. Li, “Network traffic analysis using refined Bayesian reasoning to detect flooding and port scan attacks,” in Proc. Int. Conf. Adv. Comput. Theory Eng. (ICACTE), Dec. 2008, pp. 1000–1004.
[120] M. Ishiguro, H. Suzuki, I. Murase, and H. Ohno, “Internet threat detection system using Bayesian estimation,” in Proc. 16th Annu. Comput. Secur. Incident Handling Conf., 2004, pp. 1–5.
[121] D. Janakiram, V. A. Reddy, and A. P. Kumar, “Outlier detection in wireless sensor networks using Bayesian belief networks,” in Proc. 1st Int. Conf. Commun. Syst. Softw. Middleware (Comsware), Jan. 2006, pp. 1–6.
[122] S. Haykin, K. Huber, and Z. Chen, “Bayesian sequential state estimation for MIMO wireless communications,” Proc. IEEE, vol. 92, no. 3, pp. 439–454, Mar. 2004.
[123] S. Ito and N. Kawaguchi, “Bayesian based location estimation system using wireless LAN,” in Proc. 3rd IEEE Int. Conf. Pervasive Comput. Commun. Workshops (PerCom), Mar. 2005, pp. 273–278.
[124] S. Liu, J. Hu, S. Hao, and T. Song, “Improved EM method for Internet traffic classification,” in Proc. 8th Int. Conf. Knowl. Smart Technol. (KST), Feb. 2016, pp. 13–17.
[125] H. Shi, H. Li, D. Zhang, C. Cheng, and W. Wu, “Efficient and robust feature extraction and selection for traffic classification,” Comput. Netw., vol. 119, pp. 1–16, Jun. 2017.
[126] S. Troia, G. Sheng, R. Alvizu, G. A. Maier, and A. Pattavina, “Identification of tidal-traffic patterns in metro-area mobile networks via matrix factorization based model,” in Proc. IEEE Int. Conf. Pervasive Comput. Commun. Workshops (PerCom Workshops), Mar. 2017, pp. 297–301.
[127] L. Nie, D. Jiang, and Z. Lv, “Modeling network traffic for traffic matrix estimation and anomaly detection based on Bayesian network in cloud computing networks,” Ann. Telecommun., vol. 72, nos. 5–6, pp. 297–305, 2017.
[128] J.-H. Bang, Y.-J. Cho, and K. Kang, “Anomaly detection of network-initiated LTE signaling traffic in wireless sensor and actuator networks based on a hidden semi-Markov model,” Comput. Secur., vol. 65, pp. 108–120, Mar. 2017.
[129] X. Chen, K. Irie, D. Banks, R. Haslinger, J. Thomas, and M. West, “Scalable Bayesian modeling, monitoring, and analysis of dynamic network flow data,” J. Amer. Stat. Assoc., vol. 113, no. 522, pp. 519–533, 2017.
[130] B. Mokhtar and M. Eltoweissy, “Big data and semantics management system for computer networks,” Ad Hoc Netw., vol. 57, pp. 32–51, Mar. 2017.
[131] A. Furno, M. Fiore, and R. Stanica, “Joint spatial and temporal classification of mobile traffic demands,” in Proc. 36th Annu. IEEE Int. Conf. Comput. Commun. (INFOCOM), May 2017, pp. 1–9.
[132] M. Malli, N. Said, and A. Fadlallah, “A new model for rating users’ profiles in online social networks,” Comput. Inf. Sci., vol. 10, no. 2, p. 39, 2017.
[133] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[134] E. Keogh and A. Mueen, “Curse of dimensionality,” in Encyclopedia of Machine Learning. Boston, MA, USA: Springer, 2010, pp. 257–258.
[135] P. Pudil and J. Novovičová, “Novel methods for feature subset selection with respect to problem knowledge,” in Feature Extraction, Construction Selection. Boston, MA, USA: Springer, 1998, pp. 101–116.
[136] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proc. Int. Conf. ICML, vol. 3, 2003, pp. 856–863.
[137] W. M. Hartmann, “Dimension reduction vs. variable selection,” in Proc. Int. Workshop Appl. Parallel Comput. Berlin, Germany: Springer, 2004, pp. 931–938.
[138] I. K. Fodor, “A survey of dimension reduction techniques,” Lawrence Livermore Nat. Lab., Berkeley, CA, USA, Tech. Rep. UCRL-ID-148494, 2002.
[139] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.


[140] C. M. Bishop, M. Svensén, and C. K. I. Williams, ‘‘GTM: The generative topographic mapping,’’ Neural Comput., vol. 10, no. 1, pp. 215–234, 1998.
[141] T. Hastie and W. Stuetzle, ‘‘Principal curves,’’ J. Amer. Statist. Assoc., vol. 84, no. 406, pp. 502–516, 1989.
[142] D. Lee. (2002). Estimations of Principal Curves. [Online]. Available: http://www.dgp.toronto.edu/~dwlee/pcurve/pcurve_csc2515.pdf
[143] J. B. Kruskal, ‘‘Nonmetric multidimensional scaling: A numerical method,’’ Psychometrika, vol. 29, no. 2, pp. 115–129, Jun. 1964.
[144] L. van der Maaten and G. Hinton, ‘‘Visualizing data using t-SNE,’’ J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[145] J. Cao, Z. Fang, G. Qu, H. Sun, and D. Zhang, ‘‘An accurate traffic classification model based on support vector machines,’’ Int. J. Netw. Manage., vol. 27, no. 1, p. e1962, 2017.
[146] W. Zhou, X. Zhou, S. Dong, and B. Lubomir, ‘‘A SOM and PNN model for network traffic classification,’’ Boletín Técnico, vol. 55, no. 1, pp. 174–182, 2017.
[147] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, ‘‘High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning,’’ Pattern Recognit., vol. 58, pp. 121–134, Oct. 2016.
[148] M. Nicolau and J. McDermott, ‘‘A hybrid autoencoder and density estimation model for anomaly detection,’’ in Proc. Int. Conf. Parallel Problem Solving Nature, 2016, pp. 717–726.
[149] S. T. Ikram and A. K. Cherukuri, ‘‘Improving accuracy of intrusion detection model using PCA and optimized SVM,’’ J. Comput. Inf. Technol., vol. 24, no. 2, pp. 133–148, 2016.
[150] J. Moysen, L. Giupponi, and J. Mangues-Bafalluy, ‘‘A mobile network planning tool based on data analytics,’’ Mobile Inf. Syst., vol. 2017, Feb. 2017, Art. no. 6740585.
[151] S. A. Ossia et al. (2017). ‘‘A hybrid deep learning architecture for privacy-preserving mobile analytics.’’ [Online]. Available: https://arxiv.org/abs/1703.02952
[152] S. Rajendran, W. Meert, D. Giustiniano, V. Lenders, and S. Pollin. (2017). ‘‘Deep learning models for wireless signal classification with distributed low-cost spectrum sensors.’’ [Online]. Available: https://arxiv.org/abs/1707.08908
[153] M. H. Sarshar, ‘‘Analyzing large scale Wi-Fi data using supervised and unsupervised learning techniques,’’ Ph.D. dissertation, Dalhousie Univ., Halifax, NS, Canada, 2017.
[154] S. Ramaswamy, R. Rastogi, and K. Shim, ‘‘Efficient algorithms for mining outliers from large data sets,’’ ACM SIGMOD Rec., vol. 29, no. 2, pp. 427–438, 2000.
[155] J. Tang et al., ‘‘Enhancing effectiveness of outlier detections for low density patterns,’’ in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining. Berlin, Germany: Springer, 2002, pp. 535–548.
[156] W. Jin, A. Tung, J. Han, and W. Wang, ‘‘Ranking outliers using symmetric neighborhood relationship,’’ in Proc. Adv. Knowl. Discovery Data Mining, 2006, pp. 577–593.
[157] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, ‘‘LoOP: Local outlier probabilities,’’ in Proc. 18th ACM Conf. Inf. Knowl. Manage., 2009, pp. 1649–1652.
[158] Z. He, X. Xu, and S. Deng, ‘‘Discovering cluster-based local outliers,’’ Pattern Recognit. Lett., vol. 24, nos. 9–10, pp. 1641–1650, 2003.
[159] M. Goldstein and S. Uchida, ‘‘Behavior analysis using unsupervised anomaly detection,’’ in Proc. 10th Joint Workshop Mach. Perception Robot. (MPR), 2014, pp. 1–6.
[160] M. Goldstein and S. Uchida, ‘‘A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data,’’ PLoS ONE, vol. 11, no. 4, 2016, Art. no. e0152173.
[161] M. Goldstein and A. Dengel, ‘‘Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm,’’ in Proc. KI-Poster Demo Track, 2012, pp. 59–63.
[162] V. Chandola, A. Banerjee, and V. Kumar, ‘‘Anomaly detection: A survey,’’ ACM Comput. Surv., vol. 41, no. 3, p. 15, 2009.
[163] M. Shafiq, X. Yu, A. A. Laghari, L. Yao, N. K. Karn, and F. Abdessamia, ‘‘Network traffic classification techniques and comparative analysis using machine learning algorithms,’’ in Proc. 2nd IEEE Int. Conf. Comput. Commun. (ICCC), Oct. 2016, pp. 2451–2455.
[164] Y. Dhote, S. Agrawal, and A. J. Deen, ‘‘A survey on feature selection techniques for Internet traffic classification,’’ in Proc. Int. Conf. Comput. Intell. Commun. Netw. (CICN), Dec. 2015, pp. 1375–1380.
[165] Y. Huang, Y. Li, and B. Qiang, ‘‘Internet traffic classification based on min-max ensemble feature selection,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 3485–3492.
[166] L. Zhen and L. Qiong, ‘‘A new feature selection method for Internet traffic classification using ML,’’ Phys. Procedia, vol. 33, pp. 1338–1345, Jan. 2012.
[167] J. Zhang, Y. Xiang, Y. Wang, W. Zhou, Y. Xiang, and Y. Guan, ‘‘Network traffic classification using correlation information,’’ IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 1, pp. 104–117, Jan. 2013.
[168] J. Erman, A. Mahanti, and M. Arlitt, ‘‘Qrp05-4: Internet traffic identification using machine learning,’’ in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Nov. 2006, pp. 1–6.
[169] J. Kornycky, O. Abdul-Hameed, A. Kondoz, and B. C. Barber, ‘‘Radio frequency traffic classification over WLAN,’’ IEEE Trans. Netw., vol. 25, no. 1, pp. 56–68, May 2016.
[170] X. Liu, L. Pan, and X. Sun, ‘‘Real-time traffic status classification based on Gaussian mixture model,’’ in Proc. IEEE Int. Conf. Data Sci. Cyberspace (DSC), Jun. 2016, pp. 573–578.
[171] J. Erman, M. Arlitt, and A. Mahanti, ‘‘Traffic classification using clustering algorithms,’’ in Proc. SIGCOMM Workshop Mining Netw. Data, 2006, pp. 281–286.
[172] T. T. Nguyen and G. Armitage, ‘‘Clustering to assist supervised machine learning for real-time IP traffic classification,’’ in Proc. IEEE Int. Conf. Commun. (ICC), May 2008, pp. 5857–5862.
[173] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, ‘‘Offline/realtime traffic classification using semi-supervised learning,’’ Perform. Eval., vol. 64, nos. 9–12, pp. 1194–1213, Oct. 2007.
[174] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, ‘‘Traffic classification on the fly,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 36, no. 2, pp. 23–26, 2006.
[175] S. Zander, T. Nguyen, and G. Armitage, ‘‘Automated traffic classification and application identification using machine learning,’’ in Proc. 30th Anniversary IEEE Conf. Local Comput. Netw. (LCN), Nov. 2005, pp. 250–257.
[176] T. P. Oliveira, J. S. Barbar, and A. S. Soares, ‘‘Computer network traffic prediction: A comparison between traditional and deep learning neural networks,’’ Int. J. Big Data Intell., vol. 3, no. 1, pp. 28–37, 2016.
[177] N. Shrivastava and A. Dubey, ‘‘Internet traffic data categorization using particle of swarm optimization algorithm,’’ in Proc. Symp. Colossal Data Anal. Netw. (CDAN), Mar. 2016, pp. 1–8.
[178] T. Bakhshi and B. Ghita, ‘‘On Internet traffic classification: A two-phased machine learning approach,’’ J. Comput. Netw. Commun., vol. 2016, May 2016, Art. no. 2048302.
[179] J. Yang, J. Deng, S. Li, and Y. Hao, ‘‘Improved traffic detection with support vector machine based on restricted Boltzmann machine,’’ Soft Comput., vol. 21, no. 11, pp. 3101–3112, 2017.
[180] R. Gonzalez et al. (2017). ‘‘Net2Vec: Deep learning for the network.’’ [Online]. Available: https://arxiv.org/abs/arXiv:1705.03881
[181] M. E. Aminanto and K. Kim. (2016). Deep Learning-Based Feature Selection for Intrusion Detection System in Transport Layer. [Online]. Available: http://caislab.kaist.ac.kr/publication/paper_files/2016/20160623_AM.pdf
[182] L. Nie, D. Jiang, S. Yu, and H. Song, ‘‘Network traffic prediction based on deep belief network in wireless mesh backbone networks,’’ in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Mar. 2017, pp. 1–5.
[183] C. Zhang, J. Jiang, and M. Kamel, ‘‘Intrusion detection using hierarchical neural networks,’’ Pattern Recognit. Lett., vol. 26, no. 6, pp. 779–791, 2005.
[184] B. C. Rhodes, J. A. Mahaffey, and J. D. Cannady, ‘‘Multiple self-organizing maps for intrusion detection,’’ in Proc. 23rd Nat. Inf. Syst. Secur. Conf., 2000, pp. 16–19.
[185] H. G. Kayacik, A. N. Zincir-Heywood, and M. I. Heywood, ‘‘On the capability of an SOM based intrusion detection system,’’ in Proc. Int. Joint Conf. Neural Netw., vol. 3, 2003, pp. 1808–1813.
[186] S. Zanero, ‘‘Analyzing TCP traffic patterns using self organizing maps,’’ in Proc. Int. Conf. Image Anal. Process. Berlin, Germany: Springer, 2005, pp. 83–90.
[187] P. Lichodzijewski, A. N. Zincir-Heywood, and M. I. Heywood, ‘‘Host-based intrusion detection using self-organizing maps,’’ in Proc. IEEE Int. Joint Conf. Neural Netw., 2002, pp. 1714–1719.
[188] P. Lichodzijewski, A. N. Zincir-Heywood, and M. I. Heywood, ‘‘Dynamic intrusion detection using self-organizing maps,’’ in Proc. 14th Annu. Can. Inf. Technol. Secur. Symp. (CITSS), 2002, pp. 1–5.


[189] M. Amini, R. Jalili, and H. R. Shahriari, ‘‘RT-UNNID: A practical solution to real-time network-based intrusion detection using unsupervised neural networks,’’ Comput. Secur., vol. 25, no. 6, pp. 459–468, 2006.
[190] O. Depren, M. Topallar, E. Anarim, and M. K. Ciliz, ‘‘An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks,’’ Expert Syst. Appl., vol. 29, no. 4, pp. 713–722, 2005.
[191] V. Golovko and L. Vaitsekhovich, ‘‘Neural network techniques for intrusion detection,’’ in Proc. Int. Conf. Neural Netw. Artif. Intell., 2006, pp. 65–69.
[192] A. P. Muniyandi, R. Rajeswari, and R. Rajaram, ‘‘Network anomaly detection by cascading k-means clustering and C4.5 decision tree algorithm,’’ Procedia Eng., vol. 30, pp. 174–182, Jan. 2012.
[193] P. Casas, J. Mazel, and P. Owezarski, ‘‘Unsupervised network intrusion detection systems: Detecting the unknown without knowledge,’’ Comput. Commun., vol. 35, no. 7, pp. 772–783, 2012.
[194] S. Zanero and S. M. Savaresi, ‘‘Unsupervised learning techniques for an intrusion detection system,’’ in Proc. ACM Symp. Appl. Comput., 2004, pp. 412–419.
[195] S. Zhong, T. M. Khoshgoftaar, and N. Seliya, ‘‘Clustering-based network intrusion detection,’’ Int. J. Rel., Qual., Saf. Eng., vol. 14, no. 2, pp. 169–187, 2007.
[196] N. Greggio, ‘‘Anomaly Detection in IDSs by means of unsupervised greedy learning of finite mixture models,’’ Soft Comput., vol. 22, no. 10, pp. 3357–3372, 2017.
[197] W. Wang and R. Battiti, ‘‘Identifying intrusions in computer networks with principal component analysis,’’ in Proc. 1st Int. Conf. Availability, Rel. Secur. (ARES), Apr. 2006, pp. 1–8.
[198] V. A. Golovko, L. U. Vaitsekhovich, P. A. Kochurko, and U. S. Rubanau, ‘‘Dimensionality reduction and attack recognition using neural network approaches,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Aug. 2007, pp. 2734–2739.
[199] L. A. Gordon, M. P. Loeb, W. Lucyshyn, and R. Richardson, ‘‘2005 CSI/FBI computer crime and security survey,’’ Comput. Secur. J., vol. 21, no. 3, p. 1, 2005.
[200] Symantec. (2016). Internet Security Threat Report. Accessed: Feb. 2, 2017. [Online]. Available: https://www.symantec.com/security-center/threat-report
[201] C.-F. Tsai, Y.-F. Hsu, C.-Y. Lin, and W.-Y. Lin, ‘‘Intrusion detection by machine learning: A review,’’ Expert Syst. Appl., vol. 36, no. 10, pp. 11994–12000, 2009.
[202] W.-C. Lin, S.-W. Ke, and C.-F. Tsai, ‘‘CANN: An intrusion detection system based on combining cluster centers and nearest neighbors,’’ Knowl.-Based Syst., vol. 78, pp. 13–21, Apr. 2015.
[203] J. Mazel, P. Casas, R. Fontugne, K. Fukuda, and P. Owezarski, ‘‘Hunting attacks in the dark: Clustering and correlation analysis for unsupervised anomaly detection,’’ Int. J. Netw. Manage., vol. 25, no. 5, pp. 283–305, 2015.
[204] C. Sony and K. Cho, ‘‘Traffic data repository at the WIDE project,’’ in Proc. USENIX Annu. Tech. Conf., FREENIX Track, 2000, pp. 263–270.
[205] E. E. Papalexakis, A. Beutel, and P. Steenkiste, ‘‘Network anomaly detection using co-clustering,’’ in Encyclopedia of Social Network Analysis and Mining. New York, NY, USA: Springer, 2014, pp. 1054–1068.
[206] V. Miškovic et al., ‘‘Application of hybrid incremental machine learning methods to anomaly based intrusion detection,’’ in Proc. 1st Int. Conf. Elect., Electron. Comput. Eng. (IcETRAN), Vrnjačka Banja, Serbia, Jun. 2014.
[207] N. F. Haq et al., ‘‘Application of machine learning approaches in intrusion detection system: A survey,’’ Int. J. Adv. Res. Artif. Intell., vol. 4, no. 3, pp. 9–18, 2015.
[208] F. Hosseinpour, P. V. Amoli, F. Farahnakian, J. Plosila, and T. Hämäläinen, ‘‘Artificial immune system based intrusion detection: Innate immunity using an unsupervised learning approach,’’ Int. J. Digit. Content Technol. Appl., vol. 8, no. 5, p. 1, 2014.
[209] G. K. Chaturvedi, A. K. Chaturvedi, and V. R. More, ‘‘A study of intrusion detection system for cloud network using FC-ANN algorithm,’’ Tech. Rep., 2016.
[210] C. Modi, D. Patel, B. Borisaniya, H. Patel, A. Patel, and M. Rajarajan, ‘‘A survey of intrusion detection techniques in cloud,’’ J. Netw. Comput. Appl., vol. 36, no. 1, pp. 42–57, 2013.
[211] D. J. Weller-Fahy, B. J. Borghetti, and A. A. Sodemann, ‘‘A survey of distance and similarity measures used within network intrusion anomaly detection,’’ IEEE Commun. Surveys Tuts., vol. 17, no. 1, pp. 70–91, 1st Quart., 2015.
[212] R. Mitchell and I.-R. Chen, ‘‘A survey of intrusion detection in wireless network applications,’’ Comput. Commun., vol. 42, no. 3, pp. 1–23, Apr. 2014.
[213] M. Ahmed, A. N. Mahmood, and J. Hu, ‘‘A survey of network anomaly detection techniques,’’ J. Netw. Comput. Appl., vol. 60, pp. 19–31, Jan. 2016.
[214] L. Xiao, Y. Chen, and C. K. Chang, ‘‘Bayesian model averaging of Bayesian network classifiers for intrusion detection,’’ in Proc. IEEE 38th Int. Comput. Softw. Appl. Conf. Workshops (COMPSACW), Jul. 2014, pp. 128–133.
[215] B. Al-Musawi, P. Branch, and G. Armitage, ‘‘BGP anomaly detection techniques: A survey,’’ IEEE Commun. Surveys Tuts., vol. 19, no. 1, pp. 377–396, 1st Quart., 2017.
[216] B. AsSadhan, K. Zeb, J. Al-Muhtadi, and S. Alshebeili, ‘‘Anomaly detection based on LRD behavior analysis of decomposed control and data planes network traffic using SOSS and FARIMA models,’’ IEEE Access, vol. 5, pp. 13501–13519, 2017.
[217] A. Kulakov, D. Davcev, and G. Trajkovski, ‘‘Application of wavelet neural-networks in wireless sensor networks,’’ in Proc. 6th Int. Conf. Softw. Eng., Artif. Intell., Netw. Parallel/Distrib. Comput., 1st ACIS Int. Workshop Self-Assembling Wireless Netw. (SNPD/SAWN), May 2005, pp. 262–267.
[218] S. G. Akojwar and R. M. Patrikar, ‘‘Improving life time of wireless sensor networks using neural network based classification techniques with cooperative routing,’’ Int. J. Commun., vol. 2, no. 1, pp. 75–86, 2008.
[219] C. Li, X. Xie, Y. Huang, H. Wang, and C. Niu, ‘‘Distributed data mining based on deep neural network for wireless sensor network,’’ Int. J. Distrib. Sensor Netw., vol. 11, no. 7, p. 157453, 2014.
[220] E. Gelenbe, R. Lent, A. Montuori, and Z. Xu, ‘‘Cognitive packet networks: QoS and performance,’’ in Proc. 10th IEEE Int. Symp. Modeling, Anal. Simulation Comput. Telecommun. Syst. (MASCOTS), Oct. 2002, pp. 3–9.
[221] M. Cordina and C. J. Debono, ‘‘Increasing wireless sensor network lifetime through the application of SOM neural networks,’’ in Proc. 3rd Int. Symp. Commun., Control Signal Process. (ISCCSP), Mar. 2008, pp. 467–471.
[222] N. Enami and R. A. Moghadam, ‘‘Energy based clustering self organizing map protocol for extending wireless sensor networks lifetime and coverage,’’ Can. J. Multimedia Wireless Netw., vol. 1, no. 4, pp. 42–54, 2010.
[223] L. Dehni, F. Krief, and Y. Bennani, ‘‘Power control and clustering in wireless sensor networks,’’ in Proc. IFIP Annu. Medit. Ad Hoc Netw. Workshop. Boston, MA, USA: Springer, 2005, pp. 31–40.
[224] F. Oldewurtel and P. Mahonen, ‘‘Neural wireless sensor networks,’’ in Proc. Int. Conf. Syst. Netw. Commun. (ICSNC), Oct. 2006, p. 28.
[225] G. A. Barreto, J. C. M. Mota, L. G. M. Souza, R. A. Frota, and L. Aguayo, ‘‘Condition monitoring of 3G cellular networks through competitive neural models,’’ IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1064–1075, Sep. 2005.
[226] A. I. Moustapha and R. R. Selmic, ‘‘Wireless sensor network modeling using modified recurrent neural networks: Application to fault detection,’’ IEEE Trans. Instrum. Meas., vol. 57, no. 5, pp. 981–988, May 2008.
[227] D. C. Hoang, R. Kumar, and S. K. Panda, ‘‘Fuzzy C-means clustering protocol for wireless sensor networks,’’ in Proc. IEEE Int. Symp. Ind. Electron. (ISIE), Jul. 2010, pp. 3477–3482.
[228] E. I. Oyman and C. Ersoy, ‘‘Multiple sink network design problem in large scale wireless sensor networks,’’ in Proc. IEEE Int. Conf. Commun. (ICC), vol. 6, Jun. 2004, pp. 3663–3667.
[229] W. Zhang, S. K. Das, and Y. Liu, ‘‘A trust based framework for secure data aggregation in wireless sensor networks,’’ in Proc. 3rd Annu. IEEE Commun. Soc. Sensor Ad Hoc Commun. Netw. (SECON), vol. 1, Sep. 2006, pp. 60–69.
[230] G. Kapoor and K. Rajawat, ‘‘Outlier-aware cooperative spectrum sensing in cognitive radio networks,’’ Phys. Commun., vol. 17, pp. 118–127, Dec. 2015.
[231] T. Ristaniemi and J. Joutsensalo, ‘‘Advanced ICA-based receivers for block fading DS-CDMA channels,’’ Signal Process., vol. 82, no. 3, pp. 417–431, Mar. 2002.
[232] M. S. Mushtaq, B. Augustin, and A. Mellouk, ‘‘Empirical study based on machine learning approach to assess the QoS/QoE correlation,’’ in Proc. 17th Eur. Conf. Netw. Opt. Commun. (NOC), Jun. 2012, pp. 1–7.
[233] M. Alreshoodi and J. Woods, ‘‘Survey on QoE\QoS correlation models for multimedia services,’’ Int. J. Distrib. Parallel Syst., vol. 4, no. 3, p. 53, 2013.


[234] A. Testolin, M. Zanforlin, M. De Filippo De Grazia, D. Munaretto, A. Zanella, and M. Zorzi, ‘‘A machine learning approach to QoE-based video admission control and resource allocation in wireless systems,’’ in Proc. 13th Annu. Medit. Ad Hoc Netw. Workshop (MED-HOC-NET), Jun. 2014, pp. 31–38.
[235] S. Przylucki, ‘‘Assessment of the QoE in voice services based on the self-organizing neural network structure,’’ in Proc. Int. Conf. Comput. Netw. Berlin, Germany: Springer, 2011, pp. 144–153.
[236] P. Ahammad, B. Kennedy, P. Ganti, and H. Kolam, ‘‘QoE-driven unsupervised image categorization for optimized Web delivery: Short paper,’’ in Proc. ACM Int. Conf. Multimedia, 2014, pp. 797–800.
[237] D. C. Mocanu, G. Exarchakos, and A. Liotta, ‘‘Deep learning for objective quality assessment of 3D images,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2014, pp. 758–762.
[238] B. Francisco, A. Ramon, P.-R. Jordi, and S. Oriol, ‘‘Distributed spectrum management based on reinforcement learning,’’ in Proc. 14th Int. Conf. Cognit. Radio Oriented Wireless Netw. Commun., 2009, pp. 1–6.
[239] L. Xuedong, C. Min, X. Yang, B. Llangko, and L. C. M. Victo, ‘‘MRL-CC: A novel cooperative communication protocol for QoS provisioning in wireless sensor networks,’’ Int. J. Sensor Netw., vol. 8, no. 2, pp. 98–108, 2010.
[240] P. Geurts, I. El Khayat, and G. Leduc, ‘‘A machine learning approach to improve congestion control over wireless computer networks,’’ in Proc. 4th IEEE Int. Conf. Data Mining (ICDM), Nov. 2004, pp. 383–386.
[241] K.-S. Hwang, S.-W. Tan, M.-C. Hsiao, and C.-S. Wu, ‘‘Cooperative multiagent congestion control for high-speed networks,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 2, pp. 255–268, Apr. 2005.
[242] K. Winstein and H. Balakrishnan, ‘‘TCP ex machina: Computer-generated congestion control,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 123–134, 2013.
[243] T. J. O’Shea and J. Hoydis. (2017). ‘‘An introduction to machine learning communications systems.’’ [Online]. Available: https://arxiv.org/abs/1702.00832
[244] T. J. O’Shea, T. Erpek, and T. C. Clancy. (2017). ‘‘Deep learning based MIMO communications.’’ [Online]. Available: https://arxiv.org/abs/1707.07980
[245] T. J. O’Shea, J. Corgan, and T. C. Clancy, ‘‘Unsupervised representation learning of structured radio communication signals,’’ in Proc. 1st Int. Workshop Sens., Process. Learn. Intell. Mach. (SPLINE), Jul. 2016, pp. 1–5.
[246] T. Huang, H. Sethu, and N. Kandasamy, ‘‘A new approach to dimensionality reduction for anomaly detection in data traffic,’’ IEEE Trans. Netw. Service Manage., vol. 13, no. 3, pp. 651–665, Sep. 2016.
[247] A. Zoha, A. Saeed, A. Imran, M. A. Imran, and A. Abu-Dayya, ‘‘A learning-based approach for autonomous outage detection and coverage optimization,’’ Trans. Emerg. Telecommun. Technol., vol. 27, no. 3, pp. 439–450, 2016.
[248] A. Shirazinia and S. Dey, ‘‘Power-constrained sparse Gaussian linear dimensionality reduction over noisy channels,’’ IEEE Trans. Signal Process., vol. 63, no. 21, pp. 5837–5852, Nov. 2015.
[249] S. Hou, R. C. Qiu, Z. Chen, and Z. Hu. (2011). ‘‘SVM and dimensionality reduction in cognitive radio with experimental validation.’’ [Online]. Available: https://arxiv.org/abs/1106.2325
[250] C. Khalid, E. Zyad, and B. Mohammed, ‘‘Network intrusion detection system using L1-norm PCA,’’ in Proc. 11th Int. Conf. Inf. Assurance Secur. (IAS), Dec. 2015, pp. 118–122.
[251] E. Goodman, J. Ingram, S. Martin, and D. Grunwald, ‘‘Using bipartite anomaly features for cyber security applications,’’ in Proc. IEEE 14th Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2015, pp. 301–306.
[252] N. Patwari, A. O. Hero, III, and A. Pacholski, ‘‘Manifold learning visualization of network traffic data,’’ in Proc. ACM SIGCOMM Workshop Mining Netw. Data, 2005, pp. 191–196.
[253] D. López-Sánchez, A. G. Arrieta, and J. M. Corchado, ‘‘Deep neural networks and transfer learning applied to multimedia Web mining,’’ in Proc. Int. Symp. Distrib. Comput. Artif. Intell. Cham, Switzerland: Springer, 2017, p. 124.
[254] T. Ban, S. Pang, M. Eto, D. Inoue, K. Nakao, and R. Huang, ‘‘Towards early detection of novel attack patterns through the lens of a large-scale darknet,’’ in Proc. Int. IEEE Conf. Ubiquitous Intell. Comput., Adv. Trusted Comput., Scalable Comput. Commun., Cloud Big Data Comput., Internet People, Smart World Congr. (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Jul. 2016, pp. 341–349.
[255] Q. Niyaz, W. Sun, and A. Y. Javaid. (2016). ‘‘A deep learning based DDoS detection system in software-defined networking (SDN).’’ [Online]. Available: https://arxiv.org/abs/1611.07400
[256] C. G. Cordero, S. Hauke, M. Mühlhäuser, and M. Fischer, ‘‘Analyzing flow-based anomaly intrusion detection using replicator neural networks,’’ in Proc. 14th Annu. Conf. Privacy, Secur. Trust (PST), Dec. 2016, pp. 317–324.
[257] Z. Chen, C. K. Yeo, B. S. Lee, and C. T. Lau, ‘‘A novel anomaly detection system using feature-based MSPCA with sketch,’’ in Proc. 26th Wireless Opt. Commun. Conf. (WOCC), Apr. 2017, pp. 1–6.
[258] T. Matsuda, T. Morita, T. Kudo, and T. Takine, ‘‘Traffic anomaly detection based on robust principal component analysis using periodic traffic behavior,’’ IEICE Trans. Commun., vol. 100, no. 5, pp. 749–761, 2017.
[259] I. S. Thaseen and C. A. Kumar, ‘‘Intrusion detection model using fusion of PCA and optimized SVM,’’ in Proc. Int. Conf. Contemp. Comput. Inform. (IC3I), Nov. 2014, pp. 879–884.
[260] B. Subba, S. Biswas, and S. Karmakar, ‘‘Enhancing performance of anomaly based intrusion detection systems through dimensionality reduction using principal component analysis,’’ in Proc. IEEE Int. Conf. Adv. Netw. Telecommun. Syst. (ANTS), Nov. 2016, pp. 1–6.
[261] I. Z. Muttaqien and T. Ahmad, ‘‘Increasing performance of IDS by selecting and transforming features,’’ in Proc. IEEE Int. Conf. Commun., Netw. Satell. (COMNETSAT), Dec. 2016, pp. 85–90.
[262] N. Y. Almusallam et al., ‘‘Dimensionality reduction for intrusion detection systems in multi-data streams—A review and proposal of unsupervised feature selection scheme,’’ in Emergent Computation. Cham, Switzerland: Springer, 2017, pp. 467–487.
[263] Y. Kumar, H. Farooq, and A. Imran, ‘‘Fault prediction and reliability analysis in a real cellular network,’’ in Proc. 13th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), Jun. 2017, pp. 1090–1095.
[264] Z. Nascimento, D. Sadok, S. Fernandes, and J. Kelner, ‘‘Multi-objective optimization of a hybrid model for network traffic classification by combining machine learning techniques,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2014, pp. 2116–2122.
[265] Z. Ansari, M. Azeem, A. V. Babu, and W. Ahmed. (2015). ‘‘A fuzzy approach for feature evaluation and dimensionality reduction to improve the quality of Web usage mining results.’’ [Online]. Available: https://arxiv.org/abs/1509.00690
[266] M. A. Alsheikh, S. Lin, H.-P. Tan, and D. Niyato, ‘‘Toward a robust sparse data representation for wireless sensor networks,’’ in Proc. IEEE 40th Conf. Local Comput. Netw. (LCN), Oct. 2015, pp. 117–124.
[267] K. Labib and V. R. Vemuri, ‘‘An application of principal component analysis to the detection and visualization of computer network attacks,’’ Ann. Telecommun., vol. 61, no. 1, pp. 218–234, 2006.
[268] J. Lokoč et al., ‘‘k-NN classification of malware in HTTPS traffic using the metric space approach,’’ in Proc. Pacific–Asia Workshop Intell. Secur. Inform. Cham, Switzerland: Springer, 2016, pp. 131–145.
[269] M. Ancona, W. Cazzola, S. Drago, and G. Quercini, ‘‘Visualizing and managing network topologies via rectangular dualization,’’ in Proc. 11th IEEE Symp. Comput. Commun. (ISCC), Jun. 2006, pp. 1000–1005.
[270] G. Cherubin et al., ‘‘Conformal clustering and its application to botnet traffic,’’ in Proc. SLDS, 2015, pp. 313–322.
[271] I. Marsh, ‘‘A lightweight measurement platform for home Internet monitoring,’’ in Proc. IEEE IFIP Netw. Conf. (IFIP Netw.) Workshops, 2017.
[272] J. M. Lewis, V. R. De Sa, and L. Van Der Maaten, ‘‘Divvy: Fast and intuitive exploratory data analysis,’’ J. Mach. Learn. Res., vol. 14, no. 1, pp. 3159–3163, 2013.
[273] G. Holmes, A. Donkin, and I. H. Witten, ‘‘Weka: A machine learning workbench,’’ in Proc. 2nd Austral. New Zealand Conf. Intell. Inf. Syst., 1994, pp. 357–361.
[274] Q. Liao and S. Stanczak, ‘‘Network state awareness and proactive anomaly detection in self-organizing networks,’’ in Proc. IEEE Globecom Workshops (GC Wkshps), Dec. 2015, pp. 1–6.
[275] S. Chernov, D. Petrov, and T. Ristaniemi, ‘‘Location accuracy impact on cell outage detection in LTE-A networks,’’ in Proc. Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), Aug. 2015, pp. 1162–1167.
[276] A. Zoha, A. Saeed, A. Imran, M. A. Imran, and A. Abu-Dayya, ‘‘Data-driven analytics for automated cell outage detection in self-organizing networks,’’ in Proc. 11th Int. Conf. Design Reliable Commun. Netw. (DRCN), Mar. 2015, pp. 203–210.


[277] J. Turkka, F. Chernogorov, K. Brigatti, T. Ristaniemi, and J. Lempiäinen, ‘‘An approach for network outage detection from drive-testing databases,’’ J. Comput. Netw. Commun., vol. 2012, Sep. 2012, Art. no. 163184.
[278] N. Kato et al., ‘‘The deep learning vision for heterogeneous network traffic control: Proposal, challenges, and future perspective,’’ IEEE Wireless Commun., vol. 24, no. 3, pp. 146–153, Jun. 2017.
[279] B. Mao et al., ‘‘Routing or computing? The paradigm shift towards intelligent computer network packet transmission based on deep learning,’’ IEEE Trans. Comput., vol. 66, no. 11, pp. 1946–1960, Nov. 2017.
[280] J. Qadir, N. Ahad, E. Mushtaq, and M. Bilal, ‘‘SDNs, clouds, and big data: New opportunities,’’ in Proc. 12th Int. Conf. Frontiers Inf. Technol. (FIT), Dec. 2014, pp. 28–33.
[281] B. A. A. Nunes, M. Mendonca, X.-N. Nguyen, K. Obraczka, and T. Turletti, ‘‘A survey of software-defined networking: Past, present, and future of programmable networks,’’ IEEE Commun. Surveys Tuts., vol. 16, no. 3, pp. 1617–1634, 3rd Quart., 2014.
[282] H. Kim and N. Feamster, ‘‘Improving network management with software defined networking,’’ IEEE Commun. Mag., vol. 51, no. 2, pp. 114–119, Feb. 2013.
[283] J. Ashraf and S. Latif, ‘‘Handling intrusion and DDoS attacks in software defined networks using machine learning techniques,’’ in Proc. Nat. Softw. Eng. Conf. (NSEC), Nov. 2014, pp. 55–60.
[284] D. J. Dean, H. Nguyen, and X. Gu, ‘‘UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems,’’ in Proc. 9th Int. Conf. Auton. Comput., 2012, pp. 191–200.
[285] D. He, S. Chan, X. Ni, and M. Guizani, ‘‘Software-defined-networking-enabled traffic anomaly detection and mitigation,’’ IEEE Internet Things J., vol. 4, no. 6, pp. 1890–1898, Dec. 2017.
[286] K. K. Goswami. (2017). Intelligent Threat-Aware Response System in Software-Defined Networks. [Online]. Available: http://scholarworks.sjsu.edu/etd_theses/4801/
[287] A. S. da Silva, J. A. Wickboldt, L. Z. Granville, and A. Schaeffer-Filho, ‘‘ATLANTIC: A framework for anomaly traffic detection, classification, and mitigation in SDN,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2016, pp. 27–35.
[288] S.-H. Zhang, X.-X. Meng, and L.-H. Wang, ‘‘SDNForensics: A comprehensive forensics framework for software defined network,’’ in Proc. Int. Conf. Comput. Netw. Commun. Technol., vol. 3, no. 4, 2017, p. 5.
[289] P. Amaral, J. Dinis, P. Pinto, L. Bernardo, J. Tavares, and H. S. Mamede, ‘‘Machine learning in software defined networks: Data collection and traffic classification,’’ in Proc. IEEE 24th Int. Conf. Netw. Protocols (ICNP), Nov. 2016, pp. 1–5.
[290] O. G. Aliu, A. Imran, M. A. Imran, and B. Evans, ‘‘A survey of self organisation in future cellular networks,’’ IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 336–361, 1st Quart., 2013.
[291] A. Imran, A. Zoha, and A. Abu-Dayya, ‘‘Challenges in 5G: How to empower SON with big data for enabling 5G,’’ IEEE Netw., vol. 28, no. 6, pp. 27–33, Nov./Dec. 2014.
[292] X. X. Wang Li and V. C. M. Leung, ‘‘Artificial intelligence-based techniques for emerging heterogeneous network: State of the arts, opportunities, and challenges,’’ IEEE Access, vol. 3, pp. 1379–1391, 2015.
[293] A. Misra and K. K. Sarma, ‘‘Self-organization and optimization in heterogenous networks,’’ in Interference Mitigation and Energy Management in 5G Heterogeneous Cellular Networks. Hershey, PA, USA: IGI Global, 2017, pp. 246–268.
[294] Z. Zhang, K. Long, J. Wang, and F. Dressler, ‘‘On swarm intelligence inspired self-organized networking: Its bionic mechanisms, designing principles and optimization approaches,’’ IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 513–537, 1st Quart., 2014.
[299] D. Puschmann, P. Barnaghi, and R. Tafazolli, ‘‘Adaptive clustering for dynamic IoT data streams,’’ IEEE Internet Things J., vol. 4, no. 1, pp. 64–74, Feb. 2017.
[300] H. Assem, L. Xu, T. S. Buda, and D. O’Sullivan, ‘‘Machine learning as a service for enabling Internet of Things and people,’’ Pers. Ubiquitous Comput., vol. 20, no. 6, pp. 899–914, 2016.
[301] J. Lee, M. Stanley, A. Spanias, and C. Tepedelenlioglu, ‘‘Integrating machine learning in embedded sensor systems for Internet-of-Things applications,’’ in Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), Dec. 2016, pp. 290–294.
[302] P. Lin, D.-C. Lyu, F. Chen, S.-S. Wang, and Y. Tsao, ‘‘Multi-style learning with denoising autoencoders for acoustic modeling in the Internet of Things (IoT),’’ Comput. Speech Lang., vol. 46, pp. 481–495, Nov. 2017.
[303] R. Sommer and V. Paxson, ‘‘Outside the closed world: On using machine learning for network intrusion detection,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2010, pp. 305–316.
[304] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. He, ‘‘Fuzziness based semi-supervised learning approach for intrusion detection system,’’ Inf. Sci., vol. 378, pp. 484–497, Feb. 2017.
[305] L. Watkins et al., ‘‘Using semi-supervised machine learning to address the big data problem in DNS networks,’’ in Proc. IEEE 7th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2017, pp. 1–6.
[306] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[307] E. Baştuğ, M. Bennis, and M. Debbah, ‘‘A transfer learning approach for cache-enabled wireless networks,’’ in Proc. 13th Int. Symp. Modeling Optim. Mobile, Ad Hoc, Wireless Netw. (WiOpt), May 2015, pp. 161–166.
[308] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. (2016). ‘‘Federated learning: Strategies for improving communication efficiency.’’ [Online]. Available: https://arxiv.org/abs/1610.05492
[309] A. Gokhale and A. V. Bhagwat, ‘‘System and method for network address administration and management in federated cloud computing networks,’’ U.S. Patent 9 667 486, May 30, 2017.
[310] J. H. Abawajy and M. M. Hassan, ‘‘Federated Internet of Things and cloud computing pervasive patient health monitoring system,’’ IEEE Commun. Mag., vol. 55, no. 1, pp. 48–53, Jan. 2017.
[311] P. Massonet, L. Deru, A. Achour, S. Dupont, A. Levin, and M. Villari, ‘‘End-to-end security architecture for federated cloud and IoT networks,’’ in Proc. IEEE Int. Conf. Smart Comput. (SMARTCOMP), May 2017, pp. 1–6.
[312] I. Goodfellow et al., ‘‘Generative adversarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[313] W. Hu and Y. Tan. (2017). ‘‘Generating adversarial malware examples for black-box attacks based on GAN.’’ [Online]. Available: https://arxiv.org/abs/1702.05983
[314] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel. (2016). ‘‘Adversarial perturbations against deep neural networks for malware classification.’’ [Online]. Available: https://arxiv.org/abs/1606.04435
[315] L. Breiman, ‘‘Statistical modeling: The two cultures (with comments and a rejoinder by the author),’’ Statist. Sci., vol. 16, no. 3, pp. 199–231, Aug. 2001.
[316] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller, ‘‘Interpretable deep neural networks for single-trial EEG classification,’’ J. Neurosci. Methods, vol. 274, pp. 141–145, Dec. 2016.
[317] X. Zhu, C. Vondrick, C. C. Fowlkes, and D. Ramanan, ‘‘Do we need more training data?’’ in Int. J. Comput. Vis., vol. 119, no. 1, pp. 76–92, 2016.
[295] S. Latif, J. Qadir, S. Farooq, and M. A. Imran, ‘‘How 5G wireless [318] P. Domingos, ‘‘A few useful things to know about machine learning,’’
(and concomitant technologies) will revolutionize healthcare?’’ Future Commun. ACM, vol. 55, no. 10, pp. 78–87, 2012.
Internet, vol. 9, no. 4, p. 93, 2017. [319] A. Amin et al., ‘‘Comparing oversampling techniques to handle the class
[296] Z. Wen, R. Yang, P. Garraghan, T. Lin, J. Xu, and M. Rovatsos, imbalance problem: A customer churn prediction case study,’’ IEEE
‘‘Fog orchestration for Internet of Things services,’’ IEEE Internet Com- Access, vol. 4, pp. 7940–7957, 2016.
put., vol. 21, no. 2, pp. 16–24, Feb. 2017. [320] G. P. Zhang, ‘‘Avoiding pitfalls in neural network research,’’ IEEE
[297] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, ‘‘Internet of Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 37, no. 1, pp. 3–16,
Things (IoT): A vision, architectural elements, and future directions,’’ Jan. 2007.
Future Generat. Comput. Syst., vol. 29, no. 7, pp. 1645–1660, 2013. [321] A. Ng, ‘‘Advice for applying machine learning,’’ Stanford Univ., Stan-
[298] H.-Y. Kim and J.-M. Kim, ‘‘A load balancing scheme based on deep- ford, CA, USA, Tech. Rep., 2011. [Online]. Available: http://cs229.
learning in IoT,’’ Cluster Comput., vol. 20, no. 1, pp. 873–878, 2017. stanford.edu/materials/ML-advice.pdf


MUHAMMAD USAMA received the bachelor’s degree in telecommunication engineering from the Government College University, Faisalabad, Pakistan, in 2010, and the master’s degree from the National University of Computer and Emerging Sciences, Islamabad. He is currently pursuing the Ph.D. degree in electrical engineering with the Information Technology University, Lahore, Pakistan. His research interests include adversarial machine learning and computer networks.

JUNAID QADIR is currently an Associate Professor with the Information Technology University (ITU)-Punjab, Lahore, Pakistan, where he is also the Director of the IHSAN Lab that focuses on deploying ICT for development, and is engaged in systems and networking research. Prior to joining ITU, he was an Assistant Professor with the School of Electrical Engineering and Computer Sciences (SEECS), National University of Sciences and Technology (NUST), Pakistan. At SEECS, he directed the Cognet Lab, which focused on cognitive networking and the application of computational intelligence techniques in networking. His research interests include the application of algorithmic, machine learning, and optimization techniques in networks. In particular, he is interested in the broad areas of wireless networks, cognitive networking, software-defined networks, and cloud computing. He has been a recipient of the Highest National Teaching Award in Pakistan and the Higher Education Commission’s (HEC) Best University Teacher Award (2012–2013). He is a member of the ACM. He serves as an Associate Editor for the IEEE Access, the IEEE Communications Magazine, and the Springer Nature Big Data Analytics.

AUNN RAZA received the B.S. degree in computer science from the National University of Sciences and Technology (NUST), Pakistan, in 2016. He is currently pursuing the Ph.D. degree with the Data Intensive Applications and Systems (DIAS) Laboratory, EPFL, Switzerland. His research interests include distributed, high-performance, and data management systems. In particular, he is interested in scalability and adaptivity of data management systems and their designs for modern hardware.

HUNAIN ARIF received the B.S. degree in computer science from the National University of Sciences and Technology (NUST), Pakistan, in 2016, and the M.S. degree in computer science (with a specialization in data analytics) from the Swinburne University of Technology, Australia, in 2019. He is currently a CRM System Analyst with BUPA Health Insurance. His role entails the deployment of the latest business intelligence and data analytics tools and processes in their CRM/ERP Systems. He is also currently involved with Microsoft in BUPA’s Employee Exchange Program for the performance enhancement of their latest CRM, Microsoft Dynamics 365.

KOK-LIM ALVIN YAU received the B.Eng. degree (Hons.) in electrical and electronics engineering from Universiti Teknologi PETRONAS, Malaysia, in 2005, the M.Sc. degree in electrical engineering from the National University of Singapore, in 2007, and the Ph.D. degree in network engineering from the Victoria University of Wellington, New Zealand, in 2010. He is currently a Professor with the Department of Computing and Information Systems, Sunway University. He is also a Researcher, Lecturer, and Consultant in cognitive radio, wireless networks, applied artificial intelligence, applied deep learning, and reinforcement learning. He was a recipient of the 2007 Professional Engineer Board of the Singapore Gold Medal for being the Best Graduate of the M.Sc. degree, in 2006 and 2007, respectively. He serves as a TPC member and a Reviewer for major international conferences, including the ICC, VTC, LCN, GLOBECOM, and AINA. He also served as the Vice General Co-Chair of the ICOIN’18, the Co-Chair of the IET ICFCNA’14, and the Co-Chair (Organizing Committee) of the IET ICWCA’12. He serves as an Editor for the KSII Transactions on Internet and Information Systems, an Associate Editor for the IEEE Access, a Guest Editor for the Special Issues of the IEEE Access, the IET Networks, the IEEE Computational Intelligence Magazine, and the Journal of Ambient Intelligence and Humanized Computing (Springer), and a regular Reviewer for over 20 journals, including the IEEE journals and magazines, the Ad Hoc Networks, and the IET Communications.

YEHIA ELKHATIB is currently a Lecturer (Assistant Professor) of distributed systems with the School of Computing and Communications, Lancaster University, U.K., and a Visiting Professor with the Ecole de Technologie Superieure, Montreal. He works to enable distributed applications to traverse infrastructural boundaries. In the context of cloud computing, this entails looking into interoperability and migration challenges, as well as related decision support issues. He is the Creator and Chair of the International Cross-Cloud Workshop Series. Beyond the cloud, he works on border-free network architectures in intent-driven systems, systems of systems, and information centric networks. He also works on advocating network-awareness, which involves measuring networked systems, evaluating network protocols, and proposing new network management strategies.


AMIR HUSSAIN received the B.Eng. degree (Hons.) and the Ph.D. degree (in novel neural network architectures and algorithms for real-world applications) from the University of Strathclyde, Glasgow, U.K., in 1992 and 1997, respectively. Following the postdoctoral and senior academic positions held at the Universities of West of Scotland (1996–1998), Dundee (1998–2000), and Stirling (2000–2018), respectively, he joined Edinburgh Napier University, U.K., in 2018, as the founding Director of the Cognitive Big Data and Cybersecurity (CogBiD) Research Lab, managing over 25 academic and research staff. He is an invited Visiting Professor with leading universities and research & innovation centers world-wide, including with Taibah Valley, Taibah University, Medina, Saudi Arabia. His research interests are cross-disciplinary and industry focused, aimed at pioneering brain-inspired and cognitive Big Data technology for solving complex real-world problems. He has coauthored three international patents, over 400 publications (with nearly 150 journal papers), and over a dozen books. He has led major multi-disciplinary research projects, funded by the national and European research councils, local and international charities and industry, and has supervised over 30 PhDs until now. Amongst other distinguished roles, he is the General Chair for the IEEE WCCI 2020 (the world’s largest and top IEEE technical event in Computational Intelligence, comprising IJCNN, FUZZ-IEEE, and the IEEE CEC), the Vice-Chair of the Emergent Technologies Technical Committee of the IEEE Computational Intelligence Society, and the Chapter Chair of the IEEE UK & Ireland, Industry Applications Society Chapter. He is the founding Editor-in-Chief of the Cognitive Computation Journal (Springer) and the BMC Big Data Analytics Journal. He has been appointed as an Associate Editor of several other world-leading journals, including the IEEE Transactions on Neural Networks and Learning Systems, the Information Fusion Journal (Elsevier), the IEEE Transactions on Emerging Topics in Computational Intelligence, and the IEEE Computational Intelligence Magazine.

ALA AL-FUQAHA received the Ph.D. degree in computer engineering and networking from the University of Missouri-Kansas City, Kansas City, MO, USA, in 2004. His research interests include the use of machine learning in general and deep learning, in particular, in support of the data-driven and self-driven management of large-scale deployments of the IoT and smart city infrastructure and services, wireless vehicular networks (VANETs), cooperation and spectrum access etiquette in cognitive radio networks, and management and planning of software defined networks (SDN). He is an ABET Program Evaluator (PEV). He serves on the editorial boards and technical program committees of multiple international journals and conferences.
