
This article, accepted for publication in IEEE Access, serves as a primer on the intersection of machine learning (ML) and computer networks, detailing techniques, datasets, and models relevant to this field. It discusses the applications of ML in optimizing network operations and managing complexities, while also addressing the challenges faced in integrating ML with networking. The paper provides an overview of ML concepts, tools, and frameworks, along with insights into future research directions and open challenges in the domain.


This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3384460

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.0322000

Machine Learning with Computer Networks: Techniques, Datasets and Models
HAITHAM AFIFI (Member, IEEE), SABRINA POCHABA, ANDREAS BOLTRES, DOMINIC LANIEWSKI, JANEK HABERER, PAELEKE LEONARD, REZA POORZARE, DANIEL STOLPMANN, NIKOLAS WEHNER, ADRIAN REDDER, ERIC SAMIKWA, and MICHAEL SEUFERT (Member, IEEE)
Corresponding author: Haitham Afifi (e-mail: haitham.afifi@ieee.org).
This work is a result of a cooperation and continuous knowledge exchange between participants of MaLeNe Workshop 2022.

ABSTRACT Machine learning has found many applications in network contexts. These include solving optimisation problems and managing network operations. Conversely, networks are essential for facilitating machine learning training and inference, whether performed centrally or in a distributed fashion. To conduct rigorous research in this area, researchers must have a comprehensive understanding of fundamental techniques, specific frameworks, and access to relevant datasets. Additionally, access to training data can serve as a benchmark or a springboard for further investigation. All these techniques are summarized in this article, which serves as a primer and hopefully provides an efficient start for anybody doing research on machine learning for networks or using networks for machine learning.

INDEX TERMS Computer networking, Datasets, Machine learning, Metrics, Tools

I. INTRODUCTION

In recent years, the ever-growing interconnection of businesses and people and their increased reliance on networked services has prompted computer network architectures to continually grow in size and complexity. Moreover, with the increased efficiency and convenience of network-based services and businesses, the expectations of enterprises and people with respect to network performance indicators such as latency, throughput, reliability and resilience are steadily growing. Consequently, conventional algorithmic and heuristic-based approaches for network management tasks are starting to fall behind the expected levels of performance, as they fail to deliver timely and nuanced decisions in the face of the complex environment they are operating in. Meanwhile, Machine Learning (ML) has shown remarkable results in various problem domains such as discovering new antibiotic drugs [1], generating high-fidelity images from arbitrary text prompts [2] and even finding new mathematical conjectures [3]. Such successes usually become very visible even beyond the research community, and thus ML has soared in popularity in the past few years. Generally, Artificial Intelligence (AI) refers to machines or systems that can perform tasks typically requiring human intelligence, such as learning, reasoning, problem-solving, perception, language understanding, and decision-making. ML is a subfield of AI that concentrates on developing algorithms and statistical models. These models enable computers to perform tasks without explicit programming. In other words, ML involves using statistical techniques to enable machines to learn from data and improve their performance over time. ML models repeatedly show their potential for delivering high-quality output (e.g. classifications/decisions, regression values and generated artifacts) in highly complex environments with non-trivial decision boundaries. Generally, for that sort of environment, the proposed ML model greatly reduces the compute resources needed to generate an adequate response and/or generates outputs that are much "better" than what existing models could deliver. That being said, for more complex problem domains, most ML approaches require substantial amounts of compute resources and training data. Since most ML models aim at generalizing from specific records of data, the quality of these data samples is essential to the overall model performance. This often means that large amounts of data records are required to depict a sufficiently representative portion of the problem's data domain. Also, more sophisticated models can quickly explode in terms of parameter/compute operation count and thus often require specialized training hardware (i.e. memory and compute). Nevertheless, the continuous improvement of the hardware used as well as the increased attention towards training data acquisition, preparation and generation has paved the way for ML to enter more and more application domains.

VOLUME 11, 2023 1

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4

Afifi et al.: A Primer in Machine Learning with Computer Networks: Techniques, Datasets and Models

Computer networking is a highly complex problem domain with a plethora of tasks and problems that, to this day, are solved predominantly through hand-crafted, algorithmic, or heuristic methods. These methods have to respect a wide range of topologies, network types and scopes, configurations, hardware and protocol stacks, traffic patterns, and other sources of variation. Furthermore, there are many different ways to assess network performance, and in many cases, minimum performance guarantees and security policies add special constraints to the optimization problem. Additionally, contemporary networks use specialized hardware to deliver optimized performance, e.g. for forwarding packets at line speed. Oftentimes, this hardware does not easily allow ML models to replace existing functionality, e.g. because certain types of computations are not supported or because the storage is not available for more complex ML models. Finally, while network administrators and networking researchers do monitor their networks in action, the amount of useful ML training data in networking (data that is neither noisy nor incomplete, publicly available, and diverse enough to cover large parts of the problem's underlying data domain) is only a fraction of what other problem domains have at their disposal. As a consequence, optimizing network performance has so far been largely beyond the reach of ML research. However, given the increased visibility of ML, researchers are beginning to take on the aforementioned challenges that networking poses for ML, and combining ML and networking in research seems more attractive than ever. Furthermore, computer network infrastructures have been used recently to improve the performance of existing ML approaches, e.g. by distributing the training process or the data collection to improve resource utilization or training speed.

ML is a very active and rapidly expanding research field that includes an abundance of learning techniques, model types, tools and frameworks, practices, and application possibilities. Although we focus here on ML models, some applications require considering the whole running system, i.e., the AI system, to properly evaluate and understand the output, instead of focusing solely on the ML models [4]. This paper is intended as a primer/practical guide for researchers who are keen on quickly applying ML to problems in computer networking and/or leveraging networking techniques to improve the performance of their ML systems but feel overwhelmed by the possibilities the intersection of ML and computer networking provides. The key points of the paper are the following:
• It first introduces the most relevant concepts and model architectures of ML and then puts them into the context of the different networking problem domains and the latest advancements therein,
• It exposes the currently open problems within computer networking and introduces a selection of different tools, data sets, and approaches that have been popular among the research community and might serve as a starting point for future work,
• It covers several techniques for utilizing networks to improve ML efficiency, such as reducing resource requirements via Split Learning (SL) and distributed training via Federated Learning (FL) or incorporating the right inductive biases into ML models to improve their ability to generalize from limited data,
• It discusses challenges related to networks for ML, such as resource constraints, security concerns, and the lack of understanding of how ML models make decisions (and how techniques such as Explainable Artificial Intelligence (XAI) may help in gaining understanding),
• It comprehensively provides pointers for further study on related surveys and research.

The organization of the paper is visualized in Figure 1, and the remainder is organized as follows: Section II explains the basic concepts and categories of ML and relates common networking problems to them. Section III introduces the ML subfield of deep learning, which has been responsible for most of the recent ML breakthroughs, elaborating on the most common model architectures and how and why they are suited to specific tasks within computer networking. Thereafter, Section IV sheds light on the variety of accessible data sets, tools, and frameworks that ease the development and training of ML-powered networking systems. Section V discusses explainability in Artificial Intelligence (XAI), which is rightfully gaining traction because many recently tapped application domains (including computer networks) come with amounts of complexity and risk that disqualify fully black-box ML models for widespread adoption. Section VI broadens the scope presented up until now and introduces ML techniques and paradigms such as distributed and parallel learning. These techniques leverage existing networking concepts and technology and seem useful, if not mandatory, for many problems in the networking domain. Section VII and Section VIII give an overview of related survey papers and open challenges in the concerned areas, and finally, Section IX concludes this paper by summarizing its content and providing perspectives on the open challenges and questions of ML in

[Figure 1: block diagram of the paper's structure. Section 1: Introduction; Section 2: An Overview of Machine Learning; Section 3: Deep Learning - The Cool Kid of ML; Section 4: Datasets, Tools and Frameworks; Section 5: Explainable Artificial Intelligence; Section 6: Networks for Machine Learning; Section 7: Further Readings; Section 8: Challenges and Future Directions; Section 9: Conclusion.]

FIGURE 1: Overview of the organization of this paper.

networking and vice versa.

II. AN OVERVIEW OF MACHINE LEARNING
AI is the discipline of machines that solve problems by perceiving the environment and using some form of knowledge model in order to derive solutions and conclusions. ML is an integral part of AI, of which it is considered a major subfield [5]. ML models are statistically and computationally derived from evidence in the form of historical data or experience instead of explicitly programming a machine for a task. The three traditional ML paradigms are supervised, unsupervised, and Reinforcement Learning (RL). Methods can be categorized into these paradigms by the type of feedback the learning system receives. In supervised learning, exact feedback is available in the form of data labels. In unsupervised learning, on the other hand, data is only partially labeled or completely unlabeled. Finally, in RL, implicit feedback is available for observed data in terms of a so-called reward function that labels data by a numerical value. We will now discuss the three main ML paradigms with a focus on the most popular ones. We then briefly touch on some additional branches of ML that are relevant to computer networking.

A. SUPERVISED LEARNING
Supervised learning is the first of the three main types of ML and encompasses models that predict target values $y_i$ for given data points $x_i$. The starting point for the learning problem is a data set that consists of input-output data points $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$. The goal is to learn a function $h$ mapping from the input domain to the target domain such that the prediction $\hat{y}_i = h(x_i)$ approximates $y_i$ for all data points. Both input and output domains can take various shapes, such as boolean or scalar values, Euclidean vectors or more complex representations such as graphs. Depending on the type of output domain, supervised learning is generally divided into classification and regression problems. Examples of popular network applications that use supervised learning are traffic prediction [6] and classifying security attacks [7].
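The supervised setup described here (a labeled data set D and a learned function h) can be sketched in a few lines. This is only an illustration under stated assumptions: the data values are made up, and the hypothesis class (a single threshold on one scalar feature) and the names `fit_threshold`, `training_error` and `h` are hypothetical choices for this example, not something prescribed by the paper.

```python
# Hypothetical data set D = {(x_i, y_i)}: one scalar feature, binary labels.
D = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]


def training_error(threshold, data):
    """Fraction of samples for which h(x) = [x > threshold] differs from y."""
    wrong = sum(1 for x, y in data if int(x > threshold) != y)
    return wrong / len(data)


def fit_threshold(data):
    """Return the candidate threshold with the lowest training error."""
    xs = sorted(x for x, _ in data)
    # Candidates are the midpoints between neighboring feature values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: training_error(t, data))


threshold = fit_threshold(D)


def h(x):
    """Learned hypothesis: maps an input x to a predicted label."""
    return int(x > threshold)


# Predictions for the training inputs, to be compared with the labels y_i.
predictions = [h(x) for x, _ in D]
```

Because the output domain here is finite (0 or 1), this toy problem is a classification task; replacing the labels with real-valued targets would turn it into a regression task.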

1) Classification

In classification problems, the output domain is finite, e.g. true/false, sunny/cloudy/rainy or the set of digits 0-9. Examples from the networking domain include anomaly detection [8] ("Given the current network monitoring data, is the network showing abnormal behavior?") and failure prediction [9] ("Is this network node going to fail?"). The most fundamental models for classification are explained in the following paragraphs.

• Support Vector Machines (SVMs) [10] aim at constructing a so-called maximum margin separator, i.e., a decision boundary that divides samples of two different classes with a maximum possible distance to the boundary. This situation is depicted in Figure 2a. The solid black line represents the maximum margin separator and the two dashed lines visualize the margins to both classes. The nearest data points to the separator are called the support vectors (red circles), as they support the position of the decision boundary. Generally, the larger the margin, the better the generalization of the model, as it reduces the risk of misclassifying new, unseen data. Since the decision boundary is a separating hyperplane, the classification task fails for data that is not linearly separable. However, SVMs can also be used for non-linearly separable data by applying the kernel trick. It transforms the data into a higher-dimensional space where it becomes linearly separable and a separating hyperplane is calculated. When that linear hyperplane is transformed back into the original space, it becomes a non-linear or even incoherent hypersurface.
• Decision Trees [11] are structured like an inverted tree, with a root node at the top, branching out into internal nodes, and ending in leaf nodes at the bottom. The data is split at the root node and the internal nodes based on a threshold value for a feature. The splitting process continues until a stopping criterion is met, such as reaching a maximum depth or minimum number of samples in a leaf node. Leaf nodes represent the final predictions of the decision tree. The majority class in each leaf node is used as the prediction. Figure 2b visualizes a simple example decision tree (right side) with a root node and two leaf nodes for the same data as in the SVM example (left side). Data points where Feature2 ≤ −0.103 are assigned Class 1, all others are assigned Class 2. This decision boundary is visualized as the color step from green to blue in the left plot. Decision Trees are a simple yet powerful tool to reach conclusions from input data with a high degree of human explainability (see Section V).
• Random Forests [12] create a collection of decision trees, each trained on a different subset of the training data. To achieve as little inter-tree correlation as possible, a random subset of features is considered for each split at each node when constructing the decision trees. The results of all individual decision trees are aggregated to a final prediction. The class with the majority vote among the trees is chosen as the final prediction. Figure 2c visualizes the decision boundary of a random forest with three individual trees for the same data set used for the SVM and decision tree examples. Compared to single decision trees, random forests are known to improve the prediction accuracy as well as to reduce overfitting [12].
• The k-Nearest Neighbors (KNN) [13] algorithm is a simple technique to assign a class label to a new data point by examining the class labels of its k nearest neighbors with known labels. Given the features of the input data point, these k nearest neighbors are determined by calculating a distance metric in the input space, e.g., the Euclidean distance, Manhattan distance, or Minkowski distance. The class label of the majority of these neighbors is then inherited by the new data point. Figure 2d visualizes the decision boundary for the known data set using the Minkowski distance and k = 5.

2) Regression
In regression problems, the output domain is continuous, e.g., $\mathbb{R}^n$ ($n \geq 1$). Examples from the networking domain include network performance prediction [14] ("How will the network perform in the future, given certain network conditions and traffic?") and traffic prediction [15] ("How much / which type of traffic will be generated in the near future?"). In principle, any function $f_w$ with learnable parameters $w$ can serve as a regression model. However, the structure of $f_w$ and the optimization procedure used to update the learnable parameters are crucial to finding good function parameters efficiently. The most fundamental regression methods will be explained in the following paragraphs. All of the aforementioned classification methods can also be used for regression with slight modifications.
• Support Vector Regression (SVR) [16] is an extension of SVMs for regression tasks. It aims to find a function f that approximates the relationship between input features and continuous target values with a certain degree of error tolerance. The error tolerance (ϵ) defines an ϵ-tube around f. Inside this tube, errors from the regression model are not penalized. The algorithm maximizes the number of training data points inside this tube. ϵ is a parameter defined by the user. Similar to SVMs, the kernel trick can be applied to create non-linear SVRs.
• Decision Trees [11] can be used for regression tasks by using the average value of the samples in each leaf node as the prediction value.

[Figure 2: four scatter plots of the same two-class data set (axes: Feature 1, Feature 2; legend: Class 1, Class 2), each showing the decision boundary of one classifier: (a) SVM, with the support vectors marked; (b) Decision Tree, shown next to its tree; (c) Random Forest; (d) k-Nearest Neighbor. The tree in (b) splits on Feature 2 <= -0.189 at the root (30 samples). The "yes" branch is a leaf with 13 samples (Class 1). The "no" branch (17 samples) splits on Feature 2 <= 0.223: its "no" branch is a leaf with 13 samples (Class 2), and its "yes" branch (4 samples) splits on Feature 2 <= -0.023 into two leaves of 2 samples each (Class 2 for "yes", Class 1 for "no").]

FIGURE 2: Visualization of different classification methods based on a data set containing 30 samples and two features. The data set contains 15 samples from each of two classes that are perfectly linearly separable. Gaussian noise with a mean of 0 and a standard deviation of 0.8 is added, making classification errors likely.
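To make the KNN rule from the classification list concrete, the following is a minimal standard-library sketch of majority voting among the k nearest neighbors under a Minkowski distance. The training points, the query points and the function names `minkowski` and `knn_predict` are hypothetical; they are not the data or code behind Figure 2.

```python
from collections import Counter

# Hypothetical two-feature training set: (features, class label).
TRAIN = [
    ((-2.0, -1.0), 1), ((-1.0, -0.8), 1), ((0.0, -1.2), 1),
    ((1.0, -0.6), 1), ((2.0, -0.9), 1),
    ((-2.0, 1.0), 2), ((-1.0, 0.7), 2), ((0.0, 1.1), 2),
    ((1.0, 0.9), 2), ((2.0, 0.8), 2),
]


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)


def knn_predict(train, query, k=5, p=2):
    """Return the majority label among the k training points closest to query."""
    neighbors = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


label_low = knn_predict(TRAIN, (0.0, -0.5))   # query below the class gap
label_high = knn_predict(TRAIN, (0.0, 0.6))   # query above the class gap
```

For regression, the same neighbor search can be reused by averaging the neighbors' target values instead of counting votes.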

• Random Forests [12] for regression use the average of the individual trees' predictions as the final prediction value.
• The KNN [13] method for regression calculates the label for the new data point by calculating the average target value of its k nearest neighbors.
• The most popular regression method is least-squares fitting, in which the model is updated to minimize the mean of the squared L2 norms of the differences between the predicted values and their associated labels. This objective is known as the Mean Squared Error (MSE). In other words, the least-squares method fits a line to the data points in a way that minimizes the sum of the squared vertical distances between the line and the points. In linear regression, this line is represented by a linear function, while in logarithmic regression, it is represented by a logarithmic function.

B. UNSUPERVISED LEARNING
As opposed to supervised learning, in unsupervised learning, the data comes without output/target values. Consequently, ML models are tasked with finding the underlying regularities in the data domain by inferring them from the given training data. The two main types of unsupervised learning, namely clustering and dimensionality reduction, differ in their use case. Unsupervised learning has been used for tasks such as anomaly detection, intrusion detection [17] and data traffic analyses [18].

1) Clustering

Clustering approaches use the data points' feature values to find regularities in the data domain and thus divide them into multiple semantically meaningful categories. Clustering approaches such as k-means or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [19] differ in the way cluster affiliation is calculated, for example, through data density or neighbor connectivity via measurable distance between the data points. Within the networking domain, data grouping can serve as a useful starting point for further analysis and action in a variety of problem settings, such as anomaly detection and resolution [20], task classification for scheduling [21], or traffic characterization for traffic engineering [22].

In general, there are different metrics to evaluate the performance of ML algorithms. Table 3 shows the most common metrics, which appear in the literature, used for supervised learning (with an emphasis on classification metrics that are typically used for evaluating traffic prediction) and unsupervised learning (with an emphasis on clustering metrics as seen in intrusion detection as
TABLE 1: Summary of supervised learning

| Model | Type | Description | Advantages | Disadvantages |
|---|---|---|---|---|
| Support Vector Machine (SVM) | Regression and classification | A supervised learning algorithm that finds a hyperplane that separates the data into different classes or predicts a continuous value based on the input features. | Can handle high-dimensional data, nonlinear relationships, and outliers. Can use different kernels to customize the model. | Can be computationally expensive, sensitive to hyperparameters, and difficult to interpret. |
| Support Vector Regression (SVR) | Regression | A type of SVM that predicts a continuous value based on the input features. It uses an epsilon-insensitive loss function to measure the error between the predicted and actual values. | Can handle high-dimensional data, nonlinear relationships, and outliers. Can use different kernels to customize the model. | Can be computationally expensive, sensitive to hyperparameters, and difficult to interpret. |
| Decision Trees | Regression and classification | A supervised learning algorithm that splits the data into smaller subsets based on some criteria until the leaf nodes are reached. The leaf nodes represent the class labels or the predicted values. | Easy to understand, interpret, and visualize. Can handle both numerical and categorical data. Can capture complex nonlinear relationships. | Prone to overfitting, instability, and bias. Sensitive to noise and missing values. |
| Random Forests | Regression and classification | An ensemble learning algorithm that combines multiple decision trees and aggregates their predictions using majority voting or averaging. It uses bootstrap sampling and feature selection to introduce randomness and reduce correlation among the trees. | Can improve the accuracy and robustness of decision trees. Can handle both numerical and categorical data. Can capture complex nonlinear relationships. Can estimate feature importance. | Prone to overfitting, especially with noisy data. Can be computationally expensive and slow to train and test. Less interpretable than single decision trees. |
| k-Nearest Neighbors (KNN) | Regression and classification | A lazy learning algorithm that predicts the class label or the value of a new instance based on the similarity with its k nearest neighbors in the training data. It uses a distance metric such as Euclidean or Manhattan to measure the similarity. | Simple and intuitive to implement and understand. No training required. Can handle both numerical and categorical data. Can adapt to new data dynamically. | Sensitive to noise, outliers, and irrelevant features. Can be computationally expensive and slow to test. Requires storage of the entire training data. Difficult to choose the optimal value of k. |
| Least Squares | Regression | A method of fitting a linear model to the data by minimizing the sum of squared errors between the observed and predicted values. It can be solved analytically using normal equations or iteratively using gradient descent or other algorithms. | Simple and fast to implement and solve. Provides a closed-form solution for linear models. Can handle multiple features and multicollinearity. | Sensitive to noise, outliers, and nonlinearity. Prone to overfitting or underfitting, depending on the complexity of the model. May suffer from numerical instability or singularity issues. |
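The least-squares entry of Table 1 mentions an analytical solution. For a single input feature, the normal equations reduce to the familiar closed form (slope = covariance of x and y divided by the variance of x), which the sketch below implements together with the MSE. The data points and the function names `fit_line` and `mse` are hypothetical and only serve as an illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Mean squared error between the line's predictions and the targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up samples that roughly follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.0]

slope, intercept = fit_line(xs, ys)
fitted_mse = mse(xs, ys, slope, intercept)
guess_mse = mse(xs, ys, 2.0, 1.0)  # the "eyeballed" line y = 2x + 1
```

By construction, no other straight line achieves a lower MSE on these samples than the fitted one, so `fitted_mse` is no larger than `guess_mse`.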


well as node(s) selection for data collection).

2) Dimensionality Reduction
This type of learning analyzes the statistical properties of the data in order to reduce the number of dimensions needed to sufficiently describe the data. This is particularly useful when dealing with more complex learning problems, as theoretical results show that the amount of data points needed to learn an accurate model scales exponentially with the dimensionality of the input data domain [23] (this phenomenon has been coined the "curse of dimensionality"). While approaches like Decision Trees or Random Forests can reduce the dimensionality of the relevant portions of the data by considering only the most meaningful features, approaches like Principal Component Analysis (PCA) [24] find a reduced-cardinality combination of new features. Like clustering, this type of unsupervised learning is beneficial as a preparative step before further analysis or model training, especially since, in many real-world scenarios, it has been observed that the given data lies on manifolds of much lower dimensionality than the actual input space (the presumed general rule for this is called the manifold hypothesis [25]).

Table 1 summarizes supervised methods, while Table 2 summarizes unsupervised methods. Further details can be found in [26], [27]. Regardless of which method is used, it is important to watch out for over- and underfitting. Overfitting is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex.
Underfitting, on the other hand, is the inverse of overfitting. It means that the statistical or ML model is too simplistic to accurately capture the patterns in the data. A sign of underfitting is that the model exhibits high bias and low variance.

C. FURTHER ML METHODS
There are various other branches of ML that are of use in computer networks, see [28], [29]. Here, we discuss two additional ML frameworks that are presumably relevant in the networking domain.

1) Probabilistic ML
Oftentimes, neither all relevant information is known

learning paradigms, model architectures and problem domains that come with some notion of uncertainty. Since its comprehensive introduction would exceed this paper's scope, we point the interested reader to [30] for a high-level overview, and [31] for an extended overview of the core concepts of probabilistic ML.

2) Hybrid Learning Approaches
Many ML contributions do not fully fall into one of the aforementioned learning paradigms but rather combine their ideas and create new sources for learning signals. Some of these "hybrid" learning approaches are popular enough to earn their own description. In semi-supervised learning, typically, only parts of the training data are labeled [27]. To train a model in a supervised or unsupervised manner, auxiliary information is extracted using the respective other learning type. Self-supervised learning, on the other hand, tackles shortcomings of supervised learning approaches (i.e., the need for large amounts of data and vulnerability to adversarial inputs) by using parts or representations of the input data as labels [32]. For example, in [33], a model is trained to predict future video frames by only feeding it the first few frames of a video and using the remaining frames as "comparison" labels.

D. REINFORCEMENT LEARNING
In the spectrum of traditional learning paradigms for intelligent agents, Reinforcement Learning (RL) is located between the two extreme domains of fully supervised and unsupervised learning. RL is particularly suitable for decision, control, and optimization problems where data and observations are received sequentially [34]. As such, RL can be applied to various challenging problems in network science [35]–[37]. In particular, Deep RL (DRL) methods, discussed in Section III-C, have seen tremendous success in solving resource allocation problems in computer networking [38].
The implementation of RL is based on an RL agent that receives performance feedback called rewards as the agent interacts with an environment over time [39]. The algorithm designer typically crafts the reward as a function of the agent's sequential observations. The rewards, however, do not provide exact instructive feedback on how to change the agent's behavior, which is why RL sits between the supervised and unsupervised extremes in the spectrum of learning paradigms. In this section, we will describe the basics of RL and the most
or attainable prior to making a decision, nor is the fundamental algorithms. Throughout, we will directly
environment that reacts to the taken decision purely refer to applications in computer networking for almost
deterministic [5]. Uncertainty may exist in the input all mentioned algorithms.
data, in the decision model parameters and output The interaction of an RL agent with its environment
values and even in the architecture of the decision is described by a Markov Decision Process (MDP) as
model itself. [30]. In all of these cases, probability illustrated in Figure 3. Whenever one seeks to solve a
theory provides a unified framework to cope by using problem using RL, the first step (arguably the most
probability distributions to model uncertain quantities. important) is to define the problem as an MDP. Based
This framework is in principle, applicable to all ML on this MDP one then chooses or designs a suitable RL

TABLE 2: Summary of unsupervised learning

| Method | Task | Example | Advantage | Disadvantage |
|---|---|---|---|---|
| K-means clustering | Exclusive clustering | Market segmentation, document clustering, image segmentation | Simple and fast, easy to interpret | Sensitive to outliers and initial centroids, fixed number of clusters |
| Fuzzy k-means clustering | Overlapping clustering | Customer loyalty analysis, medical diagnosis | Allows for uncertainty and ambiguity, more flexible than hard clustering | More computationally expensive, harder to interpret |
| Hierarchical clustering | Agglomerative or divisive clustering | Phylogenetic analysis, social network analysis | No need to specify number of clusters, produces a dendrogram that shows the hierarchy of clusters | More computationally expensive, sensitive to noise and outliers |
| Principal component analysis | Dimensionality reduction | Data compression, feature extraction | Reduces complexity and noise, preserves most of the variance | Loses some information, may not be optimal for some tasks |
| Association rule mining | Finding frequent patterns or rules | Market basket analysis, web usage mining | Reveals interesting and useful relationships in data, can handle large datasets | May generate too many or too few rules, may need domain knowledge to evaluate rules |
| Autoencoder | Generative model | Image denoising, anomaly detection | Can learn complex and nonlinear representations of data, can generate new data similar to the input data | May overfit or underfit the data, may need careful tuning of parameters |
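To make the k-means row of Table 2 more concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm for exclusive (hard) clustering. The function name `kmeans`, the toy 2-D points, and the fixed seed are assumptions made for illustration only; the number of clusters k has to be fixed in advance, and the result can depend on the randomly drawn initial centroids, as noted in the table.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm for exclusive (hard) clustering of 2-D points."""
    rng = random.Random(seed)
    # Initial centroids are drawn from the data; results can depend on this choice.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster happens to be empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Hypothetical toy data: two well-separated groups of three points each.
near = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3)]
far = [(5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
centroids, clusters = kmeans(near + far, k=2)
```

On such well-separated data the two groups are recovered; on real network measurements, several restarts with different seeds are usually advisable.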

TABLE 3: Common metrics of ML

| Metric | Definition | Supervised | Unsupervised |
|---|---|---|---|
| Accuracy | For classification tasks, it measures the proportion of correct predictions made by the model. | ✓ | |
| F1-score | For classification tasks, it is a balance between precision and recall. | ✓ | |
| AUC-ROC | For classification tasks, it measures the trade-off between the true positive rate and false positive rate. | ✓ | |
| Root Mean Squared Error (RMSE) | For regression tasks, it measures the difference between predicted and actual values. | ✓ | |
| Normalized Mutual Information (NMI) | For clustering tasks, it measures the similarity between two partitions of a network. | | ✓ |
| Jaccard Similarity | For community detection tasks, it measures the similarity between two sets of nodes. | | ✓ |
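As an illustration of how some of the metrics in Table 3 are computed, the following sketch implements accuracy, F1-score, RMSE, and Jaccard similarity from their standard definitions. The function names and the toy label vectors are assumptions made for this example; in practice, library implementations are typically used instead.

```python
import math

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    """Root of the mean squared difference between actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two node sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical labels, e.g. 1 = "flow is anomalous", 0 = "flow is normal".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)                      # 6 of 8 correct
f1 = f1_score(y_true, y_pred)                       # precision = recall = 3/4
err = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])        # sqrt(5/3)
sim = jaccard({1, 2, 3}, {2, 3, 4})                 # 2 shared nodes out of 4
```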

algorithm to find a solution to the MDP. In general, an MDP can be considered as a system that can assume states s from a state space S. The MDP transitions to a new state s′ according to a controlled transition probability distribution p(s′ | s, a) after the agent has taken an action a from an action space A. Once the system transitions to the new state, the agent receives a feedback/reward signal r(s, a, s′). An MDP is therefore typically stated as a four-tuple (S, A, p, r). The design of the reward signal r is arguably the most important part of defining an MDP. We will discuss some best practices for reward design in Section III-C. Excellent introductory material for RL is the course by David Silver.¹ For background on partially observable MDPs, see the web page of Anthony R. Cassandra.²

¹ https://www.deepmind.com/learning-resources/introduction-to-reinforcement-learning-with-david-silver
² http://www.pomdp.org

FIGURE 3: Illustration of the Markov Decision Process (MDP) feedback loop. (Diagram: the agent with policy π observes state $s_n$ and sends action $a_n$ to the environment, which transitions according to $p(s_{n+1} \mid s_n, a_n)$ and returns the state $s_{n+1}$ and the reward $r(s_n, a_n, s_{n+1})$.)

The goal in reinforcement learning is to find a policy (decision rule) π : S → A that maps states to actions so as to optimize an objective function. Since all future states can potentially be influenced by an action at a current time step via the transition probabilities p(s′ | s, a), it is natural to consider an objective function that captures the whole trajectory of future rewards. The most common objective is the discounted infinite-horizon accumulated reward $R := \sum_{n=1}^{\infty} \gamma^{n-1} r(s_n, a_n, s_{n+1})$ with discount factor γ ∈ (0, 1). The intuition is that higher


weight is given to rewards in the near future, and the weights of future rewards decay geometrically. For background on average and total cost MDPs see [40, Chapters 4 & 5].
Given a policy π, the associated action-value function (also called Q-function) is defined as

$Q^{\pi}(s, a) := \mathbb{E}_{\pi}\left[ R \mid s_1 = s,\ a_1 = a \right]$. (1)

The Q-function is a fundamental object in RL and describes what accumulated reward R one can expect when starting in state s, taking action a, and following the policy π for all future states. Furthermore, for finite action spaces, the Q-function can directly be used to implement a policy by setting $\pi(s) = \operatorname{argmax}_{a \in A} Q(s, a)$. This makes RL algorithms that seek to find or approximate the optimal Q-function³ attractive since they immediately lead to simple, implementable policies.

³ The optimal Q-function is given by the solution to Bellman's equation [41, Section 5.6].

1) Basic RL Algorithms
RL algorithms can be roughly divided into three groups: value-based methods, policy-based methods, and actor-critic methods. Value-based methods seek to find or approximate value functions like the Q-function. Policy-based methods instead seek to optimize a policy π directly. Value-based methods, therefore, yield an implicit policy, whereas policy-based methods yield an explicit policy. Actor-critic methods combine learning in value and policy space and use a learned value function to "guide" the training of an explicit policy. See [41, p. 36] for an illustration of the actor-critic feedback loop.
Before stating some examples of popular RL algorithms, we have to distinguish some typical MDP and RL settings:
1) Continuous state (e.g., $S = \mathbb{R}^n$) vs. finite state MDPs (e.g., $S = \{1, \ldots, d\}$).
2) Continuous action vs. finite action MDPs.
3) Model-based vs. model-free RL problems.
Model-based RL usually focuses on offline planning of value functions and policies, where either the transition function p is given or where p will be approximated [42]. Model-free RL methods instead seek to determine what action to take in a given state without knowledge of p, e.g., solely by observing MDP transitions (s, a, r, s′).
The traditional algorithms for model-based RL in finite state and action MDPs are value and policy iteration [40, Chapter 2]. Some recent applications of modern value iteration algorithms in the context of networking are age-of-information minimization in wireless broadcast networks [43] and multi-agent routing [44]. The most well-known model-free RL algorithm is tabular (simulation-based) Q-Learning, which seeks to find the optimal Q-function. Under simple conditions, Q-Learning is guaranteed to converge to the optimal Q-function if all states and actions are explored infinitely often [40, Section 6.6.1]. Q-Learning has been successfully applied to various problems in computer networks, e.g., network self-organization [45], network slicing [46] or virtual network embedding [47].

2) RL with Function Approximation
The rise of RL as a powerful tool for decision-making is largely due to the effective use of function approximation. When the state space S becomes large or continuous, the traditional algorithms become impractical. Function approximation solves this problem by enabling RL agents to infer information about unseen state-action pairs from observed state-action pairs. The approximation may be used in policy space, in value space, or in both. For example, a Q-function Q(s, a) can be approximated by a function $Q_\theta(s, a)$ with parameters θ. This is the basis of deep Q-learning, to be explained in Section III-C. Function approximation was an important part of RL even before the rise of deep neural networks [48].
In Section III-C, we will discuss RL with deep neural networks as function approximators. Here, we highlight the traditional class of stochastic policy gradient algorithms with policy function approximation for MDPs with finite action space. Define a stochastic policy $\pi_\theta(s, a)$ with parameters θ; $\pi_\theta(s, a)$ maps states to a distribution on A.
The stochastic policy gradient theorem [49] has given rise to a large class of algorithms in which the action-value function $Q^{\pi_\theta}(s, a)$ appearing in the policy gradient is replaced by a suitable estimator. For example, the REINFORCE algorithm [50] uses a Monte-Carlo estimator; actor-critic algorithms add approximation in value space with an additional function approximator $Q_w(s, a)$ with parameters w in place of $Q^{\pi_\theta}(s, a)$ [51]; and the well-known Advantage Actor-Critic (A2C) algorithm [52] uses an approximation $A_w(s, a)$ of the advantage function $A^{\pi_\theta}(s, a) := Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, where $V^{\pi_\theta}(s) := \mathbb{E}_{a \sim \pi_\theta}[Q^{\pi_\theta}(s, a)]$. These algorithms have been used for various scheduling and resource allocation tasks in data centers [53], wireless networks [54], edge computing [55] or vehicular networks [56].

3) Exploration and Curiosity in RL
RL methods trade off exploration against exploitation during training, i.e., agents either explore some random action or exploit their current best guess of the optimal action for the current state. Some methods, on the other hand, focus purely on exploration as a metric for learning. Such methods may seek to explore as many unseen states during training as possible. Another approach is to explore promising states, e.g., those parts of the state space where the current approximation of a certain value function is particularly bad. Such methods are known as curiosity-driven RL [57], [58].
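The tabular Q-Learning update and the exploration/exploitation trade-off described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the five-state chain MDP, the constants `GAMMA`, `ALPHA` and `EPSILON`, and all function names are hypothetical choices made for this example. The agent observes transitions (s, a, r, s′) only, i.e., it is model-free.

```python
import random

N_STATES = 5                    # states 0..4; state 4 is the terminal goal (toy MDP)
ACTIONS = (0, 1)                # 0 = move left, 1 = move right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1

def step(state, action):
    """Deterministic chain: reward 1 only when the goal state is reached."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # tabular Q(s, a)
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy: explore a random action or exploit the current
            # best guess (ties between equal Q-values are broken at random).
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[s])
                a = rng.choice([b for b in ACTIONS if q[s][b] == best])
            s_next, r = step(s, a)
            # Q-Learning target; the terminal state has no future value.
            done = s_next == N_STATES - 1
            target = r if done else r + GAMMA * max(q[s_next])
            q[s][a] += ALPHA * (target - q[s][a])
            s = s_next
            if done:
                break
    return q

q_table = train()
# Implicit greedy policy pi(s) = argmax_a Q(s, a) for the non-terminal states.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy moves right in every non-terminal state. For large or continuous state spaces the table `q` becomes impractical, which is where function approximation comes in.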
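For reference, the stochastic policy gradient theorem cited as [49] in the subsection on function approximation is commonly stated in the form below, written here in the notation of this section. The symbols $J(\theta)$ (expected discounted reward under $\pi_\theta$) and $d^{\pi_\theta}$ (discounted state-visitation distribution) are introduced only for this sketch and are assumptions about notation, not taken from the text.

```latex
\nabla_\theta J(\theta) \;\propto\;
\mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(s,\cdot)}
\left[ \nabla_\theta \log \pi_\theta(s,a)\; Q^{\pi_\theta}(s,a) \right]
```

Read this way, REINFORCE replaces $Q^{\pi_\theta}(s,a)$ with a sampled return, actor-critic methods replace it with $Q_w(s,a)$, and A2C replaces it with $A_w(s,a)$. Subtracting the state-dependent baseline $V^{\pi_\theta}(s)$ leaves the expectation unchanged while typically reducing the variance of the estimate.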

FIGURE 4: Markov game feedback loop over discrete time n. (Diagram: agent i sends its local action $a^i_n$ to the environment, which has global state $s_n$ and transitions according to $p(s_{n+1} \mid s_n, a_n)$ to the global state $s_{n+1}$; the agent receives the local state $s^i_{n+1}$ and the local reward $r^i(s_n, a_n)$.)

4) Multi-agent RL
Multi-Agent RL (MARL) problems are formulated as Markov games [59], where, depending on the local reward structure, the agents may cooperate or compete. An illustration is given in Figure 4 from the perspective of some agent i in a Markov game environment. Most notably, the environment typically transitions to a new state as a function of all local actions and all local states. There are several additional properties to classify MARL settings, such as whether the setting is decentralized, or whether or to what extent agents transition independently [60]. We will discuss the two most common deep MARL algorithms in Section III-C. For a survey of MARL algorithms and various applications and challenges in computer networks, see [61] and [62].

5) RL with constraints
Constrained RL (CRL) [63] is a paradigm for constrained MDPs. The goal is to ensure that the agent's actions do not violate any environmental constraints. Constraints can be specified as hard (absolute and must always be satisfied) or soft (desired but can be violated if necessary). Safe RL (SRL) [64], on the other hand, aims to learn policies that minimize the likelihood of unsafe actions while maximizing the long-term expected reward for safe actions. Accordingly, CRL and SRL both focus on ensuring that the agent's actions do not violate certain constraints, but they formulate the constraints in slightly different ways. Both approaches are used when resources (such as bandwidth, computation, and energy) are limited [65]–[67] or when applications impose additional constraints (such as throughput and latency) [68], [69].

III. DEEP LEARNING - THE COOL KID OF ML
Deep Learning is a subfield of ML that aims at facilitating the learning of complex data representations by learning hierarchies of simpler intermediate representations [5], [70]. The resulting "stacking" of model blocks (predominantly Neural Network (NN) layers) is what gives Deep Learning its name. While the term Deep Learning has been around for decades, it only started to gain widespread traction in 2012 with the widely visible success of AlexNet [71] winning a widely popular image classification challenge with a deep Convolutional Neural Network (CNN) (see Section III-B). Since then, the rate of progress concerning deep NN architectures, paradigms and learning techniques has skyrocketed, and the development of specialized hardware such as high-end GPUs or TPUs has led to deep learning models with millions or even billions of parameters [72]. As a consequence, a growing proportion of ML applications nowadays employ deep NN architectures. For networking applications, NNs are a powerful tool to learn non-linear relationships for complex problems. We shall now begin with a review of the most important NN background.

A. NEURAL NETWORKS
Given the brain's central role in almost any cognitive process, neuroscientists have long tried to understand its inner workings and mechanisms. In [73], a mathematical model for a neuron was introduced that has since inspired an emerging class of ML model architectures: Artificial NNs [5]. In NNs, a neuron j receives inputs $a_i$ from nodes i = 1, . . . , n and a bias input $a_0 = 1$, and first computes a weighted sum using link weights $w_{ij}$: $a'_j = \sum_{i=0}^{n} w_{ij} a_i$. Then, it computes the output (also called activation) $o_j = g(a'_j)$ using an activation function g [74]. If multiple such neurons (mostly called perceptrons in the ML community) are connected in a directed and acyclic manner, they form a so-called feed-forward network that is usually arranged in layers. In such layered feed-forward networks, each neuron receives the outputs of the neurons of the previous layer, with the first layer receiving the overall model input and the last layer providing the overall model output. The resulting compound function expressed with the network can be highly complex; in fact, it is shown in [75] that with as few as one intermediate neuron layer and by choosing any squashing activation function (e.g., a non-decreasing function converging towards 0 and 1 on its respective ends), NNs can theoretically approximate any continuous function uniformly on any compact set to an arbitrary degree of accuracy. This statement is also known as the Universal Approximation Theorem (UAT). The UAT becomes even more interesting once we view the entire NN as a function $h_w(x)$ of the input vector x and the NN weights w [5]. The UAT implies that there exists a weight parameter configuration that sufficiently approximates the function which describes the desired solution to a problem. Hence, many learning problems can be viewed as a problem of function approximation to find the right NN weights. The most common technique to update the NN weights is gradient descent in combination with the so-called backpropagation algorithm [76]. For example, to update an NN using an input-output tuple (x, y), backpropagation calculates the derivative $\frac{\partial}{\partial w} |y - h_w(x)|^2$ of the output error with respect to the NN weights. This calculation is done sequentially starting from the output layer by applying the chain rule on the

above derivative. Calculated gradients are then used to update the NN weights iteratively. Various algorithms have been proposed throughout the last decade to use the aforementioned calculated gradients most effectively. The most well-known algorithm is the ADAM optimizer [77], which adaptively selects the stepsize for individual NN weights based on the calculated gradient information. See [78] and the references therein for various other gradient-based methods. Note that most tools (such as PyTorch and TensorFlow) offer these optimizers as black boxes, so that users do not have to deal with the implementation details.

B. DEEP NEURAL NETWORK ARCHITECTURES
The UAT seems to advocate that rather simple feed-forward NN architectures can be used for any problem that might be solvable with ML. In practice, however, the findings of the UAT are greatly humbled by the excessive amounts of training data, the size of the NN models, and the training time necessary to achieve satisfactory results on a complex task. Furthermore, for many tasks, it can be observed that the members of the underlying data domain are semantically composable into simpler entities, spanning a hierarchy of concepts. As a consequence, researchers have started to add more structure to their models.
Different model architectures have proven effective for different tasks. An overview of common deep learning model architectures is given in Table 4. In the following subsections, we briefly present the most popular model archetypes and refer to the provided references for further reading. Interestingly, all of the model archetypes introduced below are derivable from the same basic mathematical framework and only differ in the shape of data and the assumptions made about regularities in the data [79].

1) Multilayer Perceptrons (MLP)
Standard Deep Neural Networks (DNNs) consisting of multiple hidden layers of neurons are also called Multilayer Perceptrons (MLPs). Together with non-linear activation functions, they have long become a standard tool for processing vector-shaped inputs as their hierarchy of non-linear function approximators is widely applicable across many problem domains [80]. However, given that MLPs make rather few assumptions about the input and output data other than it being shaped as a vector, for many tasks these models often perform unfavorably compared to more specialized models of similar size. A detailed introduction to MLPs is given in [70]. MLPs have already been used in computer and wireless networks, e.g., for channel decoding [81], in resource allocation [82], and in intrusion detection [83].

2) Convolutional Neural Network (CNN)
In many problem domains, data exists on a grid-like structure where spatial patterns carry the same semantic information regardless of their location in the grid (also referred to as translation invariance). Examples include images (2D grids) but also time-series data (1D grids). To exploit this symmetry, Convolutional Neural Networks (CNNs) utilize spatial convolution, which applies the same learnable spatial parametric kernels (i.e., small matrices with learnable individual entries) on evenly spaced patches of the input grid [70]. The re-usage of a set of such kernels across multiple image positions is called weight sharing and greatly reduces the number of parameters needed to learn and extract the patterns of the input data.

3) Recurrent Neural Network (RNN)
For dealing with sequential data such as time series, Recurrent Neural Network (RNN) elements such as the Long Short-Term Memory (LSTM) [84] or the Gated Recurrent Unit (GRU) [85] have proven very useful. The commonality between all RNNs is feeding a portion of the output back into the RNN block for subsequent computations, enabling NN architectures with recurrent elements to capture sequential dependencies within the data [70].

4) Graph Neural Network (GNN)
Recently, Graph Neural Networks (GNNs) have emerged as powerful architectures for handling graph-structured data. They utilize permutation-invariant aggregation/pooling operations and permutation-equivariant message passing operations to learn patterns in the data while respecting the graph topology rather than assuming any specific ordering of its nodes and edges [86].

5) Generative Adversarial Network (GAN)
Generative Adversarial Networks (GANs) [87] have emerged as a powerful tool for generating realistic data samples, including images [88], videos [89], and audio [90], but also network traffic [91]. GANs consist of two NNs: a generator that creates synthetic samples, and a discriminator that tries to distinguish between real and fake samples. These two networks are trained simultaneously in an adversarial setting, where the generator tries to fool the discriminator, while the discriminator tries to correctly identify the real samples. In the context of computer networks, GANs are useful for generating synthetic network traffic patterns that mimic real-world traffic. This is useful for testing and evaluating network performance metrics, intrusion detection systems, and network security protocols. See Section IV-B3 for example applications of GANs being used to generate data.
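The neuron model and the gradient-based weight update described in Section III-A can be sketched for a single neuron as follows. This is an illustrative toy under stated assumptions: the sigmoid is used as the squashing activation g, the logical OR function serves as a hypothetical dataset, and the names `forward`, `train`, the learning rate and the epoch count are chosen for this example only. For one neuron, backpropagation reduces to a single application of the chain rule.

```python
import math

def sigmoid(z):
    """A squashing activation g: non-decreasing, tends to 0 and 1 at its ends."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, inputs):
    """Weighted sum a'_j = sum_i w_ij * a_i with bias input a_0 = 1, then o_j = g(a'_j)."""
    a = [1.0] + list(inputs)                      # prepend the bias input a_0 = 1
    pre_activation = sum(w * x for w, x in zip(weights, a))
    return sigmoid(pre_activation), a

def train(samples, epochs=2000, lr=0.5):
    weights = [0.0, 0.0, 0.0]                     # w_0 (bias weight), w_1, w_2
    for _ in range(epochs):
        for inputs, y in samples:
            out, a = forward(weights, inputs)
            # Chain rule: d/dw_i (y - out)^2 = -2 (y - out) * g'(a') * a_i,
            # where g'(a') = out * (1 - out) for the sigmoid.
            delta = -2.0 * (y - out) * out * (1.0 - out)
            # Plain gradient descent step on every weight.
            weights = [w - lr * delta * x for w, x in zip(weights, a)]
    return weights

# Logical OR as a toy, linearly separable input-output dataset.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train(samples)
predictions = [round(forward(weights, x)[0]) for x, _ in samples]
```

Adaptive optimizers such as ADAM replace the fixed step size `lr` with per-weight step sizes derived from the gradient history; the forward pass and chain-rule computation stay the same.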


TABLE 4: Overview on deep learning architectures.

| Data | Architecture | Task | Use Case Examples |
|---|---|---|---|
| Tabular (2D) | Multilayer Perceptrons (MLP) | Classification/Regression | Anomaly detection |
| Time Series (3D) | 1D Convolutional Neural Network (CNN) | Classification/Regression | Estimate network congestion |
| Time Series (3D) | Recurrent Neural Network (RNN) | Generative/Forecasting | Forecast network traffic |
| | Transformers | Representation Learning | Unsupervised representation learning of network traffic |
| Graph | Graph Neural Network (GNN) | Graph Classification | Overall network performance |
| Graph | Graph Neural Network (GNN) | Node Classification | Congestion at node |
| Graph | Graph Neural Network (GNN) | Edge Classification | Packet delay on network link |
| Any | Generative Adversarial Network (GAN) | Generative Modeling | Packet stream generation |
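To illustrate the weight sharing behind the 1D CNN row of Table 4, the sketch below slides one small kernel over a time series. It is a simplified, hypothetical example: the queue-length values and the fixed kernel are made up, the function name `conv1d` is an assumption, and in an actual CNN the kernel entries would be learned rather than set by hand.

```python
def conv1d(series, kernel, stride=1):
    """Slide one shared kernel over a 1-D series (no padding, cross-correlation form)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, series[i:i + k]))
        for i in range(0, len(series) - k + 1, stride)
    ]

# Hypothetical per-interval queue lengths of a network link.
queue_len = [2, 2, 3, 7, 8, 8, 4, 1]
# The same three weights are applied at every position; this kernel responds
# to increases (positive output) and decreases (negative output).
trend_kernel = [-1, 0, 1]
response = conv1d(queue_len, trend_kernel)   # [1, 5, 5, 1, -4, -7]
```

The layer uses only three parameters regardless of the series length, whereas a fully connected layer mapping the 8 inputs to the 6 outputs would need 48 weights.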

6) Transformers
In many ML domains with complex long-range dependencies within data points, the attention mechanism [92] and its implementation in the Transformer architecture [93] have proven to be extremely powerful. Works like [94], [95] show that Transformers can outperform both CNNs and RNNs in problems with spatial and temporal data as each token/component of the input can relate to any other component. While we refer to [93] for a detailed explanation of the attention mechanism, it is worth noting that Transformers are a special case of GNNs operating on a fully-connected computation graph [96]. This implies that for large inputs, using Transformers is computationally intensive.

7) Large Language Model (LLM)
Large Language Models (LLMs) have recently gained significant attention in the field of Natural Language Processing (NLP). These models are trained on vast amounts of text data and can generate human-like text based on their input. One popular type of LLM is the generative pre-trained transformer (GPT), which uses a transformer-based architecture and is pre-trained on large amounts of text data using a self-supervised learning approach. During pre-training, the model learns to predict the next word in a sentence, which enables it to generate coherent and contextually relevant text. GPTs can be fine-tuned on specific NLP tasks, such as text classification, summarization, and translation, by adding a task-specific output layer and training on a smaller dataset. Unlike traditional NLP models that rely on hand-crafted features, LLMs learn to represent the meaning of words and phrases in a continuous vector space, enabling them to perform a wide range of NLP tasks. In the context of computer networks, LLMs and GPTs have been used, for example, to generate synthetic network traffic [97], to explain decisions in intrusion and anomaly detection systems [98], [99] and for managing networks [100]. For an overview of applications, techniques, and challenges, we refer to [101].

8) Generative AI (GenAI)
Generative AI (GenAI) is a broader concept that can apply to any type of data [102]. It uses ML models, such as GPTs, GANs, and/or others, to learn the patterns and structure of the given training data, and can then be used to generate realistic and novel outputs that are similar but not identical to the data. Additionally, and closely related, GenAI can be used with retrieval-augmented generation (RAG) [103] to automate collecting information from the network, analyzing it, and pushing a new configuration if necessary [104]–[106]. This removes the burden of studying new documentation or writing new scripts, and simplifies the user interaction. Recent GenAI models have shown impressive and/or human-like capabilities in an unprecedented range of downstream tasks. As a consequence, several companies in the networking industry have started to develop or adjust commercial products that leverage GenAI, e.g., to generate threat intelligence reports [107], security policies, and incident response plans [104], and to proactively identify and fix network issues [105].

C. DEEP REINFORCEMENT LEARNING (DRL)
Deep Reinforcement Learning (DRL) refers to the use of DNNs as function approximators for RL algorithms. The general idea of RL with function approximation has been briefly described in Section II-D2. With the advent of deep learning libraries such as Keras and TensorFlow (see Section IV), as well as standardized APIs such as Gymnasium (formerly known as OpenAI Gym), training of DRL algorithms has become very accessible. However, the success of RL with DNNs⁴ relies on some key techniques. This subsection focuses on the most important DRL algorithms and the tools and techniques to train DRL models.

⁴ Historically, it was known that training RL with a NN could potentially lead to systematic overestimation of utility values (such as the Q-function) and thus to failed learning [108].

First, we will explain two key techniques for DRL based on Deep Q-Learning, also known as Deep Q-


Networks (DQN) [109]. DQN is a DRL algorithm for MDPs with finite action space. DQN seeks to approximate the optimal Q-function by a DNN $Q_\theta(s, a)$ with parameters θ. Specifically, a DQN takes a state s as input and outputs $Q_\theta(s, a)$ for every action a of the finite number of actions. The key techniques introduced for DQN are an experience replay buffer and a so-called target network. During training, DQN interacts with its environment, generating data tuples (s, a, r, s′). These data tuples are stored in an experience replay buffer. In each training step, DQN samples a mini-batch from this memory and applies a stochastic gradient descent step on the average squared Bellman error of the samples from the mini-batch. This rather simple technique reduces the bias of Q-Learning towards its recent interaction with the environment and thereby helps to stabilize training. In NN terminology, the right-hand side of the Bellman loss, i.e., $r + \gamma \max_{a'} Q_\theta(s', a')$, is the training target for $Q_\theta(s, a)$ given the data tuple (s, a, r, s′). In other words, the DQN itself is used to compute its training targets. The idea behind target networks is to use a separate target network $Q_{\theta'}(s, a)$ to compute the aforementioned training targets. The target parameters θ′ are then chosen to track the actual training parameters slowly. With this, target networks provide more stable training targets, which has been shown to generally improve DRL training, see [109], [110]. However, more recent theoretical and numerical studies suggest that gradient clipping is superior to the use of target networks [111].
DQN is also an integral component of the Deep Deterministic Policy Gradient (DDPG) algorithm [110],

vice coordination [116], scheduling for large-scale networked control systems [117], acoustic sensor networks [118], adaptability of wireless sensor networks [119] and other applications in communications and networking [120].

1) General Advice for Training DRL Agents
Training a DRL agent to successfully solve a given problem can be a challenging task. In this section, we provide some general advice from our experience in the hope of easing this task.
1) It is good practice to normalize the states and actions, e.g., to $[-1, 1]^d$, $d \in \mathbb{N}$. Linear scaling always makes this possible when the state space S is a bounded subset of a real vector space. When S is unbounded, say $\mathbb{R}^d$, one needs to use, e.g., a scaled version of the hyperbolic tangent or the inverse stereographic projection. Such nonlinear transformations, however, change the environment, and the resulting policies may perform poorly in the actual environment if the normalization is not chosen carefully. Ideally, one should aim at linear scaling throughout the state space's "expected" dominant part. As the action space is typically bounded, action normalization is less problematic.
2) Reward normalization should be used even more carefully than state normalization. In general, changing the reward changes the perception of an agent about the environment and results in different learned policies.
3) The design of the reward signal is an integral part
which is one of the most well-known actor-critic algo- of the design of an MDP. One has to craft a reward
rithms for continuous action spaces. In DDPG, a critic function that incentivizes the desired behavior to
is trained using the DQN algorithm, while a determin- get an algorithm to learn the desired goal. Some
istic policy is trained to maximize the approximated additional comments in no particular order: Make
Q-function. DQN and DDPG, in turn, are the basis it easy for an agent to distinguish good from bad
for the two common deep MARL algorithms Indepen- scenarios; Continuous rewards or dense rewards
dent Deep Q-Learning [112] and Multi-Agent DDPG typically make it easier for algorithms to learn; If
(MADDPG) [113]. However, only DDPG has a truly possible, avoid sparse rewards and instead shape the
distributed version that can be run with nearly arbitrary rewards to give gradual feedback; Strictly positive
communication delays over a communication network. rewards incentive agents to avoid terminal states;
This is known as the Distributed DDPG (3DPG) algo- Strictly negative rewards incentive agents to reach
rithm [114]. terminal states.
Another important technique for successful DRL 4) It should be avoided to train DRL models with
training was proposed as part of the deep actor-critic drop-outs. Drop-outs is a regularization technique
algorithm Asynchronous-Advantage-Actor-Critic (A3C) that was introduced in [121] to train NN models
[52]. The asynchronous part refers to using several agents with less overfitting while improving the general-
in parallel simulated environments to improve and speed ization. However, this leads to increased training
up DRL training. In other words, the training progress variance, which is generally undesirable for the
of several agents on the same problem is combined to training of DRL.
enhance the training performance. This is especially
important for complex tasks since multiple parallel pro- 2) Algorithm Categorization
cessors can significantly reduce the overall training time. The sheer amount of available DRL algorithms can
The success of DRL has been demonstrated across var- be overwhelming for starters in the field, making it
ious sub-areas in computer networks like management of challenging to find appropriate algorithms for a given
satellite-terrestrial networks [115], multi-objective ser- problem. To ease the algorithm selection, we provide

categorizations of widely used single-agent and multi-agent DRL algorithms in Figure 5 and Figure 6, respectively. We note that the tree structures are simplified. For example, the model-based algorithms in Figure 5 can further be classified into value-based, policy-based, actor-critic and on/off-policy algorithms. Furthermore, only selected and widely used algorithms are shown. These categorizations should serve as starting points. The final algorithm selection for a specific problem should also consider additional factors such as sampling efficiency, algorithm stability and exploration strategy.

Single-Agent-DRL Algorithm Categorization: Single-agent DRL algorithms can be coarsely categorized by their supported action space (discrete/continuous), by whether they are model-based or model-free, and by whether they are value-based, policy-based or a combination of both, called actor-critic. Considering the tree structure in Figure 5, it can be seen that some algorithms (e.g., A2C/A3C, SAC, PPO) can be used for both discrete and continuous action spaces, while others, such as DQN and DDPG, are only compatible with one of them.

MARL algorithms can generally be categorized based on the same factors as single-agent DRL algorithms. However, additional multi-agent-specific factors can be included. These are mainly centralized/decentralized learning and cooperative/independent learning. To preserve clarity, some of the traditional single-agent factors have been omitted in Figure 6.

IV. DATASETS, TOOLS AND FRAMEWORKS
Now that we have discussed what ML is and its potential applications, we introduce the most popular datasets in the field of networks, as well as emulators and simulators that can be used to run ML experiments. Since the parameters of ML models are learned from data, the datasets used are crucial in accomplishing the intended task, such as network latency prediction or decision-making for traffic routes. Additionally, ML models need to be tested before being applied in a production environment. Thus, well-known network tools and frameworks can aid in prototyping, tracking, and evaluating these models.

A. DATASETS
Datasets are usually not plug-and-play and require preprocessing. The type of preprocessing required depends on the specific problem being addressed and the type of data being used. In general, preprocessing includes the following steps:
• Data cleaning: This involves removing any missing, inconsistent, or irrelevant data to ensure the quality of the data being used for training.
• Data normalization: This involves transforming the data into a common scale, such as normalizing the values between 0 and 1, to ensure that no variable has an undue influence on the model.
• Data selection: This involves selecting the relevant features or variables from the dataset that are most important for the problem at hand. This step reduces the dimensionality of the data, which can improve the performance of the model. By removing irrelevant or redundant features, it can also speed up the training process and reduce the computational resources required for analysis.
• Data transformation: This involves transforming the data into a format suitable for the ML algorithm being used, such as converting categorical variables into numerical values using one-hot encoding. One-hot encoding generates a vector whose length corresponds to the number of categories in the dataset; the entry of the category a data point belongs to is set to 1, and all other entries are set to 0.
• Data splitting: This involves splitting the dataset into a training set to train the model, a validation set to evaluate its performance during training, and a test set to evaluate the model's performance after training.

It is important to note that the specific preprocessing steps required may vary depending on the dataset, the problem being addressed, and the type of ML model used. The preprocessing steps should be chosen carefully to ensure that the data is suitable for training and that the model can accurately represent the underlying relationships in the data. In the following, we present the most popular network domain datasets in the literature for different applications.

1) Mobile network throughput datasets
A common problem in networking research is replicating realistic network conditions, especially throughputs. Dynamic Adaptive Streaming over HTTP (DASH) is one such exemplary research area. Depending on the mobile network, different datasets containing traces of real-world measurements have been created in order to allow for a better comparison between different research approaches.

For 3G mobile networks, the dataset by Riiser et al. [122] is widely used [123]. It contains 86 traces from measurements conducted on commute paths in Oslo, Norway, using six different mobility patterns (cf. Table 5). Besides the download throughput, it also contains the GPS latitude and longitude coordinates of the measurement device.

For 4G mobile networks, the dataset by Van Der Hooft et al. [124] (we call it 4G_a in this paper) is commonly used [125]. It contains traces of 40 measurements with different mobility patterns (cf. Table 5) conducted in Ghent, Belgium. Like the 3G dataset, it contains the download throughput as well as the GPS coordinates of the measurement device.
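As a minimal sketch of the preprocessing steps listed in Section IV-A, the following Python snippet applies min-max normalization, one-hot encoding, and a train/validation/test split to a toy throughput trace. The sample values, the mobility labels, the function names, and the 60/20/20 split ratio are illustrative assumptions and are not taken from any of the datasets discussed here.

```python
def min_max_normalize(values):
    """Linearly scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Return the sorted category list and one 0/1 vector per label."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for label in labels:
        vec = [0] * len(categories)
        vec[index[label]] = 1
        vectors.append(vec)
    return categories, vectors


def split(samples, train=0.6, val=0.2):
    """Chronological split without shuffling; the remainder is the test set."""
    n_train = int(len(samples) * train)
    n_val = int(len(samples) * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


# Hypothetical 1-second throughput samples (Mb/s), each with a mobility label.
throughput = [2.0, 4.0, 6.0, 8.0, 10.0, 0.0, 5.0, 3.0, 7.0, 9.0]
mobility = ["bus", "bus", "car", "car", "car",
            "train", "train", "bus", "car", "train"]

scaled = min_max_normalize(throughput)          # data normalization
categories, encoded = one_hot(mobility)         # data transformation
samples = [[s] + e for s, e in zip(scaled, encoded)]
train_set, val_set, test_set = split(samples)   # data splitting
```

For time series such as throughput traces, a chronological split is one reasonable choice because it keeps later measurements out of the training set; for other data types, a shuffled split may be more appropriate.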

[Figure 5: tree diagram]
Single-Agent RL
  Discrete Action Space
    Model-based: MCTS, I2A
    Model-free
      Value-based (Off-Policy): DQN, DDQN
      Policy-based (On-Policy): REINFORCE, TRPO, PPO, DPPO
      Actor-Critic: A2C/A3C (On-Policy); SAC (Off-Policy)
  Continuous Action Space
    Model-based: MPC, GPS
    Model-free
      Value-based (Off-Policy): NAF
      Policy-based (On-Policy): TRPO, PPO, DPPO
      Actor-Critic: A2C/A3C (On-Policy); DDPG, TD3, SAC (Off-Policy)

FIGURE 5: Categorization of well-used Single-Agent Deep Reinforcement Learning (DRL) algorithms.
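To make one entry of Figure 5 concrete, the following Python sketch illustrates a DQN-style update with an experience replay buffer and a separate set of target parameters. It is a simplified illustration under stated assumptions, not the reference implementation of [109]: a linear function stands in for the DNN, the target parameters track the training parameters through a soft update (one possible way of tracking them slowly), each stored tuple carries an additional `done` flag for terminal transitions, and all constants and transitions are hypothetical.

```python
import random
from collections import deque

GAMMA = 0.9   # discount factor (assumed value)
LR = 0.05     # SGD step size (assumed value)
TAU = 0.01    # rate at which theta' tracks theta (assumed value)


def q_values(theta, s):
    """Linear stand-in for the DNN: Q_theta(s, a) = theta[a] . s for each a."""
    return [sum(w * x for w, x in zip(row, s)) for row in theta]


def dqn_step(theta, theta_target, batch):
    """One SGD step on the average squared Bellman error of a mini-batch."""
    grad = [[0.0] * len(row) for row in theta]
    for s, a, r, s_next, done in batch:
        # The training target is computed with the target parameters theta'.
        target = r if done else r + GAMMA * max(q_values(theta_target, s_next))
        error = q_values(theta, s)[a] - target
        for i, x in enumerate(s):
            grad[a][i] += 2.0 * error * x / len(batch)
    for a, row in enumerate(theta):
        for i in range(len(row)):
            row[i] -= LR * grad[a][i]
    # Soft update: theta' slowly follows theta.
    for a, row in enumerate(theta_target):
        for i in range(len(row)):
            row[i] += TAU * (theta[a][i] - row[i])


# Toy usage: 2 actions, 2-dimensional states, hand-made transitions.
random.seed(0)
theta = [[0.0, 0.0], [0.0, 0.0]]
theta_target = [row[:] for row in theta]
replay = deque(maxlen=1000)  # replay buffer of (s, a, r, s', done) tuples
replay.extend([
    ([1.0, 0.0], 0, 1.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1, 0.0, [1.0, 0.0], True),
    ([1.0, 0.0], 1, -1.0, [1.0, 0.0], False),
])
for _ in range(200):
    batch = random.sample(list(replay), 2)
    dqn_step(theta, theta_target, batch)
```

In this toy problem, action 0 in state (1, 0) is rewarded and action 1 is penalized, so the learned value of action 0 ends up above that of action 1. In practice, the transitions would come from interaction with an environment and the linear model would be replaced by a DNN.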

[Figure 6: tree diagram. Multi-Agent RL algorithms are split by action space (discrete/continuous), then by centralized/decentralized learning, then by cooperative/independent learner. Algorithms shown: Centralized Multi-Agent MCTS, Centralized Multi-Agent DDPG, QMIX, VDN, IQL-CT, MADDPG, Multi-Agent MCTS, IQL, CMAMPC, CMAPO, IDDPG-CT, MAPO, DMAMPC.]

FIGURE 6: Categorization of well-used Multi-Agent RL (MARL) algorithms.

Another widely used [126] dataset for 4G networks was created by Raca et al. [127] (we call it 4G_b in this paper). A total of 135 measurements were conducted in Ireland. In comparison to the 4G_a dataset, this one is larger, also contains different mobility patterns (cf. Table 5), and contains significantly more metrics, such as the download and upload throughput, additional channel-related metrics, context-related metrics, and cell-related metrics.

Farthofer et al. [128] describe an LTE dataset intended for use with ML. The dataset was measured on an Austrian highway and contains over 2000 measurement points per month over a time period of two years. Additionally, the dataset includes signal parameters such as SINR, RSSI, and RSRP as well as GPS data, time, data rate, etc., and it is published at CRAWDAD⁵.

5 https://www.crawdad.org/

Raca et al. also created a widely used [129] dataset for 5G networks [130]. It contains 83 traces of measurements in Ireland with two different mobility patterns (cf. Table 5). The measurement setup and dataset structure are comparable to their 4G_b dataset. In addition, it also contains ping statistics.

Table 5 provides a comparative overview of the presented datasets.


TABLE 5: Overview of throughput datasets for 3G, 4G, and 5G mobile networks.

| | 3G (Year 2013) [122] | 4G_a (Year 2016) [124] | 4G_b (Year 2018) [127] | 5G (Year 2020) [130] |
|---|---|---|---|---|
| Measurement method | HTTP video streaming | HTTP file download | TCP file download, Youtube streaming | TCP file download, Netflix video streaming, Amazon video streaming |
| Number of traces | 86 | 40 | 135 | 83 |
| Trace lengths (min; max) | 195 s; 12224 s | 166 s; 758 s | 383 s; 11141 s | 266 s; 7752 s |
| DL throughput (min; max) | 0; approx. 9 Mb/s | 0; approx. 111 Mb/s | 0; approx. 173 Mb/s | 0; approx. 533 Mb/s |
| UL throughput (min; max) | / | / | 0; approx. 4 Mb/s | 0; approx. 7 Mb/s |
| Measurement interval | 1 s | 1 s | 1 s | 1 s |
| Mobility patterns | Metro, Tram, Train, Bus, Ferry, Car | Foot, Bicycle, Bus, Tram, Train, Car | Static, Pedestrian, Car, Bus, Train | Static, Car |
| Logged metrics | GPS latitude, longitude; DL throughput | GPS latitude, longitude; DL throughput | 19 metrics, e.g.: GPS longitude, latitude; DL & UL throughput; RSSI, RSRP, RSRQ; CQI, SNR | 25 metrics, e.g.: GPS longitude, latitude; DL & UL throughput; RSSI, RSRP, RSRQ; CQI, SNR |
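Throughput traces with a fixed measurement interval, such as those summarized in Table 5, are often replayed to emulate network conditions. The following Python sketch shows one simple way this might be done: it computes how long a download of a given size takes when the available throughput follows a trace. The function name, the cyclic repetition of the trace, and the sample values are assumptions made for illustration only.

```python
def download_time(trace_mbps, size_mbit, start_idx=0, interval_s=1.0):
    """Seconds needed to download size_mbit, starting at trace index start_idx.

    trace_mbps[i] is the throughput (Mb/s) during the i-th measurement
    interval. The trace is repeated cyclically if the download outlasts it
    (an assumption of this sketch).
    """
    if size_mbit <= 0:
        return 0.0
    if max(trace_mbps) <= 0:
        raise ValueError("trace contains no positive throughput sample")
    remaining = size_mbit
    elapsed = 0.0
    i = start_idx
    while True:
        rate = trace_mbps[i % len(trace_mbps)]
        capacity = rate * interval_s  # Mbit deliverable in this interval
        if capacity >= remaining:
            # The download finishes partway through this interval.
            return elapsed + remaining / rate
        remaining -= capacity
        elapsed += interval_s
        i += 1


# Hypothetical trace with a 1 s interval, including a short outage (0 Mb/s).
trace = [4.0, 0.0, 2.0, 8.0]
segment_time = download_time(trace, size_mbit=8.0)
```

Here the 8 Mbit download receives 4 Mbit in the first second, nothing in the second, 2 Mbit in the third, and the remaining 2 Mbit in a quarter of the fourth second.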

2) Routing
As routing is an important part of networking, having a real-world dataset for it can be beneficial for training ML models, evaluating their performance and testing their robustness in the case of network failures and other real-world issues. In the following, we discuss some of these datasets.

The Abilene dataset [131] is a real-world network trace that captures the communication patterns of a backbone network. It was collected by the Abilene project, which was a collaboration between researchers from the University of California, San Diego, and the University of Kansas. The Abilene dataset provides information about the communication patterns of a backbone network that connects several research institutions in the United States. It includes information about the network topologies, routing algorithms, and traffic patterns of the network. It contains information about the routes taken by packets, the number of packets sent and received, and the size of the packets. The dataset is commonly used for research in the areas of network routing, network management, and network performance evaluation.

The Global Environment for Network Innovations (GENI) dataset [132] provides network traces from real-world deployments. GENI is a large-scale research infrastructure that provides a platform for conducting experiments and evaluating new network technologies and protocols. The dataset includes network traces that were collected from a variety of testbeds and networks, including campus networks, data centers, and wide-area networks. It includes information about the network topology, routing algorithms, and traffic patterns of the network. It also contains information about the routes taken by packets, the number of packets sent and received, and the size of the packets. The dataset is commonly used for research in the areas of network routing, network management, and network performance evaluation.

The Cooperative Association for Internet Data Analysis (CAIDA) anonymized Internet traces dataset is a collection of network traces that were collected by the CAIDA project [133]. CAIDA is a non-profit research organization that collects and analyzes data about the Internet to gain insights into its structure and behavior. It includes data from a variety of sources, including routers, switches, and end hosts. The data includes information about the network topology, routing algorithms, and network traffic patterns. Additionally, it contains information about the routes taken by packets, the number of packets sent and received, and the size of the packets. The data is collected using active and passive measurements:

Active measurement involves actively sending test packets or requests to a network, and then analyzing the resulting responses to gain insight into the network behavior and performance. Examples of active measurement techniques include pinging, tracerouting, and bandwidth testing. Active measurements are often more accurate and provide more detailed information about the network, but they can also introduce more overhead into the network and disrupt normal network traffic.

Passive measurement involves observing network traffic without actively generating any test traffic. Passive measurements are typically less disruptive to the network and do not introduce any overhead, but they provide a more limited view of the network's behavior and performance. Examples of passive measurement techniques include network traffic analysis, packet capture and analysis, and log file analysis.

The RocketFuel dataset [134] is a collection of network topology data that was collected by the RocketFuel project. The RocketFuel project was a research effort aimed at studying the structure and behavior of the Internet at the level of individual routers and links. The project collected data from several large Internet Service Providers (ISPs) and used it to create a high-resolution map of the Internet. It includes information about the network topology and the paths that packets take through the network. It also includes information about the capacities of the links between routers, as well

as information about the location and characteristics of the routers themselves.

The Internet Topology Zoo [135] is a collection of network topology datasets that provide information about the physical structure of different networks. The datasets in the Internet Topology Zoo come from a variety of sources, including measurements of the Internet, testbeds, and simulations. Its datasets provide information about the connections between nodes in a network, the capacities of the links between nodes, and the characteristics of the nodes themselves, such as their locations and capabilities. One of the main strengths of the Internet Topology Zoo is its comprehensive coverage of different types of networks, including wide-area networks, data centers, and other large-scale networks. This makes it a valuable resource for researchers and practitioners working in the field of network routing and network management, as it provides a diverse set of datasets for evaluating and comparing different algorithms and technologies.

To sum up, the GENI and Abilene datasets primarily focus on network infrastructure, providing researchers access to national research networks. Conversely, CAIDA and RocketFuel are designed to facilitate the measurement and analysis of network traffic and topology. The Internet Topology Zoo, meanwhile, is a collection of publicly available network topologies that researchers can use for various purposes. Thus, the size of the network varies depending on the scope and focus of the dataset. The GENI and Abilene datasets tend to cover larger networks compared to CAIDA and RocketFuel, which prioritize measurement and analysis tools [136]. The CAIDA and RocketFuel datasets use passive measurements, while the GENI and Abilene datasets use both active and passive measurements. The Internet Topology Zoo is a collection of network topologies and does not involve any measurements. Further comparison is shown in Table 6.

3) Dynamic Adaptive Streaming over HTTP (DASH)
Video streaming via Dynamic Adaptive Streaming over HTTP (DASH) is a large research area in networking. Recent rate adaptation algorithms often aim to optimize the user's Quality of Experience (QoE) under the given network conditions, such as a constrained bandwidth [137]. While these algorithms were initially conventional heuristics, DRL-based approaches have recently shown excellent performance and are now considered state-of-the-art [138]. In order to benchmark different solutions, publicly available DASH datasets are often used [138]. Typically, the datasets contain videos that are encoded under a controlled set of parameters, e.g., resolution and bitrate, and split into segments of certain lengths. The solutions are commonly evaluated via simulations where the videos from the DASH datasets are streamed over simulated networks [139]. Realistic network conditions, especially the download bandwidth, are commonly simulated using the network traces from the datasets presented in Section IV-A1 [138]. In the following, we present four commonly used DASH video datasets. Table 7 provides an overview of their most important properties.

The DASH dataset [140] is an old (2012) but still widely used dataset, e.g., to test new QoE schemes [139]. It contains 6 videos of different genres split into segments ranging from 1 to 15 seconds in length.

The Distributed DASH (D-DASH) dataset [141] was published in 2013 and is intended to be used in real-world testbeds. It contains one video that is distributed on servers in Klagenfurt, Paris, and Prague. This enables a client to choose the requested location for each segment individually.

The ultra high definition HEVC DASH dataset [142] was published in 2014 and includes one video. In contrast to the DASH and D-DASH datasets, the video is encoded with the newer and more efficient H.265 (HEVC) video codec. Furthermore, it is encoded in UHD resolution, at 30 and 60 Frames per Second (FPS), and at 8 and 10 bits.

The multi-codec DASH dataset [143] is a rather new dataset from 2018. It consists of 10 videos that are encoded with four different video codecs: H.264, H.265, VP9, and AV1. In addition, three different video FPS are included: 24, 30, and 60.

4) Mobility and Autonomous Vehicles
In the context of mobility or autonomous driving using a wireless network infrastructure (be it cellular or V2X), most of the studies in the literature discussing solutions and their results do not make the datasets publicly available for scrutiny by third parties. As such, the results are difficult to verify and validate properly. Nevertheless, many studies rely on simulated datasets. An open-source traffic simulation software called Simulation of Urban Mobility (SUMO)⁶ provides datasets for simulating realistic urban traffic scenarios. Among the datasets are road networks, traffic demand patterns, and vehicle behavior models, which can be customized for different traffic scenarios and urban environments [144]. Although SUMO is popular among researchers and practitioners in the industry, another software called Multi-Agent Transport Simulation (MATSim)⁷ is often used in academic research. While SUMO focuses on macroscopic traffic flow modeling, MATSim uses an agent-based approach to model individual travel behavior [145]. As a result, MATSim can capture more complex individual decision processes, while SUMO is better suited for overall traffic flow modeling. Another open-source software is CityFlow⁸, which includes a

6 https://sumo.dlr.de/docs/Data/Scenarios.html
7 https://www.matsim.org/open-scenario-data
8 https://github.com/cybercore-co-ltd/track2_aicity_2021


TABLE 6: Overview of available routing datasets

| Point of comparison | Abilene [131] | Global Environment for Network Innovations (GENI) [132] | Cooperative Association for Internet Data Analysis (CAIDA) [133] | RocketFuel [134] | Zoo [135] |
|---|---|---|---|---|---|
| Scope | Abilene network; a major US backbone network in the early 2000s, used as a research and education network. Last updated in Nov 2022 [131] | Various types of networks, including wide-area, data center, and campus networks. The dataset is updated on a regular basis. | Provides data specifically about the Internet, collected through a combination of active and passive measurement techniques. The age of these datasets varies, with some datasets being ongoing and others being completed or one-time snapshots | Provides data about ISP topologies and other types of networks. It was last updated in 2004 for measuring ISP topologies [134] | Provides data about a variety of different types of networks, including data centers and other large-scale networks, obtained from a variety of sources. It is updated on a regular basis, with the most recent update in April 2021. |
| Data Collection | Network measurements and simulations. | Measurements, simulations, and testbeds. | Active and passive measurement techniques. | Measurements and simulations. | Measurements, simulations, and testbeds. |
| Data Availability | Available for research and educational purposes. | Available for research and educational purposes. | Available for research purposes, but access may be restricted for some datasets. | Available for research and educational purposes. | Available for research and educational purposes. |
| Data Format | Standard graph format. | Raw data and processed data. | Raw data and processed data. | Standard graph format. | Standard graph format. |
| Data Accuracy | May vary depending on the source of the data and the methods used to obtain it. | May vary depending on the source of the data and the methods used to obtain it. | Strives to provide accurate and up-to-date data. | May vary depending on the source of the data and the methods used to obtain it. | May vary depending on the source of the data and the methods used to obtain it. |
| Scale | Relatively small-scale network | Data about a large-scale, diverse testbed | Very large and complex networks | A limited number of individual networks | Mix of small- and large-scale networks |
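Topologies that are available in a graph format (cf. the Data Format row of Table 6) can be used directly to evaluate routing decisions, including their robustness to link failures. The following Python sketch runs Dijkstra's algorithm on a small weighted graph before and after a link failure. The four-node topology, its link weights, and the helper names are hypothetical and do not correspond to any of the datasets above.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph: graph[u][v] = weight."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor chain back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


def add_link(graph, u, v, weight):
    """Insert an undirected link with the given weight."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


# Hypothetical topology; weights could represent delay or inverse capacity.
topology = {}
for u, v, w in [("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 5.0),
                ("C", "D", 1.0), ("B", "D", 4.0)]:
    add_link(topology, u, v, w)

path, cost = shortest_path(topology, "A", "D")

# Emulate a failure of link B-C and recompute the route.
del topology["B"]["C"]
del topology["C"]["B"]
backup_path, backup_cost = shortest_path(topology, "A", "D")
```

Such a classical shortest-path computation can serve as a baseline against which learned routing policies are compared on the same topology.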

TABLE 7: Overview of the DASH datasets.

| | DASH (Year 2012) [140] | D-DASH (Year 2013) [141] | HEVC-DASH (Year 2014) [142] | Multi-Codec DASH (2018) [143] |
|---|---|---|---|---|
| Number of Videos | 6 | 1 | 1 | 10 |
| Video Content | Animation, Sport, Movie | Sports | Movie | Animation, Moving head, Moving cars, Nature, Moving jockey, Moving horses, Composite, Rotating wind vanes, Moving yacht |
| Video lengths | 586 s - 5848 s | 5848 s | 142 s | 20 s - 888 s |
| Resolutions | 320x240 - 1920x1080 | 320x240 - 1920x1080 | 1280x720 - 3840x2160 | 256x144 - 3840x2160 |
| Bitrates | 50 kbit/s - 8 Mbit/s | 100 kbit/s - 6 Mbit/s (Video), 64 kbit/s - 165 kbit/s (Audio) | 1.8 Mbit/s - 18 Mbit/s (Video), 128 kbit/s (Audio) | 100 kbit/s - 20 Mbit/s |
| Segment Lengths | 1, 2, 4, 6, 10, 15 s | 2, 4, 6, 8, 10, 15 s | 2, 4, 6, 10, 20 s | 2, 4 s |
| Encoder | H.264 | H.264 | H.265 (Video), HEAAC (Audio) | H.264, H.265, VP9, AV1 |
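As an illustration of the conventional rate adaptation heuristics mentioned in Section IV-A3, the following Python sketch selects the highest bitrate that fits within a conservative throughput estimate. It is a generic throughput-based heuristic under stated assumptions, not one of the cited algorithms: the bitrate ladder, the safety factor of 0.8, and the use of a harmonic mean over recent segment throughputs are all illustrative choices.

```python
def harmonic_mean(samples):
    """Harmonic mean of positive samples; it damps short throughput spikes."""
    return len(samples) / sum(1.0 / s for s in samples)


def select_bitrate(ladder_kbps, recent_throughput_kbps, safety=0.8):
    """Pick the highest bitrate not exceeding safety * estimated throughput."""
    positive = [s for s in recent_throughput_kbps if s > 0]
    if not positive:
        # No usable measurement yet: start with the lowest representation.
        return min(ladder_kbps)
    budget = safety * harmonic_mean(positive)
    feasible = [b for b in ladder_kbps if b <= budget]
    return max(feasible) if feasible else min(ladder_kbps)


# Hypothetical bitrate ladder (kbit/s) and throughputs of the last segments.
ladder = [100, 500, 1000, 2500, 5000, 8000]
choice = select_bitrate(ladder, [4000, 2000, 4000])
```

With these values, the harmonic mean is about 3000 kbit/s, the budget is about 2400 kbit/s, and the 1000 kbit/s representation is selected. DRL-based approaches replace such a fixed rule with a learned policy.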

range of features that are not available in SUMO, such as real-time simulation and the ability to model pedestrian and bicycle traffic [146].

There are also commercial alternatives that provide mobility datasets, such as Aimsun⁹, Vissim¹⁰ and TransModeler¹¹. We present an overview of these datasets in Table 8, while a comparison of the use cases for some of these datasets was shown in [147].

9 https://www.aimsun.com/
10 https://www.ptvgroup.com/
11 https://www.caliper.com/transmodeler/default.htm

5) (Encrypted) Network Traffic Analytics
Another common task for ML in networking is network traffic analytics. This includes the task of traffic/service classification, i.e., identifying an active service or traffic type in the network. Examples of such a task are distinguishing between video and web traffic, between services like YouTube and Netflix, or even between different Android apps. Due to pervasive encryption, for example with TLS, on the application and transport layers, protocols like HTTPS, DNS over TLS (DoT), and QUIC [154] do not expose enough unencrypted data to reliably identify services and traffic types. Instead, new techniques have to be developed which make use of the available unencrypted data. For encrypted network traffic analytics, a common approach is therefore to extract packet sizes, directions, and inter-arrival times as well as potential additional information like port numbers to build features. These features describe the network traffic of a specific service or traffic type [155]–[158]. These features are then fed to ML models to learn specific patterns exhibited by different traffic types or services. Beyond traffic classification, this type of analytics is often used for security-related tasks like intrusion detection or fingerprinting of websites, browsers, devices, and operating systems, and for estimating the QoE of services [159], [160]. Due to the prevalence of those topics, there also exists a variety of datasets for the different network traffic analytics tasks.

An overview on the topic of traffic classification along with a list of existing works (and solutions) and datasets

TABLE 8: Overview of mobility frameworks

Points of comparison SUMO [148] MATSim Aimsun [150] PTV Vissim TransModeler CityFlow
[149] [151] [152] [153]
Focus urban urban with urban and small-scale small towns urban
large-scale non-urban urban and large
transport cities
systems
Modeling microscopic macroscopic microscopic microscopic macroscopic macroscopic
individual travel traffic flow and and
behavior modeling microscopic microscopic
Output inter-modal informa- data on in- data on in- data on in- data on individual ve-
tion on traffic flow dividual vehi- dividual vehi- dividual vehi- individual hicle behav-
and congestion cles cles cles vehicles, GIS ior
mapping
and 3D
visualization
Availability open-source open-source commercial commercial commercial open-source

is provided in [161]. However, many of these datasets are quite old and, thus, outdated. A dataset for encrypted network traffic classification of YouTube and Netflix is provided in [162]. Here, the authors collected three classes of flows, namely, web flows, YouTube flows, and Netflix flows, for the most popular websites and videos, while using different end devices, browsers, and operating systems. A more recent dataset for app traffic classification is Mirage [163]. This dataset was generated using three different mobile devices, which were used by real experimenters (students) once or twice a day. Overall, each experimenter generated 12 captures with a duration of 5 to 10 minutes. Experimenters were instructed to use the apps as they would usually do in their day-to-day life. The resulting dataset consists of 40 Android apps from 16 different categories. Another recent dataset for traffic classification of mobile apps is AppClassNet [164]. It was designed as an ImageNet for encrypted network traffic analytics and, therefore, is significantly larger in terms of tested apps and available samples than other datasets. The corresponding public dataset contains 500 apps with a volume of around 10 TB and stems from passive measurements.
For security tasks, in particular network intrusion detection, a comprehensive survey can be found in [165]. In this survey, the authors describe over 30 datasets and list the corresponding attacks and data types. A variety of datasets for different security tasks is also provided by the Canadian Institute for Cybersecurity, University of New Brunswick (UNB) [166]. They provide more than 25 datasets from different categories. These categories include IoT, dark web, DNS, IDS, traffic classification (web and apps using Tor or VPN), malware, and operational technology. A dataset for fingerprinting of devices and operating systems in the potential presence of VPN is provided in [167]. The dataset contains around 20000 examples suitable for fingerprinting browsers, operating systems, and apps.

B. DATA GENERATION

Using real-world data is often challenging due to its limited availability and applicability. In this section, we explore the use of simulators, emulators, and synthetic methods, such as GANs, for generating data that can be used to train ML models. These approaches can help when datasets are not available or not suitable, and can enable the creation of diverse and complex datasets.

1) Simulation Tools
One of the most important parts of designing a new scheme or protocol is the evaluation process. There are various methods for investigating a newly designed scheme in detail, including real-world experiments, simulation, emulation, and analytical models. Each method has its advantages and disadvantages. Practical tests can yield highly accurate results, but at increased complexity and cost. On the other hand, modeling a new protocol conceptually yields an analytical model, yet the complexity remains a drawback. Moreover, the accuracy can decline because such a model may fail to reflect real-world scenarios. Considering these challenges, simulation and emulation environments can strike a good balance between complexity, cost, and accuracy. They might not depict real-world conditions in every detail, but even so, they are valuable tools that can assist a researcher or developer in developing novel schemes. A thorough analysis of different simulators for networking can be found in [168], [169].

Network Simulator 3
One of the most powerful simulation tools in networking is ns-3 (Network Simulator 3)12 [170], which is a discrete-event open-source simulator under the GNU GPLv2 license. This tool comes with various modules
12 https://www.nsnam.org


TABLE 9: Overview of the encrypted network traffic datasets.

 | YouTube/Netflix (Year 2021) [162] | Mirage (Year 2019) [163] | AppClassNet (Year 2022) [164] | Security (Year 2016) [167]
Number of Classes | Flow Types (3) | Apps (40) | Apps (500) | OS (3), Browser (5), Application (8)
Scope | Flow Classification | App Classification | App Classification | Fingerprinting
Data Availability | Available for research and educational purposes | Available for research and educational purposes | Available for commercial purposes | Available for research and educational purposes
Data Format | Raw data | Raw data | Preprocessed (Numpy) | Raw data

such as Wi-Fi, LTE, or even a recently released mmWave (millimeter wave) [171] module, giving researchers reasonably reliable test environments for newly developed approaches and reducing development time for various kinds of research interests. ns-3 can assist researchers in network performance evaluation; however, by default it does not support ML approaches, so extending it through open-source AI frameworks makes it more useful for ML problems. One such attempt is ns3-gym [172], an extension of ns-3 that connects the simulator to the OpenAI Gym toolkit. This connection is implemented with ZeroMQ sockets as the IPC (Inter-Process Communication) method. Moreover, the suitability of OpenAI Gym for reinforcement learning is favorable, as it is a widespread library, much like TensorFlow and Scikit-Learn. ns3-gym aims at easing network prototyping that employs reinforcement learning. The module improves scalability, which is important for running several ns-3 instances, and makes it feasible to convert and deploy ns-3 scripts in OpenAI Gym. Furthermore, debugging and using the module remains uncomplicated for users, as it is a conventional ns-3 module consisting of two main blocks, OpenAI Gym and ns-3, that interact with each other. Another interface extension that bridges ns-3 and Python-side ML implementations is ns3-ai [173], which claims to greatly increase the interaction speed by facilitating communication through a shared memory block.

OMNeT++
Another popular discrete-event network simulator is OMNeT++13, which can be used free of charge for academic and educational purposes under a license with rights similar to the GPL14, but requires a paid license for commercial use. While OMNeT++ itself only contains the core simulation framework, various models can be added via external frameworks. The most important one is the INET Framework, which is maintained by the OMNeT++ core team and provides models for network standards like IEEE 802.3 and IEEE 802.11 as well as higher layer protocols like IP, UDP, and TCP.
In terms of ML, Veins-Gym [174] exposes an OMNeT++ simulation as an OpenAI Gym environment, analogous to ns3-gym for ns-3. Despite its name, Veins-Gym can be used not only in combination with the Veins framework but also with any OMNeT++ simulation. An overview with examples of how to use different ML frameworks such as TensorFlow in OMNeT++ can be found at https://github.com/ComNetsHH/omnetpp-ml.

2) Emulators
A network emulator, unlike a simulator, creates a virtual copy of a physical device, including all hardware and software configurations, to functionally replace it. Hence, emulation is more accurate than simulation, but also more expensive in terms of computation resources. There are many network emulation tools, including but not limited to:
• Mininet15: a Python-based tool focused on emulating software-defined networks (SDNs) using OpenFlow switches.
• GNS316: supports a wide range of network devices and protocols using virtual machines and real devices.
• Mahi17: a lightweight network emulator that is designed to emulate low-bandwidth networks with high latency.
• WANEM18: a Linux-based tool that can be used to emulate various network conditions such as latency, packet loss, and bandwidth limitations in WAN.
• TENSE19: a VM tool that can be used to generate emulated network traffic for security evaluation purposes. It can generate various types of traffic, such as HTTP, FTP, SMTP, etc.
• CORE20: similar to GNS3 but with further emulation capabilities beyond traditional networks, such as SDN and virtualization technologies.

13 https://omnetpp.org/
14 https://omnetpp.org/intro/license
15 http://mininet.org/
16 https://docs.gns3.com/
17 https://manpages.org/mahimahi/
18 https://github.com/PJO2/wanem
19 https://github.com/vmware/te-ns
20 http://coreemu.github.io/core/


• FlowEmu21: a modular network link emulator with a user interface inspired by flow-based programming that integrates TensorFlow for writing custom ML modules.
Many of these tools have been used for training and evaluating ML algorithms. For example, SDWAN-gym22 and IROKO [175] are Python-based platforms built on top of Mininet for training and evaluating reinforcement learning algorithms in software-defined WANs and data centers, respectively. Emulated data is often mixed with real data to obtain a large, reliable dataset. Many cybersecurity datasets adopt this approach, such as those in the Canadian Institute for Cybersecurity database23. They provide the “CICIDS2017” dataset, which contains labeled network flows with full packet payloads in PCAP format, for ML and deep learning purposes. They also provide the AndMal 2020 dataset to identify and classify Android malware based on ML.

3) Synthetic
Synthetic data is needed because it can help to overcome the lack of up-to-date real-world data and privacy constraints, which limit the development of new models. In addition, synthetic data can provide an efficient mechanism to overcome the lack of labeled datasets and post-processing overhead. In the context of network traffic analysis, synthetic data can be used, for example, to train ML models to detect cyber-attacks and resolve network congestion as well as other performance issues.
SynGAN (Synthetic Generative Adversarial Network) [176] is a packet-level GAN designed to generate synthetic traffic data. It generates synthetic packets that closely resemble real-world traffic by simultaneously training the generator and discriminator networks. The generator network takes random noise as input and produces synthetic network traffic data as output, while the discriminator network distinguishes between synthetic and real data. Adversarial training is intended to make the synthetic data produced by SynGAN representative of real network traffic.
To make sure that the generated data satisfies certain constraints, PAC-GAN (Projection Adversarial Constraint GAN) [91] uses a projection operator to map the generated data onto a feasible set that satisfies the desired constraints. In addition to the standard GAN loss, PAC-GAN uses a constraint loss to ensure that the generated data is not only realistic but also satisfies the desired constraints.
Another type of traffic generator is flow-based GANs, which, unlike packet generators that focus on individual packets, generate flows of packets that share common characteristics, such as source and destination IP addresses, source and destination ports, and protocol type. Additionally, they can reduce the amount of data that needs to be generated by generating a single flow instead of multiple individual packets.
The authors in [177] propose different preprocessing approaches for transforming IP addresses of flows into a continuous feature, since GANs can only process continuous features. Then, they use domain knowledge, such as packet size, inter-arrival time, and flow duration distributions, to evaluate the quality of the generated data. Another example is MAIGAN (Massive Attack Generator via GAN) [178], which generates synthetic network traffic that mimics various types of cyber attacks, including Distributed Denial of Service (DDoS) attacks, port scanning attacks, and brute-force attacks, and which is able to bypass black-box ML-based detection models24.
To handle the problem of imbalanced traffic classification, i.e., the data used for training the classification model contains a disproportionate number of samples from one class compared to the others, ITC-GAN [179] uses a modified GAN architecture with a class-balancing loss, based on the inter-packet time characteristics, which helps to balance the number of samples from each class during training and remove the bias in the classification model.
The work in [180] discusses other GAN models for network traffic generation, including Facebook Chat GAN [181], which generates chat message sequences based on Facebook Messenger data, ZipNet GAN [182], which generates compressed network packets using Huffman coding, and PcapGAN [183], which generates network packet captures (pcap files) by learning from real-world pcap files.

C. MACHINE LEARNING TOOLS
Selecting the right tools to solve ML problems can be challenging. This section gives an overview of the most commonly used Python-based tools for classic ML, deep learning, and reinforcement learning. Furthermore, we provide guidelines for when to use which tool.

1) Classic Machine Learning
Scikit-learn25, also known as sklearn, is a machine-learning library for Python that provides a wide range of tools for data analysis and modeling. It is built on top of other popular Python libraries such as NumPy26 and SciPy27 and is designed to be easy to use and efficient. Sklearn provides a variety of ML algorithms for various tasks such as classification, regression, clustering, and dimensionality reduction. These algorithms are carefully designed to be intuitive, fast, and efficient, allowing for quick prototyping and testing of ML models. Some of
21 https://github.com/ComNetsHH/FlowEmu
22 https://github.com/amitnilams/sdwan-gym
23 https://www.unb.ca/cic/datasets/index.html
24 https://github.com/JayWalker512/PacketGAN
25 https://scikit-learn.org/stable/
26 https://numpy.org
27 https://scipy.org


the commonly used algorithms available in sklearn include support vector machines, decision trees, k-nearest neighbors, logistic regression, and random forests. It also includes tools for model selection, preprocessing, and evaluation that enable researchers to preprocess data and select the right model for a problem. Some of the popular tools available in sklearn include cross-validation, grid search, PCA, and feature selection. A key advantage of sklearn is its wide range of algorithms for different tasks, which are designed to be easy for beginners to use when getting started with ML.

2) Deep Learning
Keras28: Keras is an open-source library for deep learning in Python that provides a high-level API for building, training, and evaluating DL models. A key advantage of Keras is its simplicity: the high-level API abstracts away the complexities of building and training deep learning models, makes it easy to create and experiment with different neural network architectures, and offers a wide range of pre-trained models. Keras also provides a wide range of pre-built layers, optimizers, and other building blocks that can be easily combined to create complex models. Keras can also be run on top of several backends, including TensorFlow, Theano, and CNTK. Keras is best suited for building simple to medium-complexity neural networks and deep learning models.
TensorFlow29: TensorFlow is an open-source library for ML and deep learning developed by Google. It is widely used for a variety of tasks, such as image and speech recognition, natural language processing, and for training and deploying large-scale ML models. The key advantages of TensorFlow are flexibility and scalability. It allows for the building and training of complex models, including deep neural networks, and can run on a variety of platforms, including CPUs, GPUs, and TPUs. TensorFlow can be used for distributed training across multiple machines, making it well-suited for large-scale ML tasks. The TensorFlow ecosystem includes a wide range of tools and libraries for tasks such as data pre-processing, model visualization, and deployment. TensorFlow is a good choice for building complex deep-learning models that require a high degree of customization and is suitable for both research and production environments.
PyTorch30: PyTorch is an open-source ML library for Python that is primarily developed by Facebook’s AI research group. A distinctive feature of PyTorch is its dynamic computational graph, which allows for more flexibility in building and modifying models compared to the static computation graph used in other libraries, such as TensorFlow. This feature facilitates the easy implementation of advanced deep-learning models, including those with conditional logic, loops, and dynamic inputs. PyTorch is designed to be modular and flexible, with a wide range of building blocks (e.g., layers, activations, loss functions, and optimizers) that can be used to create custom deep-learning models of various complexity levels. Furthermore, PyTorch supports distributed training, allowing for the efficient use of multiple GPUs and machines. PyTorch is suitable for building complex deep learning models, and its dynamic computational graph makes it easy to write and debug code.

3) Reinforcement Learning
Many tools for RL are built on top of classical ML and deep learning tools to support various algorithms.
OpenAI Gym / Gymnasium: OpenAI Gym31 (more recently continued as Gymnasium32 by the Farama Foundation) is an open-source Python library that provides a standardized API for the interaction between RL algorithms and environments. Additionally, it includes a wide range of environments of different complexities, including classic control tasks, Atari games, robotic simulations, as well as physical simulations. This allows researchers to reproducibly benchmark RL algorithms on a standardized set of environments. Furthermore, Gym can be extended by custom environments, allowing users to easily compare the performance of different RL algorithms for customized problems.
One challenge of RL research is that different implementations of the same RL algorithm can perform significantly differently in the same environment, which shows that RL algorithms are highly sensitive not only to hyperparameters but also to small implementation details [184].
Stable Baselines3: Stable Baselines3 (SB3) [185] is an open-source Python library that contains reference implementations of seven widely used DRL algorithms. Tab. 10 lists all supported algorithms. The performance of those algorithms has been thoroughly tested. The library is compatible with the OpenAI Gym/Gymnasium API, enabling users to train RL agents in just a few lines of code. Moreover, the library supports custom Gym environments, custom policies for the algorithms, TensorBoard, as well as data logging customization through custom callbacks.
Additional RL algorithms are implemented in the Stable Baselines3 Contrib (SB3-Contrib)33 package. These are implementations of newly published algorithms. They are less tested and therefore considered experimental.
28 https://keras.io/
29 https://www.tensorflow.org/
30 https://pytorch.org/
31 https://www.gymlibrary.dev
32 https://gymnasium.farama.org
33 https://sb3-contrib.readthedocs.io/en/master/
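
The standardized interaction between RL algorithms and environments described for OpenAI Gym/Gymnasium above can be illustrated with a minimal sketch. It deliberately does not import Gymnasium; it only mirrors the reset/step calling convention in plain Python so that it runs without extra dependencies. The environment LinkSelectEnv, its link success probabilities, and the helper run_episode are hypothetical names invented for this illustration and are not taken from any of the cited tools.

```python
import random


class LinkSelectEnv:
    """Hypothetical toy environment that mirrors the Gymnasium calling convention.

    At every step the agent picks one of two links; each link delivers a
    packet with a fixed (assumed) probability. The episode is truncated
    after `horizon` steps.
    """

    def __init__(self, success_prob=(0.9, 0.5), horizon=20):
        self.success_prob = success_prob
        self.horizon = horizon
        self.rng = random.Random()
        self.t = 0

    def reset(self, seed=None):
        # Gymnasium-style reset: returns (observation, info).
        if seed is not None:
            self.rng = random.Random(seed)
        self.t = 0
        return 0, {}

    def step(self, action):
        # Gymnasium-style step: returns
        # (observation, reward, terminated, truncated, info).
        delivered = self.rng.random() < self.success_prob[action]
        self.t += 1
        observation = int(delivered)         # outcome of the last transmission
        reward = 1.0 if delivered else 0.0   # one unit of reward per delivered packet
        terminated = False                   # this toy task has no terminal state
        truncated = self.t >= self.horizon   # time limit reached
        return observation, reward, terminated, truncated, {}


def run_episode(env, policy, seed=0):
    """Run one episode and return (total reward, number of steps)."""
    obs, _info = env.reset(seed=seed)
    total, steps, done = 0.0, 0, False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total += reward
        steps += 1
        done = terminated or truncated
    return total, steps


if __name__ == "__main__":
    env = LinkSelectEnv()
    always_first_link = lambda obs: 0
    print(run_episode(env, always_first_link, seed=42))
```

Because the agent only relies on reset and step, the same loop could in principle drive any environment that follows this convention, which is the property that libraries such as SB3 build on.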


RL Baselines3 Zoo: RL Baselines3 Zoo34 is a Python library that provides pre-trained agents and a set of optimized hyperparameters for the algorithms from SB3 and the Gym environments. Moreover, it provides useful helper scripts for training and evaluating agents, for tuning hyperparameters, and for plotting results.
CleanRL: CleanRL [186] is a DRL framework that provides thoroughly benchmarked single-file Python implementations of eight DRL algorithms (cf. Tab. 10). Its goal is to provide researchers full control over an algorithm in a single file, making it easier to 1) fully understand all implementation details, and 2) quickly prototype novel DRL features. In addition, it provides support for TensorBoard. In comparison to SB3, CleanRL does not provide a high-level user-friendly API for model training. It is instead tailored to provide a development environment for DRL researchers with implementations that are easy to read, debug, modify, and study. The desired workflow is to first prototype new RL ideas in CleanRL and afterwards port them to a library offering a higher-level API like SB3.
OpenAI SpinningUp: OpenAI SpinningUp35 is a great resource for aspiring researchers and practitioners who are excited to apply DRL to their problems but are overwhelmed by the implementation complexity of algorithms in frameworks like Stable Baselines3. It provides detailed explanations of the most important concepts of DRL, as well as explanations and implementations of key DRL algorithms. The algorithm implementations specifically focus on simplicity with the aim of being easy to follow for people new to the field. This simplicity is achieved by narrowing down the implementations to the core concepts of the algorithms, and by omitting more complex features that can significantly improve the algorithm’s performance. As a result, OpenAI SpinningUp should be primarily seen as a resource for education that should not be used in production systems.
PettingZoo: PettingZoo36 is an open-source Python library that contains a set of environments for multi-agent reinforcement learning. While it is similar to OpenAI Gym/Gymnasium in its functionality and API, the application scenario of MARL is different from the one of single-agent RL. Among others, it contains multi-agent environments of Atari games and classic games like chess and Go. Furthermore, it can be extended by custom environments.
Ray RLlib: Ray RLlib [187] is an open-source Python library for RL. Out of the RL libraries presented in this section, it is the most comprehensive one. It supports a wide range of performance-tested RL algorithms, offers a high-level user-friendly API to train agents, supports single-agent, multi-agent, and custom environments, offers high scalability by supporting both single-machine and distributed training, and offers tools for managing, tracking, and visualizing the results of experiments. Because it is built on the Ray platform, it is also seamlessly compatible with other Ray libraries and tools for distributed computing and parameter tuning.

34 https://github.com/DLR-RM/rl-baselines3-zoo
35 https://spinningup.openai.com
36 https://pettingzoo.farama.org

When to use which RL library? An important question to answer in this primer is when to use which of the presented RL libraries. CleanRL is recommended either for fully understanding how an algorithm is implemented or for RL researchers who want to quickly prototype new ideas, since its design decision to separate each algorithm into its own file lets the researcher focus on the algorithm instead of the complex software architecture of other RL algorithm libraries with intertwined modular implementations. SB3 is primarily intended to offer well-tested baseline implementations of important DRL algorithms as a benchmark baseline for new RL developments. However, along with its extensions SB3-contrib and Zoo, it is recommended if a high-level interface for fast training of well-established and well-tested RL algorithms on single-agent environments is desired and no scalability via distributed learning is required. RLlib offers a production-ready framework for large-scale projects. It is recommended for multi-agent environments, as well as when high scalability via distributed learning, e.g., on clusters, is required. Furthermore, RLlib includes tested implementations of cutting-edge RL algorithms. Concerning environments and the interaction of agents with them, Gymnasium and PettingZoo are arguably the most recognized and thus important standard APIs for single-agent and multi-agent RL, respectively. Creating environment interfaces that adhere to their API definitions makes it much easier to experiment with different RL algorithms and variations, since many implementations expect Gymnasium or PettingZoo environment instances.
Table 10 provides an overview of the algorithms implemented in the different RL frameworks. Besides the listed algorithms, RLlib supports further algorithms specifically for MARL.

D. DATA LOGGING AND PARAMETER TUNING
Prototyping is an essential aspect of ML development, and the ability to log and monitor experiments is crucial for efficient iteration. Creating individual names for logs and artifacts might work for ten runs but becomes overwhelming after hundreds of runs. When trying to compare different runs or when looking for the respective parameters of a run, having to look into log files is cumbersome. Fortunately, a range of tools has been developed for this task, organizing every run with its parameters and visualizing runs with plots and the option to filter and search for specific runs. These tools are introduced in this section. These tools provide

a suite of features beyond just logging and monitoring, including the ability to perform hyperparameter tuning. Additionally, most of them easily integrate with popular ML frameworks such as TensorFlow, PyTorch, Keras, and Scikit-learn.

TABLE 10: Overview of the implemented single-agent algorithms of different RL frameworks.

Algorithm             | SB3 | SB3 Con. | CleanRL | Spin.UP | RLlib
A2C [52]              |  +  |    -     |    -    |    -    |   +
DDPG [110]            |  +  |    -     |    +    |    +    |   +
DQN [109]             |  +  |    -     |    +    |    -    |   +
HER [188]             |  +  |    -     |    -    |    -    |   -
PPO [189]             |  +  |    -     |    +    |    +    |   +
SAC [190]             |  +  |    -     |    +    |    +    |   +
TD3 [191]             |  +  |    -     |    +    |    +    |   +
ARS [192]             |  -  |    +     |    -    |    -    |   +
QR-DQN [193]          |  -  |    +     |    -    |    -    |   -
RecurrentPPO [194]    |  -  |    +     |    -    |    -    |   -
TQC [195]             |  -  |    +     |    -    |    -    |   -
TRPO [196]            |  -  |    +     |    -    |    +    |   -
Maskable PPO [197]    |  -  |    +     |    -    |    -    |   -
Categorical DQN [198] |  -  |    -     |    +    |    -    |   -
PPG [199]             |  -  |    -     |    +    |    -    |   -
RND [58]              |  -  |    -     |    +    |    -    |   -
VPG [49]              |  -  |    -     |    -    |    +    |   +
A3C [52]              |  -  |    -     |    -    |    -    |   +
AlphaZero [200]       |  -  |    -     |    -    |    -    |   +
Behavior Cloning [201] |  -  |    -     |    -    |    -    |   +
CQL [202]             |  -  |    -     |    -    |    -    |   +
CRR [203]             |  -  |    -     |    -    |    -    |   +
Dreamer [204]         |  -  |    -     |    -    |    -    |   +
IMPALA [205]          |  -  |    -     |    -    |    -    |   +
R2D2 [206]            |  -  |    -     |    -    |    -    |   +
Rainbow [207]         |  -  |    -     |    -    |    -    |   +
SlateQ [208]          |  -  |    -     |    -    |    -    |   +
DD-PPO [209]          |  -  |    -     |    -    |    -    |   +

1) Data Logging
TensorBoard is by far the most popular data logging and visualization tool in ML. It is widely used in conjunction with deep learning frameworks like TensorFlow and PyTorch. However, there are other popular data logging and visualization tools as well, such as Weights & Biases (WandB) and Comet.ml. The popularity of these tools may vary depending on the specific use case and developer preferences. We present in Table 11 a list of the most popular tools. These tools all support different ML frameworks and offer real-time monitoring and visualization of ML models. They also handle various data types and provide customizable views. Weights & Biases and Comet.ml provide experiment tracking and collaboration features that allow for easy collaboration and sharing of results, while Visdom and TensorWatch are great options for debugging and developing since they offer configurable logging options and real-time tensor viewing.
It is important to keep in mind that the developer’s individual demands and preferences will determine the visualization tool that fits best. Some developers might favor a tool that integrates well with their preferred ML framework, whereas others would value flexibility and customization possibilities. Also, some developers could need more sophisticated features like collaboration tools

need the most fundamental logging and visualization options.

2) Tuning Tools
In ML, hyperparameters are often used to control the behavior of the ML model. These hyperparameters include the usual parameters for training such as batch size, which optimizer to use, learning rate and other optimizer-specific parameters, but can also include the structure of the model like number, type and topology of layers, or other custom parameters specific to the application. Selecting optimal hyperparameters is critical to achieving the best possible performance of an ML model. However, searching for the optimal hyperparameters can be a complex and time-consuming task, especially for large datasets and complex models. Tuning tools help to automate this process by systematically searching for optimal hyperparameters based on a user-defined search space and optimization criteria. This can save significant time and resources in the ML development process and help to achieve better model performance. Below is a list of the most popular tools:
• NNI (Neural Network Intelligence) [216]: an open-source toolkit developed by Microsoft for automating and optimizing the hyperparameter tuning process of deep learning models. It provides a framework for designing and conducting experiments using various search algorithms and techniques, such as grid search, random search, Bayesian optimization, evolutionary optimization, and tree-structured Parzen estimator (TPE). It also supports distributed training and can scale up to thousands of nodes for high-performance computing.
• Optuna [217]: an open-source hyperparameter optimization framework for ML. It provides a flexible and modular platform for automating the process of selecting optimal hyperparameters for a given model architecture. Optuna uses various algorithms to search the hyperparameter space, including TPE, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), Non-Dominated Sorting Genetic Algorithm II (NSGA-II), and adaptive sampling. It also supports distributed optimization across multiple nodes for faster and more efficient tuning.
• Ray Tune [218]: the hyperparameter tuning component of the Ray framework. It handles the execution of experiments including parameter studies with possibly multiple repetitions as well as scheduling the runs for parallel execution. For hyperparameter tuning, it supports a wide variety of approaches. These include basic strategies such as grid or random search, but also more advanced approaches such as Bayesian optimization or Population Based Training [219]. While some algorithms are implemented internally, it relies heavily on third-party
or hyperparameter optimization, while others might only optimization libraries such as Hyperopt [220] and

TABLE 11: List of Visualization and Logging Tools

Tool | Features and Capabilities | Integration with ML Frameworks | Additional Features
TensorBoard [210] | Web-based visualization tool for monitoring and analyzing ML models. Features include histograms of weights and biases, visualization of the computational graph, and tools for comparing models. | TensorFlow and PyTorch | -
Weights & Biases [211] | Platform for ML experimentation, including experiment tracking, visualization, and collaboration. Provides monitoring of metrics, model performance, and hyperparameter optimization via sweeps. | TensorFlow, PyTorch, Keras, XGBoost, and more | Experiment tracking and collaboration tools
Comet.ml [212] | Platform for ML experimentation, including experiment tracking, visualization, and collaboration. Features include experiment management, collaborative workspace, and model versioning. | TensorFlow, PyTorch, Keras, Scikit-Learn, and more | Experiment tracking and collaboration tools
MLflow [213] | Open-source platform for ML experimentation, including experiment tracking, visualization, and model deployment. Features include experiment management, model registry, and reproducible code management. | TensorFlow, PyTorch, Keras, Scikit-Learn, and more | Experiment tracking and model management
Visdom [214] | Web-based visualization tool for deep learning. Provides real-time plotting of various types of data, including images, video, and text, and supports PyTorch. Features include custom visualizations, support for multiple windows, and flexible logging. | PyTorch | Flexible logging and custom visualizations
TensorWatch [215] | Debugging and visualization tool for deep learning. Allows visualization of data flow graphs, histograms, and confusion matrices, and supports TensorFlow, PyTorch, and MXNet. Features include real-time inspection of tensors, automatic model visualization, and integration with Jupyter Notebook. | TensorFlow, PyTorch, MXNet | Real-time tensor inspection and automatic model visualization

Optuna [217], and provides a unified interface to them.
• Keras Tuner [221]: a library customized for Keras that provides an easy-to-use API for defining a hyperparameter search space, choosing search algorithms such as random search and Bayesian optimization, and running hyperparameter search processes. Furthermore, Keras Tuner is easy to integrate with other Keras workflows and supports both single-node and distributed hyperparameter search.
• HyperOpt [222]: a Python library for hyperparameter optimization that uses a combination of random search and Bayesian optimization to efficiently explore and exploit the hyperparameter search space. It provides an easy-to-use API for defining the hyperparameter search space, selecting optimization algorithms, and executing the hyperparameter search process. HyperOpt uses a Tree-structured Parzen Estimator (TPE) algorithm to model the relationship between hyperparameters and model performance and to guide the search for better hyperparameters. HyperOpt also allows for the parallelization of the search process, making it scalable to large hyperparameter search spaces and parallel computing environments. It can be used with a variety of machine-learning frameworks, including Scikit-learn, Keras, and PyTorch.
• Scikit-Optimize [223]: a Python library for sequential model-based optimization that aims to efficiently explore and exploit the hyperparameter search space while minimizing the number of model evaluations. It provides a simple and flexible API for defining the hyperparameter search space and selecting optimization algorithms, including Bayesian optimization and gradient-based optimization. Scikit-Optimize also supports parallel evaluation of the search process, making it scalable to large hyperparameter search spaces and parallel computing environments. In addition to hyperparameter optimization, it can be used for function optimization and global optimization tasks. Furthermore, it integrates easily with popular ML frameworks such as Scikit-learn and Keras, while including features such as early stopping and warm-starting to further improve the efficiency of the hyperparameter search process.
Note that Table 12 presents only the most commonly used algorithms for each tool. While other algorithms may be added, results may vary depending on the specific use case and requirements. Overall, the choice of which tool to use depends on the specific requirements and use case. For example, if there is a

TABLE 12: Comparison of Hyperparameter Tuning Tools

Tool | Common Algorithms | ML Libraries | Scalability | Ease of Use | Community
Ray Tune | Hyperband, PBT, Bayesian | TensorFlow, PyTorch, others | High | Simple API | Large
Hyperopt | Bayesian | TensorFlow, PyTorch, others | Medium | User-friendly | Large
Optuna | TPE, CMA-ES, NSGA-II | TensorFlow, PyTorch, others | Medium | User-friendly | Large
Keras Tuner | Random search, Bayesian, Hyperband | Keras | Low | Very user-friendly | Small
Scikit-Optimize | Bayesian, gradient-based | TensorFlow, PyTorch, others | Low | Requires programming experience | Small
NNI | Grid search, random search, evolutionary optimization, TPE | TensorFlow, PyTorch, MXNet | High | Simple API | Large
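
The tools compared in Table 12 all automate the same basic loop: sample a configuration from a user-defined search space, evaluate it, and keep the best result. The following minimal sketch illustrates that loop with plain random search using only the Python standard library. The search space, the `toy_validation_loss` objective, and all parameter names are hypothetical stand-ins for training and validating a real model; none of them belongs to any of the listed tools.

```python
import math
import random

# Hypothetical search space: each entry maps a name to a sampler.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -1),  # log-uniform
    "num_layers": lambda rng: rng.randint(1, 6),
    "dropout": lambda rng: rng.uniform(0.0, 0.5),
}


def toy_validation_loss(cfg):
    """Stand-in for 'train a model and return its validation loss'.

    Assumed optimum (illustration only): lr = 1e-3, 3 layers, dropout 0.2.
    """
    return (
        (math.log10(cfg["learning_rate"]) + 3.0) ** 2
        + 0.1 * (cfg["num_layers"] - 3) ** 2
        + (cfg["dropout"] - 0.2) ** 2
    )


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations and return the best (config, loss)."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: sample(rng) for name, sample in space.items()}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss


if __name__ == "__main__":
    cfg, loss = random_search(toy_validation_loss, SEARCH_SPACE, n_trials=200)
    print(f"best loss {loss:.4f} with {cfg}")
```

In practice, the listed libraries replace the uniform sampler with adaptive strategies such as TPE and add scheduling and parallel execution on top of this loop.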

need for scalability and distributed training, Ray Tune is a good choice. If there is a need for a general-purpose optimization library, then Scikit-Optimize might be a good choice.

E. TESTBEDS
As previously outlined, it is hard to replicate realistic network conditions, and using existing datasets might not always fit the problem. While simulation tools can help with that, there is also the possibility of using existing testbeds or building your own. Access to the existing real-world testbeds is usually open or free to researchers, but you might have to schedule your experiments and wait depending on utilization. In the following, some popular real-world testbeds and some devices one could use to build a testbed will be introduced. There are two types of testbeds: wired and wireless. The wireless ones are wireless sensor networks without any routers or switches, and communication is broadcast. Testbeds are relatively versatile and can be used either to test ML applications that rely on networks, like Distributed ML, or to test ML algorithms that do traffic routing, for example.

1) Real-world Testbeds
For a more extensive overview, [224], [225], [226], and [227] provide surveys that either include a section about testbeds or are entirely about testbeds. We present a selection of popular testbeds, starting with wireless testbeds.
FlockLab [228] is an experimental platform that enables researchers to test and evaluate the performance of wireless sensor networks (WSN) and IoT systems. It is a flexible, open-source testbed that provides a controlled and repeatable environment for the evaluation of various applications. An advantage of FlockLab is its flexibility, as it can be used to test and evaluate a wide range of wireless sensor networks and IoT systems [228]. It supports various wireless technologies, such as Zigbee, Z-Wave, and LoRaWAN, and it can be easily extended to support new technologies. FlockLab is widely used in the field of WSNs and IoT systems [229], and it has been developed and maintained by the Communication Systems Group at ETH Zurich.
FIT IoT Lab [230] is an open-access testbed for IoT experiments provided by the French Institute of Technology. It contains over 1500 nodes offering a wide range of low-power wireless devices that can be used to test and evaluate various IoT applications, protocols, and algorithms. In addition, its large-scale infrastructure and easy-to-use web interface provide a flexible and convenient platform for IoT experimentation.
D-Cube [231] is a testbed by Graz University of Technology. It contains about 50 nodes with two platforms, nRF52840 and TelosB, and provides a set of predefined scenarios. These scenarios allow researchers to evaluate protocol performance and to compare protocols against each other easily.
CLOVES [232] is a part of the IoT Testbed at the University of Trento. It contains 275 indoor devices spread over 8000 square meters. Communication is possible using ultra-wideband or narrowband, and all nodes are remotely accessible.
Next, we are going to introduce some wired testbeds. Note that some of them also provide wireless capabilities.
PlanetLab [233] was founded in 2002 by researchers from several universities, including Princeton University, the University of California at Berkeley, and Stanford University. While it was shut down in March 2020, PlanetLab Europe (footnote 37) continues to operate. It is a collection of interconnected computers located at over 250 sites in more than 40 countries across Europe and beyond, available for researchers to use in their experiments. PlanetLab Europe provides researchers with virtual machines, storage, and network connectivity. In addition, researchers can deploy their software on the nodes and create custom network topologies to simulate various network scenarios.
Footnote 37: https://www.planet-lab.eu/
Footnote 38: https://www.emulab.net/
EmuLab (footnote 38) [234] is a network testbed developed by the University of Utah that provides users with a virtual network environment to test and evaluate various


networking systems and applications. Emulab allows researchers to create and configure network topologies, deploy software and network services, and generate different types of network traffic to test and evaluate various networking scenarios.
GENI (Global Environment for Network Innovations) [235] is a US national-scale network testbed that provides researchers with a virtual laboratory for developing and testing new networking technologies and applications. It comprises a large-scale network of interconnected computing resources, including servers, routers, switches, and other network devices.

2) Building Your Own Testbed
When seeking greater control over a testbed, building a customized one emerges as a viable option. Fortunately, there are several cost-effective devices available for this purpose, with some even incorporating machine learning accelerators [236]. These accelerators enable the deployment of machine learning models for training and inference within the testbed and offer a variety of communication approaches. In this section, we will provide a list of the most common and popular devices used for this purpose, along with detailed explanations of their respective advantages.
NVIDIA Jetson (footnote 39) is a series of embedded computing boards designed for IoT and ML applications. They include NVIDIA GPUs and CPUs, as well as a variety of interfaces and sensors for connecting to other devices. Jetson boards are designed to be low-power and compact, making them suitable for portable and battery-powered applications. They can be used for various tasks, including image and video processing, deep learning, and robotic control.
Google Coral (footnote 40) includes a range of hardware and software products, such as the Coral Dev Board, the Coral USB Accelerator, and the Edge TPU software. The Coral Dev Board is a single-board computer that is designed to be small and low-power, making it suitable for use in portable and battery-powered devices. It has a system-on-a-chip (SoC) that includes a Google Edge TPU, which is a custom-built Tensor Processing Unit (TPU) for running ML/DL models. The Coral USB Accelerator is a small USB device that can add Edge TPU capabilities to existing devices. The Edge TPU software provides a set of libraries and tools for developing and deploying ML models.
The Raspberry Pi boards are equipped with a variety of interfaces and peripherals, such as USB ports, Ethernet, HDMI, and a 40-pin expansion header. They also have high CPU and memory capacity, which makes them powerful enough to run various applications. The Raspberry Pi can run TensorFlow Lite and other ML frameworks, enabling researchers to run pre-trained models and perform basic ML tasks. It can also be used as an edge device for collecting and preprocessing data before sending it to the cloud for further analysis.
The Intel Movidius Neural Compute Stick (footnote 41) is a USB device that provides on-device AI inference for various applications in networked systems. It features a Myriad 2 VPU, which can run deep neural networks with low power consumption. The Neural Compute Stick can accelerate computer vision, speech recognition, and natural language processing tasks in networked devices.
Footnote 39: https://www.nvidia.com/de-de/autonomous-machines/embedded-systems/
Footnote 40: https://coral.ai/
Footnote 41: https://www.intel.com/content/www/us/en/developer/articles/tool/neural-compute-stick

V. EXPLAINABLE ARTIFICIAL INTELLIGENCE
While ML and especially DL models are powerful tools for network service providers, they come with the major drawback that their reasoning is difficult for humans to understand due to their black-box characteristics [237]. This lack of understanding may result in stakeholders, e.g., network service providers, not deploying ML models in production environments as they do not trust their reasoning and, thus, fear outages or revenue losses. To alleviate these concerns, Explainable AI (XAI) is well-suited as it helps to understand the underlying reasoning of ML models. This understanding is achieved by intelligently relating inputs and outputs. The transformation function learned in this way, or at least parts of it, becomes interpretable. Usually, this interpretability comes in the form of mathematical functions or as heatmaps describing the influence of the inputs on the model’s decision. In addition, a quantification of a model’s uncertainty is fundamental for risk assessment during deployment, thereby paving the way for Responsible AI.
There are plenty of use cases for applying XAI in communication networks [238]. These use cases include network planning and engineering [239], resource allocation [240], [241], performance management [128], [242], and security management [243], [244]. Most of these works use the methods presented in this chapter to make their models explainable.

A. TAXONOMY OF XAI METHODS
A general overview of XAI techniques is provided in [245], and an extensive survey on XAI methods as well as a taxonomy for XAI methods in general can be found in [246]. XAI methods can be classified into techniques which explain a model locally or globally. A local explanation technique provides model explanations for a single input, e.g., why is a specific packet routed that way, while a global explanation technique provides general explanation strategies of a model, e.g., how does the model route packets in general.
Further, XAI methods can be classified into post-hoc explainers and interpretable models. Post-hoc explainers


are utilized to explain various already trained black-box models, e.g., neural networks or ensemble models. Ensemble models like Random Forest are composed of multiple smaller models jointly determining the output. This makes interpretation difficult. Interpretable, transparent, or glass-box models provide an explanation for how the model obtains the output by design. Prevalent models are, for example, the well-known linear models and decision trees, as well as the less-known generalized additive models.
Finally, model-agnostic methods and model-specific methods are distinguished. Model-agnostic methods can be used on top of every kind of model, while model-specific methods can only be used with specific model families. A prominent example of model-specific methods are saliency maps [247], which are computed from the feature maps learned by a model and can be used in computer vision to highlight the regions on which the model focuses when processing input. They are generally applicable when using CNNs. This also implies that the nature of the data directly influences the applicable XAI techniques for the different use cases, e.g., time series XAI techniques are not usable for graph data.

B. SPECIFIC XAI METHODS
Since there are many different categories of XAI techniques, there is a wide spectrum of specific XAI methods. Thus, the methods explained in the following are only a small selection. Because in many XAI scenarios a black-box ML model should be made intelligible, the methods introduced first focus on post-hoc explainers. While it is common to perform post-hoc explanations, the authors of [248] argue that we should stop using post-hoc explainers and instead directly use interpretable models. Interpretable models often perform worse than black-box models, but are interpretable by design.

1) Post-Hoc Explainers
As a majority of advances in ML happen in computer vision, there exists a huge variety of post-hoc explainers explaining the learnt filters of a CNN, e.g., saliency maps. As a consequence, these techniques are model-specific and usually not applicable for network data. Nevertheless, there exist approaches where network data is transformed to images beforehand, e.g., for encrypted network traffic classification in [249], and processed with a CNN, so saliency maps could be applied here.
Layer-wise Relevance Propagation (LRP) is a post-hoc method that uses the neural network’s forward pass and propagates its output backwards through the layers until the input layer to derive the relevance of an input on the model’s prediction.
A prevalent local model-agnostic post-hoc explainer is called SHapley Additive exPlanations (SHAP), which uses methods from game theory to judge the importance of different feature inputs. Although this method can explain the black box of an ML model very well, it comes with the drawback that it needs high computational power. Thus, it is only feasible for models with fewer input parameters [250].
A well-working method for explaining classification models in a model-agnostic fashion is Local Interpretable Model-agnostic Explanations (LIME) [251]. LIME belongs to the class of surrogate models, where a model is used to approximate the predictions of a target black-box model to infer the reasoning of the black-box model. LIME trains a local surrogate model to explain the predictions for a specific sample by first aggregating permutations of the original feature inputs of the sample into a new dataset, weighting the samples of the dataset according to their proximity to the original sample, and then training an interpretable model on this dataset to approximate the predictions of the black-box model. After training, the local model can be interpreted to understand the black-box model’s reasoning.
Another type of local model-agnostic post-hoc explainers are counterfactual explanations [245]. Counterfactual explanations are used for causal reasoning and may serve to answer what-if questions, i.e., "would Y have occurred if X had not occurred before". These techniques may be helpful for network operators when they try to analyze and manage their network with respect to critical situations, e.g., how to avoid congestion in a network. In a nutshell, they work by deriving causal relationships from the input features and then manipulating input features to perform specific reasoning.

2) Interpretable Models
The easiest to interpret and best-known interpretable models include decision trees, which are interpretable in an if-else fashion, and linear models like linear regression or logistic regression, where slope and intercept directly characterize the input mapping. Generalized linear models (GLMs) and generalized additive models (GAMs) extend linear models to better reflect non-linear functions and target distributions other than the Gaussian distribution assumed by linear regression [245]. Especially for GAMs, many different models exist by now that are directly interpretable. The main idea behind GAMs is that a separate function f_i is learnt for each feature x_i and that the outputs of these functions are then summed up with a bias β to perform the prediction. The bias can be learnt or fixed beforehand. Variants of GAM include the Explainable Boosting Machine (EBM) [252], which is a tree-based GAM and uses gradient boosting to improve performance, and the Neural Additive Model (NAM) [253], which consists of n neural subnetworks (f_i) transforming each of the n input features to a new representation. The architecture of NAM with its n subnetworks and the bias β is depicted in Fig. 7.
In Fig. 8, we show the interpretability of EBM and

FIGURE 7: Architecture for Neural Additive Model (NAM) [253]. [Diagram: each input x_1, ..., x_n is passed through its own subnetwork f_1(x_1), ..., f_n(x_n); the outputs are summed (∑) together with the bias β.]

FIGURE 8: Interpretability of EBM and NAM for an exemplary video streaming QoE modelling scenario [254]. [Panels: effect on MOS of Avg. Bitrate [Mbps], Initial Delay [s], Stallings [#], Total Stalling Duration [s], and Quality Switches [#], each shown for EBM and NAM; a further panel shows MOS over Avg. Bitrate [Mbps] for the fits IQX, WQL, and LIN.]
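
The additive structure shown in Fig. 7 can be written as a prediction of the form β + f_1(x_1) + ... + f_n(x_n). The following sketch illustrates this with hand-written shape functions for three QoE-related features. The function shapes, the feature names, and all numeric constants are assumptions made purely for illustration; they are not taken from [253] or [254], where the functions are learnt from data.

```python
import math

# Hypothetical shape functions f_i, one per feature (illustration only).
# In an EBM or NAM these would be learnt from data, not written by hand.
SHAPE_FUNCTIONS = {
    "avg_bitrate_mbps": lambda x: 1.5 * (1.0 - math.exp(-0.5 * x)) - 1.0,
    "initial_delay_s": lambda x: -0.05 * x,
    "num_stallings": lambda x: -0.4 * x,
}
BIAS = 3.5  # hypothetical bias term (beta)


def explain(sample):
    """Return the per-feature effects f_i(x_i) for one sample."""
    return {name: f(sample[name]) for name, f in SHAPE_FUNCTIONS.items()}


def predict_mos(sample):
    """Additive prediction: bias plus the sum of all feature effects."""
    return BIAS + sum(explain(sample).values())


if __name__ == "__main__":
    sample = {"avg_bitrate_mbps": 4.0, "initial_delay_s": 2.0, "num_stallings": 1}
    for name, effect in explain(sample).items():
        print(f"{name:>18}: {effect:+.2f}")
    print(f"predicted MOS: {predict_mos(sample):.2f}")
```

Because the prediction is a plain sum, each printed effect is already the explanation: it states how much the corresponding feature raised or lowered the predicted score.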
NAM with the exemplary use case of Quality of Experience (QoE) modelling for video streaming [254]. QoE describes the delight or annoyance of a user with a networked service as influenced by different factors, e.g., context factors or system factors [255]. In order to avoid customer churn, network service providers and ISPs are thus interested in estimating QoE or, as a proxy, QoE influence factors from encrypted network traffic. For this purpose, most approaches rely on ML, as can be seen for video streaming and web browsing in [157], [256]. For data-driven QoE modelling, the goal is to predict the Mean Opinion Score (MOS) of a service on the range from 1 (bad) to 5 (excellent) in a regression task given some QoE influence factors. With video streaming QoE, some of these influence factors include video quality, initial delay, stalling events (interruption of playback due to buffer underrun), and quality switches. Given a sample X with the individual feature values x_1 to x_5 for avg. bitrate, initial delay, number of stallings, total stalling duration, and number of quality switches, the model simply transforms each feature input x_i with the learnt function f_i to an effect and sums up these effects along with the bias β to obtain the prediction. The figure for example shows that an increase in avg. bitrate leads to a higher MOS (positive effect) and that stalling events severely degrade the MOS (negative effect). A service provider should therefore try to increase the avg. bitrate and try to avoid stalling events to improve QoE. The learnt functions of the model can be easily interpreted, and the model’s decision can therefore always be reconstructed from the inputs. Given some experts’ domain knowledge, the learnt functions can also be verified against the functions that experts would expect. Both these aspects strongly increase the trust towards the model and, thus, allow actual deployment.

C. UNCERTAINTY
One major issue with every form of ML is the fact that training usually consists of fitting models such that the mean response of the model approximates the ground truth as closely as possible. This results in the model learning point estimates, which may be close to the ground truth or very far away from it for single data points, while across all data points the model may be performing well on average. While this may be acceptable for some use cases, for other use cases this becomes problematic as soon as wrong decisions become costly for stakeholders, e.g., an ML model organising the routing tables in a network may cause downtimes in a network due to wrongly deleting or adding routing rules. Thus, stakeholders have to assume that the model does not always perform as expected. This, however, impedes the deployment of ML models in general, and in communication networks in particular, and a way to quantify the certainty or, more importantly, the uncertainty is required.
Uncertainty thus describes a model’s insecurity about its prediction. Uncertainty can be divided into aleatoric uncertainty and epistemic uncertainty [257]. Aleatoric uncertainty represents the uncertainty in the data used for training and testing. This kind of uncertainty can never be reduced as the underlying process of the training and test data generation already includes noise. Epistemic uncertainty, on the other hand, can be reduced as this uncertainty is caused by the ML model due to a lack of training data at specific points or an insufficient capacity, i.e., the model is not complex enough to capture the actual relationships between the features. Hence, epistemic uncertainty can also be considered as model uncertainty. This uncertainty can be reduced, for example, by collecting more training data in regions where data is scarce, or by increasing model complexity.
In most ML and XAI models, a way to quantify uncertainty is usually not included. Instead, different means to quantify aleatoric and epistemic uncertainty have to be used. Note also that many approaches quantify only epistemic or aleatoric uncertainty, but not both simultaneously. A survey of existing approaches is provided in [258]. In the following, some selected ways to quantify uncertainty are briefly introduced. A way to estimate epistemic uncertainty is for example the use of

TABLE 13: Overview on XAI techniques and libraries.

Category | Model | Local | Global | Classif. | Regr. | Libraries
Post-Hoc | SHAP | ✓ | ✗ | ✓ | ✓ | SHAP, InterpretML, OmniXAI, Alibi
Post-Hoc | LIME | ✓ | ✗ | ✓ | ✗ | LIME, InterpretML, OmniXAI, ELI5
Post-Hoc | Anchors | ✓ | ✗ | ✓ | ✗ | Anchors, Alibi
Post-Hoc | Gradient-based Methods | ✓ | ✗ | ✓ | ✓ | OmniXAI, Alibi, Captum, TorchRay
Post-Hoc | Permutation Importance | ✓ | ✗ | ✓ | ✓ | OmniXAI, ELI5, Alibi
Post-Hoc | Partial Dependence Plot | ✓ | ✗ | ✓ | ✓ | Scikit-learn, OmniXAI, Alibi, explainX
Interpretable | Linear Regression | ✓ | ✓ | ✗ | ✓ | Scikit-learn, InterpretML, OmniXAI
Interpretable | Logistic Regression | ✓ | ✓ | ✓ | ✗ | Scikit-learn, InterpretML, OmniXAI
Interpretable | Decision Tree | ✓ | ✓ | ✓ | ✓ | Scikit-learn, InterpretML, OmniXAI
Interpretable | Decision Rules | ✓ | ✓ | ✓ | ✓ | InterpretML, OmniXAI
Interpretable | Explainable Boosting Machine (EBM) | ✓ | ✓ | ✓ | ✓ | InterpretML
Interpretable | Neural Additive Models (NAM) | ✓ | ✓ | ✗ | ✓ | Github
Interpretable | TabNet | ✓ | ✗ | ✓ | ✓ | TabNet
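
SHAP, listed first in Table 13, builds on Shapley values from cooperative game theory. The following sketch computes exact Shapley values for a single prediction by enumerating all feature subsets, which also illustrates why the cost grows exponentially with the number of input features. It is not the API of the SHAP library: the `toy_model`, the all-zero baseline, and the choice to replace features outside a coalition by their baseline value are simplifying assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value
    (a simplifying assumption made for this sketch).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical black box with an interaction between inputs 0 and 1."""
    return 2.0 * z[0] + z[0] * z[1] - 3.0 * z[2]


if __name__ == "__main__":
    x, baseline = [1.0, 2.0, 0.5], [0.0, 0.0, 0.0]
    phi = shapley_values(toy_model, x, baseline)
    print("Shapley values:", phi)
    # Efficiency property: contributions sum to f(x) - f(baseline).
    print("sum:", sum(phi), "vs", toy_model(x) - toy_model(baseline))
```

For n features, the inner loops evaluate the model on the order of n · 2^(n-1) coalitions, so exact enumeration is only practical for small n; practical libraries rely on approximations instead.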

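The local surrogate idea behind LIME (perturb the sample, weight the perturbations by proximity, fit an interpretable model) can be sketched in a few lines. The code below is a simplified, LIME-style illustration using only the standard library and is not the implementation of the `lime` package: the Gaussian perturbation, the exponential proximity kernel, the parameter defaults, and the `toy_black_box` function are all assumptions chosen for this example.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lime_like_explanation(black_box, x, n_samples=500, scale=0.5,
                          kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around the sample x.

    Returns [intercept, coef_1, ..., coef_d] of the local linear model.
    """
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]   # perturbed input
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + z)                         # prepend intercept term
        targets.append(black_box(z))                   # query the black box
        weights.append(math.exp(-dist2 / kernel_width ** 2))
    # Weighted least squares via the normal equations (X^T W X) w = X^T W y.
    k = len(x) + 1
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
            for i in range(k)]
    return solve_linear_system(xtwx, xtwy)


def toy_black_box(z):
    """Hypothetical non-linear model standing in for a trained model's score."""
    return math.tanh(z[0]) + 0.1 * z[1] ** 2


if __name__ == "__main__":
    coefs = lime_like_explanation(toy_black_box, [0.0, 1.0])
    print("intercept and local feature weights:", coefs)
```

The returned coefficients describe the black box only in the neighbourhood of the chosen sample; a different sample generally yields different local weights.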
ensembles, i.e., training the same model with different seeds and considering the predictions of each model and, in particular, the differences in these predictions. The stronger the differences between the models’ predictions, the higher the uncertainty. This approach can be used for any kind of model. Another simple approach for estimating epistemic uncertainty in neural networks is the use of Monte Carlo Dropout. With Monte Carlo Dropout, the dropout layers, which are usually used for improved model generalizability during training, are also kept active during inference. Generating multiple model predictions with active dropout can also be considered as approximate Bayesian inference. Again, the variation in the returned predictions quantifies the degree of uncertainty. To learn aleatoric uncertainty, the model usually has to learn not only the mean response but also the variance [259]. For a regression task with neural networks, this can be done by adding another head, i.e., an output neuron, that learns the variance, and by adjusting the loss function accordingly. Using the negative log-likelihood of a Normal distribution (or any other distribution) as the loss function, the network learns a Normal distribution for each input, and the predicted variance quantifies the uncertainty for that input. Finally, Bayesian Neural Networks as proposed by Kendall and Gal [259] can model both aleatoric and epistemic uncertainty. With Bayesian Neural Networks, model weights are assigned a probability distribution instead of a single value. Using these probability distributions, it is then possible to quantify epistemic uncertainty. For aleatoric uncertainty, they simply use two heads, where they learn both mean and variance for a data point.

D. RESPONSIBLE AI
Strongly related to uncertainty is the concept of Responsible AI. According to Arrieta et al. [246], XAI alone is not sufficient for an ethical and responsible usage of ML models. Responsible AI is in general a much broader topic than XAI [260]. With responsible AI, there are additional principles that must be kept in mind when developing and deploying ML models. These principles include the prevention of discrimination against persons, groups, or races, i.e., the model must be fair. In the context of communication networks, unfairness could for example mean that a model discriminates against specific users by assigning them lower bandwidth shares and a higher latency. Additionally, responsible AI ensures that users or stakeholders are always aware of the usage of ML models. Specifically, it must be transparent to everybody that ML has been used and how it has been used. An example for communication networks is the adaptive change of a routing table by an ML model. The model must be able to outline why a change was required and why it has changed specific routes. Next, the use of ML models should always be beneficial for humanity in all aspects of life. They should not be used in disruptive ways, e.g., generating downtimes in a network for specific users on purpose. Finally, privacy and security are also very important topics. ML models require data. Here, the privacy and security of sensitive data must be maintained throughout the whole lifecycle of preparing and deploying the model. Responsible AI is still a young field of research. Nevertheless, all the mentioned principles must be kept in mind when preparing and deploying ML models in practice. It is one of those topics already diligently discussed in the conceptualization of future networks, e.g., 6G [261]. Meanwhile, [262] is a more generic survey of best practices to ensure that AI environments are responsible.

E. LIBRARIES
Several XAI libraries are available for all kinds of frameworks, e.g., Scikit-learn, PyTorch, and TensorFlow (cf. Section IV). Microsoft created a Python library named InterpretML [252], which unifies black-box explainers, e.g., SHAP values, LIME, or Partial Dependence Plots, and transparent models, e.g., linear models, decision trees, decision rules, and also EBM, a tree-based generalized additive model. OmniXAI [263], AIX360 [264],
and Alibi [265] also provide a collection of various post-hoc explainers and models for all kinds of data types and backends. In contrast to the libraries containing several different tools, individual explainers like SHAP or Anchors, but also interpretable models like the attention-based model TabNet (https://dreamquark-ai.github.io/tabnet/), are available as separate Python packages.

All gradient-based methods, e.g., Integrated Gradients [266], can be directly implemented within PyTorch and TensorFlow, or additional libraries like Captum [267], TorchRay [268] and TF-Explain [269] can be used. Captum also comprises a huge number of techniques for explaining image-based data.

VI. NETWORKS FOR MACHINE LEARNING
In the previous sections, we primarily explained ML methods, architectures, and principles to develop ML models. Hence, we focused on applying ML to design and optimize networks, detect patterns and anomalies, and predict network behavior autonomously. We refer to this application as "ML for Networks" [270], [271], where ML models are developed from network data to, e.g., design the communication topology of a network or to balance the traffic load.

However, networks and ML form a mutual relationship in which networks support ML, e.g., by using a network as an infrastructure for ML algorithms, both for training and inference. As we will see throughout this section, networks are thus a key success factor for ML by connecting and providing computational power and data storage [272], [273]. We refer to this support and infrastructure functionality of networks as "Networks for ML". It is important to note that it is detached from the ML model application. Instead, any ML model can be trained or deployed in a networked system. As ML is a relatively new network task, challenges for networks arising from ML traffic and possible effects on ML from networks are still the subject of research. "Networks for ML" generally comprises these open research questions.

ML algorithms primarily use a network to access data from memory or to exchange model parameters/updates. The traffic load generated, the traffic shape, and network requirements, e.g., regarding latency and robustness, are unknown for many ML methods and are likely to be application-specific and method-specific. All this can pose new challenges for networks and make a better understanding of the mutual relationship between ML and networks necessary. Thus, it is no longer sufficient to evaluate ML model performance alone; the network performance must be evaluated as well. Hence, one might ask the question: Which metrics should be used to evaluate model and network performance when applying ML in networks?

From Section II, we know that several metrics can be used to evaluate the performance of ML models. These metrics could depend on the specific task or application of the model [274]. Although these metrics were introduced for ML models with network application ("ML for Networks"), it is worth noting that some metrics can also help answer the questions arising in "Networks for ML". The choice of metrics will depend on the specific problem and the desired outcome. Hence, "ML for Networks" and "Networks for ML" are not mutually exclusive [275].

For instance, Data Quality is a metric that can be used for evaluating both. As ML is generally data-driven, data quality is very important for model development. Thus, when ML is applied for network tasks, data quality is primarily measured by the correctness and representativeness of events/classes. This can also be utilized in "Networks for ML". However, as it focuses on decentralized data sources, data distribution can additionally be considered. Other metrics typically considered by the ML community are: Privacy, Robustness, Energy Efficiency, and Fairness. However, as "Networks for ML" also focuses on network behavior, typical network metrics are often applied, such as Throughput, Latency, Packet loss rate, and Spectral efficiency. Table 14 further explains the metrics and their impact.

So why do these network metrics influence the ML models? High latency and low throughput (as well as low spectral efficiency) can cause delays in the training process, leading to slower training times and increased iteration cycles. Packet loss can impact the accuracy and also the consistency of ML models, because it can lead to incorrect or incomplete data inputs, and can cause inconsistent data transfer in case of retransmissions. This, in turn, can affect the model’s ability to generalize, converge and make accurate predictions.

Different network topologies could affect the "Networks for ML" performance, scalability, and security. Furthermore, when considering ML for Networks, the choice of network topology can also affect the accuracy and efficiency of the models.

For example, in a star topology, all nodes are directly connected to a central hub, which can make the network easier to manage and administer. From an ML perspective, this topology would lend itself to centralized learning, where data from all nodes is collected and processed in a central location. This approach could simplify the deployment and maintenance of the ML model, but it could also lead to a single point of failure and potential privacy concerns.

On the other hand, a mesh topology, in which nodes are connected in a decentralized fashion, can be more resilient to failures and provide more privacy, but it can also be more difficult to manage. In terms of ML, this topology can be suitable for distributed learning, where each node trains a local model and shares its knowledge with the other nodes. This approach could improve the scalability and privacy of the model, but it could also
TABLE 14: Examples of metrics for ML for Networks and Networks for ML.

Metric: Throughput
  Usage: it measures the amount of data that can be transmitted over a network in a given time period.
  Networks for ML: implicitly impacts the convergence rate of ML models.
  ML for Networks: the main objective is to optimize these metrics for network applications.

Metric: Latency
  Usage: it measures the time it takes for a packet to travel from the source to the destination.
  Networks for ML: implicitly impacts the convergence rate of ML models.
  ML for Networks: the main objective is to optimize these metrics for network applications.

Metric: Spectral efficiency
  Usage: it measures the amount of data that can be transmitted per unit of wireless bandwidth.
  Networks for ML: implicitly impacts the convergence rate of ML models.
  ML for Networks: the main objective is to optimize these metrics for network applications.

Metric: Packet loss rate
  Usage: it measures the proportion of packets that are lost during transmission.
  Networks for ML: implicitly impacts the convergence rate of ML models; also impacts the model accuracy.
  ML for Networks: the main objective is to optimize these metrics for network applications.

Metric: Privacy
  Usage: it measures the level of data privacy that is maintained during the training/learning process.
  Networks for ML: any data such as medical data or images.
  ML for Networks: network data such as HTTP or IP traffic.

Metric: Robustness
  Usage: it measures the ability to cope with changes in the network.
  Networks for ML: the ability of a model to maintain its performance when tested on new data that is different from the training data.
  ML for Networks: the ability of a network to continue operating in the presence of failures or attacks.

Metric: Energy Efficiency
  Usage: it relates to the power consumption of network devices.
  Networks for ML: the product of power consumption per inference/training step and the amount of time for executing the task [276].
  ML for Networks: for wireless networks, it measures the amount of energy consumed per bit of data transmitted.

Metric: Fairness
  Usage: for resource allocation, it measures the distribution of resources among different users or devices.
  Networks for ML: refers to correcting and eliminating bias in ML decisions. Hence, resources are data.
  ML for Networks: resources are bandwidth and computation.

Metric: Data Quality
  Usage: it measures the quality of data that is used for training across different devices.
  Networks for ML and ML for Networks: high-quality data may compromise privacy or suffer from high latency when collecting them. There is a trade-off between this metric and the aforementioned ones.
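To make the link between the network metrics of Table 14 and training time more tangible, the following back-of-the-envelope sketch estimates the duration of one training round in which a worker uploads an update and downloads the new model. The cost model is an assumption made here for illustration and is not taken from the paper: transfer time is latency plus payload divided by goodput, and packet loss is modeled as reducing goodput by the factor (1 - loss rate) due to retransmissions. The function names and all numbers are hypothetical.

```python
# Rough estimate of how throughput, latency and packet loss affect the
# duration of one training round. Assumed (hypothetical) cost model:
#   transfer time = latency + payload / (throughput * (1 - loss_rate))

def transfer_time_s(payload_bits, throughput_bps, latency_s, loss_rate=0.0):
    """Approximate time to deliver one payload over a single link."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    goodput_bps = throughput_bps * (1.0 - loss_rate)  # retransmissions cost capacity
    return latency_s + payload_bits / goodput_bps


def round_time_s(compute_s, payload_bits, throughput_bps, latency_s, loss_rate=0.0):
    """Local computation plus one upload and one download of the update."""
    one_way = transfer_time_s(payload_bits, throughput_bps, latency_s, loss_rate)
    return compute_s + 2 * one_way


# Hypothetical model with 1 million 32-bit parameters.
payload = 1_000_000 * 32
fast = round_time_s(0.5, payload, throughput_bps=100e6, latency_s=0.02)
slow = round_time_s(0.5, payload, throughput_bps=10e6, latency_s=0.02)
lossy = round_time_s(0.5, payload, throughput_bps=100e6, latency_s=0.02, loss_rate=0.1)
```

Under these assumptions, a tenfold drop in throughput makes communication dominate the round time, which is one way to read the "implicitly impacts the convergence rate" entries of the table.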

increase the synchronization overhead.

There are also other network topologies, such as bus, ring, tree, and hybrid, which can have different tradeoffs in terms of network metrics. Choosing the right topology for an ML application depends on several factors, such as the size and complexity of the network, the nature of the data and task, and the available resources and constraints.

Examples of these constraints are computational resources and data availability. Regarding the former, a star topology would have a central server that must have sufficient computational resources to process all the data, whereas, in a mesh topology, each node could contribute computational resources, reducing the burden on any one node. Regarding the latter, a star network topology would have the data stored in a single location, which can limit the amount of data available for training. In contrast, a mesh topology could distribute data across multiple nodes, providing a larger and more diverse data set for training. We refer to [277] for a comprehensive survey on the convergence, robustness and privacy of ML algorithms with respect to network architecture and implementation in the context of 5G networks.

In the following, we will explain advanced ML topics that have distributed implementation, exploiting both "ML for Networks" and "Networks for ML" domains.

A. CENTRALIZED ML
Centralized ML refers to training ML models on a central node of the network using data from multiple nodes and is widely applied in networked systems such as the IoT [278], [279]. To this end, the data is first collected from various nodes in the network and then transmitted to a central server to train the ML model. Typically, the data is also preprocessed for the training, which can happen on both the collecting nodes and the central server. In many cases, the central server has more computing resources and larger storage space than the collection nodes. Since training and inference are independent, the resulting model can be used centrally and decentrally for inference. In centralized inference, a central computing node (server) employs the model to infer from the data of various collection nodes. The collection nodes usually send their observed data to the central computing node and receive the model predictions. However, it is also common to distribute the centrally trained ML model to different nodes, which then independently infer from their local data.

Centralized ML takes advantage of central servers (e.g., in the cloud) with powerful computing resources that can handle the processing and training of computationally heavy models using large datasets [280]. While the increase in training speed
and better resource utilization is obvious, the benefit of more accurate predictions requires a more detailed explanation. Unlike the case where each node trains its model using its local data, centralized ML training benefits from aggregating data from multiple nodes [281]. Thus, not only does the model train on a larger dataset, but the aggregated data also better represents the overall data distribution, allowing the model to generalize better. For example, an ML model can extract significant information from data from different sensor types or locations. This is particularly useful in network applications such as smart cities, environment monitoring, and industrial IoT.

However, there are also several disadvantages associated with centralized ML in networked systems. First, the dependence on a central computing node for model training and inference introduces a single point of failure and scalability issues, potentially impacting the reliability and availability of the system [282]. The centralized approach places high demands on the central server in terms of computing and network performance, making its acquisition and maintenance expensive. Secondly, the data collected by networked devices (e.g., multimedia sensors, intelligent vehicles) is transmitted in large quantities over the network, requiring high data rates. Transferring large amounts of data to a central node can cause network congestion and degrade real-time performance [283]. Sensitive data may need to be transmitted to the central server, potentially compromising user privacy. Recently, there have been growing concerns about privacy in networked systems with data generated by networked devices, such as wearable devices or sensors, where data is often very private or sensitive [284]. This results in additional requirements for the network over which the data is transmitted, processed, and stored.

B. DISTRIBUTED ML
In various fields of application, the complexity of tasks being tackled by ML models has led to an increase in the number of model parameters. To cope with this complexity, distributed ML techniques make use of networks of interconnected computing machines to address challenges such as handling larger and distributed datasets, accommodating heightened computing resource demands, and dealing with models that surpass the memory capacity of a single machine. Here, two approaches are prevalent and usually take advantage of networking to enhance model training: 1) data-parallel and 2) model-parallel. Combinations of data- and model-parallel methods are also possible.

Data-parallel corresponds to scale-out parallelization and, therefore, increases computational capacity. During training, several machines, so-called workers, train instances of the ML model. These instances operate on distinct and usually non-overlapping portions of the dataset. All instances have the same model structure, number of layers, and number of neurons per layer, but the parameter values can vary. The workers periodically communicate to exchange model parameters and aggregate their updates after processing a predefined number of samples locally. Various data-parallel methods have been formulated, differing primarily in the manner of cooperation among workers during training, encompassing how workers communicate and where update aggregations occur. From this perspective, architectures can be primarily divided into Client-Server and Peer-to-Peer methods. The Client-Server methods use a set of decentralized workers that process model updates as Clients and a centralized server as Server. The server can be a single worker, or multiple workers organized equally or in hierarchical layers. Regardless of the server’s internal structure, the server maintains the shared model state and stores all model parameters. Clients receive the current model state with its parameter set from the server and communicate their updates only to it. All communication is thus handled by the server, which can lead to a bottleneck. In contrast, Peer-to-Peer methods entail direct communication of updates among workers without the presence of a central server managing the global model state. Which workers can communicate with each other is defined in a communication topology. Here, all-to-all but also graph-based topologies such as trees and rings are possible. In addition to the cooperation relationship, data-parallel methods differ in whether workers transmit their updates synchronously or asynchronously and in the amount of communication overhead incurred. In production, where the model is typically used for inference, the machines use the shared model instance.

Model parallelism, on the other hand, splits the model and distributes it across multiple workers, allowing for model sizes larger than the memory of a single machine. Each worker trains and infers only its parts of the model, which requires less memory. Consequently, the model’s entirety is upheld collectively by all workers, necessitating constant communication among them during both the training and inference phases. The data is fed to the workers that maintain the input layer of the model, and each worker forwards its computed output to the worker holding the next part of the model. In the backpropagation step during training, the workers holding the output layer first compute the updates. The updates are then propagated to the workers in reverse order and applied. A central challenge within model parallelism lies in devising an effective strategy for partitioning a given model across multiple networked machines. This partitioning determines how the model segments are distributed among workers to optimize communication and computation while maintaining overall model coherence.

Common data-parallel and model-parallel methods for distributed ML are explained below.
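The two approaches described in this subsection can be contrasted in a small, single-process sketch. It is only an illustration under assumptions made here: the model is a one-parameter linear regressor, the "workers" are plain Python lists and functions rather than networked machines, aggregation is a simple average of worker gradients, and all names are hypothetical.

```python
# Toy, single-process illustration of data-parallel vs. model-parallel processing.
# Assumptions: 1-D linear model y = w * x, squared-error loss, synchronous
# Client-Server aggregation by plain averaging. No real networking is involved.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)^2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train_data_parallel(shards, w=0.0, lr=0.05, rounds=200):
    """Every worker holds the same model (w) but a different data shard."""
    for _ in range(rounds):
        # Each worker computes an update on its own non-overlapping shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # The server aggregates the updates and applies them to the shared model.
        w -= lr * sum(grads) / len(grads)
        # Synchronous: all workers start the next round from the same w.
    return w


def forward_model_parallel(x, stages):
    """Every worker holds a different model segment and forwards its output."""
    for stage in stages:
        x = stage(x)  # the activation would be sent to the next worker here
    return x


# Data-parallel: toy data generated by y = 3x, split into two shards.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
shards = [data[:2], data[2:]]
w_final = train_data_parallel(shards)

# Model-parallel: a hypothetical model split into two segments.
stages = [lambda h: 2.0 * h, lambda h: h + 1.0]
y_out = forward_model_parallel(3.0, stages)
```

In the data-parallel loop the communicated quantity is one gradient per worker and round, whereas in the model-parallel pass it is the intermediate activation handed from one segment to the next, which is why the two approaches stress the network differently.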
C. PARAMETER SERVER
Parameter Server [285], [286] is a data-parallel Client-Server method (cf. Figure 9a). Here, multiple decentralized clients (workers) are connected to a centralized server (the parameter server). The parameter server stores the model parameters, assigns data to workers, and aggregates the updates received from workers. Often, the parameter server is a single machine but can also be a set of equivalent or hierarchically structured machines [287]. Each worker maintains an instance of the model and individually processes parameter updates based on its data. Typically, SGD is used for parameter optimization during a training process. The processed data can either be captured and stored on the worker machine or transmitted from the parameter server. Usually, workers access only (non-overlapping) portions of the data. The complete dataset is thus distributed across multiple workers. After processing a predefined number of data samples, the workers first propose their parameter updates to the parameter server and then receive the updated model. However, how many other workers have contributed to the updated model depends on the Parameter Server implementation. For synchronous implementations, the parameter server considers updates from all workers. The workers do not continue processing until the updated model has been broadcast. Therefore, the slowest worker impacts the time for a model update significantly. In contrast, in asynchronous implementations, the parameter server updates and broadcasts the model immediately after receiving an update from the sending worker. Here, workers proceed on different model instances. This is a problem in heterogeneous environments with different computing resources and transmission delays. Slower workers working on outdated model instances can derange SGD’s solution with their updates, causing the model to converge incorrectly. In homogeneous cluster environments, this is not the case, and asynchronous systems are often faster than synchronous ones [288]. Since synchronous and asynchronous Parameter Server implementations struggle in heterogeneous environments, time-wise and model-quality-wise, respectively, Parameter Server is typically applied in data centers.

D. FEDERATED LEARNING
Federated Learning (FL) [289] is another data-parallel Client-Server distributed ML method that enables multiple devices to collaboratively train a shared model without sharing their raw data. This approach has gained significant attention in recent years due to its ability to protect user privacy and enable learning on edge devices with limited computational resources.

In FL, multiple devices, such as smartphones, IoT devices, or edge servers, participate in the training process by locally training a model using their own data and then sending their updated model parameters to a central server. The server aggregates the updates from all devices and uses them to update the global model. Figure 9b shows an FL scenario with three connected devices and a central server. The key idea behind FL is that the global model is trained using a large amount of data from multiple devices, while each device only needs to share the model updates. This allows FL to achieve performance comparable to traditional centralized learning while preserving user privacy.

One of the most widely used FL algorithms is Federated Averaging (FedAvg). FedAvg is designed to address several challenges that arise in FL, including the need to preserve data privacy, mitigate bias and inconsistency across devices, reduce communication overhead, and enable model convergence. FedAvg works by having each device train its own local model using its local data, and then the local models are aggregated to form a global model that is distributed back to the devices for further training. To address the challenge of bias and inconsistency across devices, FedAvg uses a weighted average of the local models, with the weights determined based on the amount of data each device contributes to the model. This approach ensures that each device’s contribution is weighted appropriately, producing a more representative and robust global model.

By training a model locally, FL allows devices to make predictions and decisions without the need for a constant network connection to a central server. This is particularly useful in applications such as autonomous vehicles, drones, and medical devices where data needs to be processed in real-time. Additionally, FL is also beneficial in scenarios where data is sensitive and cannot be shared, such as medical imaging or financial data.

Another essential benefit of FL is its ability to handle data that is not independent and identically distributed (non-IID), a common characteristic of data collected from networked devices.

In traditional centralized learning, data is often assumed to be IID, which means that it has the same distribution across all devices. However, in practice, each device can have its own data distribution, which can lead to biased or suboptimal models. FL algorithms such as Federated Averaging [289], Federated Transfer Learning [290], and Federated Meta-Learning [291] have been proposed to address these issues.

E. ALL-REDUCE
The All-Reduce approach [292] is a data-parallel distributed ML method for training ML models and implements the Peer-to-Peer concept. It therefore dispenses with a central server, and instead, workers communicate directly. Which workers communicate with one another is specified by the communication topology used. Multiple communication topologies are possible for the All-Reduce approach, e.g., ring [293], butterfly [294], and trees [295]. The communication topologies affect the
FIGURE 9: Cooperation topologies for common distributed ML architectures: (a) Parameter Server, (b) Federated Learning, (c) Ring All-Reduce, (d) Split Learning.

TABLE 15: Summary of Networks for ML approaches.

Method: Parameter Server
  Description: a distributed system for training ML models where the model parameters are stored and updated on a centralized server and model updates are communicated from workers to the server. In contrast to Federated Learning, the data can be managed and assigned from the central server.
  Advantages: less complexity, efficient use of resources.
  Disadvantages: single point of failure, privacy and security concerns with centralized data storage.

Method: Federated Learning
  Description: a method of training an ML model across multiple decentralized devices or servers while keeping data on the devices.
  Advantages: privacy and security of data, ability to handle large-scale decentralized data.
  Disadvantages: communication overhead, devices may have different data distributions and update rates.

Method: All-Reduce
  Description: a data-parallel distributed ML method for training ML models and implementing a Peer-to-Peer concept.
  Advantages: low implementation complexity, allows parallel computation on multiple devices.
  Disadvantages: requires communication between all devices, and may be limited in terms of scalability.

Method: Split Learning
  Description: a method of training an ML model where the model is split between two or more devices and only the output activations are transmitted between devices.
  Advantages: better privacy and security compared to traditional centralized learning, reduced communication overhead.
  Disadvantages: potential loss of accuracy due to approximation in forward computation, synchronization of model parameters across devices may be difficult.

Method: Federated Split Learning
  Description: a hybrid approach that combines elements of Federated Learning and Split Learning.
  Advantages: achieves better privacy and security compared to traditional centralized training, reduced communication overhead compared to Federated Learning.
  Disadvantages: may be more complex to implement than other methods, devices may have different data distributions and update rates.
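As an illustration of the Federated Learning entry in Table 15, the aggregation step of FedAvg described in Section VI-D (a weighted average of the local models, weighted by the amount of data on each device) can be sketched as follows. This is a minimal sketch under assumptions made here: model parameters are flat lists of floats, the function name `fedavg` and the example values are hypothetical, and local training, client sampling, and communication are omitted.

```python
# Minimal sketch of the FedAvg aggregation step on the central server.
# Assumption: each client's model is a flat list of floats of equal length.

def fedavg(client_params, client_sizes):
    """Average client parameters, weighted by each client's number of samples."""
    if len(client_params) != len(client_sizes):
        raise ValueError("one size per client is required")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Three hypothetical devices with different amounts of local data.
client_params = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
client_sizes = [10, 30, 60]
global_params = fedavg(client_params, client_sizes)
```

With these example values the device holding 60 of the 100 samples pulls the global parameters towards its own, which is the intended effect of weighting by data size.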

data rate and latency of the network differently. In some cases, the topology also restricts access to the data set.

In principle, each worker maintains an instance of the model and individually processes updates based on its assigned portion of data. The data is usually distributed at the beginning of the training. After processing a predefined number of data samples, the workers communicate their local updates to all their peers. Shortly after, they receive the updates of their peers and aggregate them with their own. This step of communication and aggregation can be repeated several times. When all updates are distributed to all workers, each worker adjusts its model instance parameters according to the aggregated updates and proceeds to produce the next local updates. The repetitive and expensive communication of updates guarantees that all workers work with the same model instance [292]. Figure 9c illustrates the communication topology of a ring All-Reduce approach.

F. SPLIT LEARNING AND INFERENCE
Split Learning (SL) [296], [297] is a model-parallel distributed ML method in which a model is split into at least two sub-models and which decouples model training from the need for direct access to the raw data. It is similar to FL, but it focuses on the case where devices have low computational power, memory constraints, or
limited energy budget. In contrast to FL, where devices typically train a model locally and send the updated model parameters to a central server, in SL, the devices only forward a feature representation of their data to the central server, which performs the model updates.

In SL, the model is split into at least two parts, with one part running on the device and the other part running on the central server. Figure 9d shows the SL representation with three devices. The key idea behind SL is that the device part of the model is lightweight and can be run on devices with low computational power instead of the entire computationally demanding model. Thus, SL enables model training and inference on devices with low computing resources.

SL over networked devices is particularly useful in scenarios where devices have low computational power and high communication bandwidth. For example, in a network of smartphones, each smartphone may have a camera that captures images, but the device may not have the computational power to process the images. SL can be used to train a model that can classify images without needing to process the images on the device.

1) Federated Split Learning
Federated Split Learning (FSL) [298], [299] is a distributed algorithm that combines the ideas of computing the weighted average, a characteristic of the FL architecture, and the neural network split between the client and server of the SL architecture. It thus combines data and model parallelism. In FSL, all clients compute in parallel and independently. They send/receive their smashed data to/from the server in parallel. The client-side sub-network synchronization, i.e., forming the global client-side network, is done by aggregating (e.g., weighted averaging) all local client-side networks on a separate server.

2) Split Computing
Splitting a neural network for inference tasks is usually called Split Computing (SC). It is very similar to SL, as a model is split into sub-models and then distributed on multiple devices communicating with each other. It is helpful in scenarios where sensor devices are resource-limited and cannot deploy full models. Instead of offloading the sensor data, the sensor can compute a part of the model and then transmit the compressed feature representation, resulting in a smaller end-to-end latency [300].

Most works focus on a simple client-server scenario. The model is then split into a head and a tail part. The client, a sensor, gathers sensor data, feeds it into the head of the model, and then transmits the feature representation to the server. The server receives the feature representation and completes the inference process using the tail. In this client-server scenario, the main challenges are to minimize the head with regard to computation and size on the client, as sensors have limited resources, and to minimize the amount of communication while making sure that the model does not lose too much accuracy.

Matsubara et al. [301] provide a comprehensive survey describing many proposed methods to optimize SC. It also contains links to code where available. With sc2bench [302], there is also a pip package to test and compare several SC techniques while providing a framework to start creating your own method.

VII. FURTHER READINGS
Several related survey and tutorial papers exist that cover parts of the interplay between ML and networking to a varying extent and on varying scales of granularity. Table 16 lists the most related of these papers while highlighting their ML scope, covered network applications, and whether they focus on ML for Networks (ML4N) or Networks for ML (N4ML).

Perhaps the most comprehensive survey on ML for Networks, [306] discusses ML approaches for a wide range of networking challenges and provides further references to more specialized surveys about ML approaches in certain networking domains. The work of [309] considers itself as an update to [306], covering more recent developments and discussing recent IDS datasets. Additionally, several surveys consider ML approaches for a subset of networked systems, such as vehicular networks in [315], [316], Software-Defined Networks (SDN) in [305], mobile/wireless/ubiquitous networks in [4], [277], edge computing [303], [304] or network traffic monitoring and analysis [313]. The work of [307] takes a unique stance and provides the joint application of recent ML and Blockchain technologies for networking problems. Other surveys focus on specific ML subdomains such as unsupervised learning [29], deep learning [270] or distributed ML [308].

The works presented in [310] and [311] specifically consider the role of FL in networking. While [310] discusses several FL applications in the domain of communications and networking, [311] focuses on mobile edge computing but also discusses how communication techniques influence FL methods. The studies [283], [312] provide an overview of various applications of ML methods through IoT systems and analyze various approaches on how ML models can be distributed and processed in the cloud-to-things continuum. The survey [282] discusses the convergence of edge computing methods and ML; specifically, it provides a comprehensive view of how networking can be utilized for cooperative processing of deep learning models on edge devices. The survey [301] provides insights into how networked devices such as smartphones and autonomous vehicles are used for collaborative training of ML models and for inference operations over the network using split computing and early exit methods.
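As a rough illustration of the client-server split computing setup described in Section VI-F, the following sketch runs a small head on the client, transmits a compressed feature representation, and completes inference with the tail on the server. Everything in it is an assumption made for illustration: the toy weights, the function names, and the use of 8-bit quantization as the compression step are hypothetical and are not taken from [300], [301] or from the sc2bench package, and the network transfer is replaced by passing a bytes object.

```python
import struct


def matvec(weights, vector):
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


# Hypothetical toy weights: the head maps 8 sensor values to 2 features,
# the tail maps those 2 features to 3 class scores.
HEAD_W = [
    [0.5, -0.2, 0.1, 0.0, 0.3, -0.1, 0.2, 0.4],
    [-0.3, 0.6, 0.0, 0.2, -0.1, 0.5, 0.1, -0.2],
]
TAIL_W = [
    [1.0, -1.0],
    [-0.5, 1.5],
    [0.2, 0.2],
]


def client_head(sensor_reading):
    """Client side: compute the (non-negative) feature representation."""
    return relu(matvec(HEAD_W, sensor_reading))


def compress(features):
    """Quantize each feature to one byte and prepend the float32 scale."""
    peak = max(features)
    scale = peak / 255.0 if peak > 0 else 1.0
    codes = bytes(round(f / scale) for f in features)
    return struct.pack("<f", scale) + codes


def decompress(payload):
    """Server side: recover approximate features from the payload."""
    (scale,) = struct.unpack("<f", payload[:4])
    return [code * scale for code in payload[4:]]


def server_tail(payload):
    """Server side: finish inference and return the predicted class index."""
    scores = matvec(TAIL_W, decompress(payload))
    return max(range(len(scores)), key=scores.__getitem__)


reading = [0.9, 0.1, 0.4, 0.7, 0.2, 0.8, 0.5, 0.3]
payload = compress(client_head(reading))        # what the client transmits
raw_size = len(struct.pack("<8f", *reading))    # size of offloading raw data
predicted_class = server_tail(payload)
print(len(payload), raw_size, predicted_class)
```

The sketch only shows the trade-off named in the text: a smaller head output reduces the transmitted payload relative to offloading the raw reading, while quantization introduces an approximation error that the tail has to tolerate.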

TABLE 16: Selective surveys & tutorials on using ML for Networks (ML4N) and Networks for ML (N4ML)
Paper | Network Application | ML Scope | N4ML or ML4N
[303], [304] | edge computing: video processing, autonomous vehicles, smart homes, privacy, security, and caching | distributed training/inference and model compression: DNN, FL, Transfer Learning (TL) and shallow ML | both
[305], [306] | prediction and configuration: routing, congestion control, traffic prediction/classification, and security | shallow ML, Deep Learning (DL) and RL | ML4N
[282] | edge, fog, and cloud computing: intelligent manufacturing, security, real-time video processing, smart city | distributed ML, FL, RL, TL | N4ML
[270] | wireless and mobile computing (e.g., user localization, mobility/mobile data analysis, signal processing) | DNN: e.g., CNN, RNN and GAN; and deep RL | ML4N
[307] | traffic classification, routing optimization, resource management, security | shallow ML, DL, RL | ML4N
[4] | wireless: Virtual Reality (VR), Mobile Edge Computing (MEC), Internet of Things (IoT) and Unmanned Aerial Vehicle (UAV) | DNN: e.g., Spiking Neural Network (SNN) and RNN | both
[308] | 5G: privacy and resource allocation | distributed ML, model compression | both
[301] | edge computing: video/audio processing, autonomous vehicles, smart homes, industrial IoT | Split Computing (SC), distributed ML, Early Exit (EEoI) | N4ML
[29] | traffic classification, network optimization, and Intrusion Detection System (IDS) | unsupervised ML | ML4N
[309] | IDS, routing, network optimization, and resource allocation | supervised and unsupervised ML, RL and DNN | ML4N
[310] | IoT, vehicular networks, edge computing, wireless networks, UAV swarms | FL, RL | ML4N
[311] | edge computing: privacy, security and model compression | FL | both
[283], [312] | distributed computing: smart cities, autonomous vehicles, smart homes, industrial IoT | distributed ML, FL, EEoI | N4ML
[277] | mobile communication: privacy, security, robustness, and other exemplary applications | distributed learning: FL, multi-agent and parameter server | both
[313] | traffic prediction/classification and security | DNN: e.g., CNN, RNN and GAN | ML4N
[314] | IoT: resource management and traffic classification | shallow ML and DL | ML4N
[315], [316] | vehicular network: security, privacy, and resource allocation | shallow ML, DL, TL and RL | ML4N
[317] | wireless/6G (e.g., signal detection, channel, and sensor network diagnostics, antenna selection) | explainable DL (e.g., symbolic representation, feature visualization, model reduction) | ML4N

Concerning the role of XAI in networking, the amount of survey work is limited. The work of [318] motivates the usage of XAI methods in networking challenges but only covers a single concrete problem. While there exist survey papers on XAI [319] and Explainable Reinforcement Learning (XRL) [320] in general (i.e., not limited to networking), to the best of our knowledge only [317] surveys XAI techniques in the domain of networking, namely in challenges related to wireless/6G.

VIII. CHALLENGES AND FUTURE DIRECTIONS
The adoption of ML in networks also brings forth several challenges and opens up exciting future directions for research and development for both ML for Networks and Networks for ML. In this section we touch on some of these challenges, while we refer to [321] for further discussions on limitations and challenges of ML and, more specifically, of applying ML for Networks [322], [323]. The following are some of the current challenges in ML for Networks:
• Scalability: One of the critical challenges in ML for Networks is scaling up models to handle large-scale networks with millions of nodes and edges. Most ML approaches are initially developed and tested for small-scale networks to better debug them and understand their effect on individual network components. However, making them work at scale is not always trivial because large-scale network structures might lead to computation time explosions (as has been indicated e.g. for SDN in [324]), especially for problems where global decisions are taken in a centralized manner.
• Limited data: One of the challenges in ML for Networks is the limited amount of data available for training. Collecting and labeling network data is a time-consuming and costly process, and in some cases, data may be proprietary or sensitive, making it difficult to obtain.
• Interpretability: Another challenge is the lack of interpretability of ML models. In many cases, it is difficult to understand how a model arrived at a particular decision or prediction, making it challenging to debug or troubleshoot issues.
• Heterogeneous data: Networks often contain heterogeneous data from multiple sources, such as text, images, and numerical data. Incorporating this data into ML models and designing models that can effectively handle heterogeneous data is another challenge that requires further research.
• Robustness: ML models are vulnerable to attacks and adversarial examples, especially in network environments where data may be noisy or corrupted.
• Real-time decision-making in closed-loop systems: In many Network Control Systems (NCSs) environments, decisions must be made in real-time, requiring efficient and fast ML model inference [325], [326]. Developing algorithms that can make accurate but fast decisions in real-time is a significant challenge in ML for Networks. One of the core problems is the potential for unstable system behavior caused by a mismatch between the intended NCS sampling time and the time required for inference of an ML model. As a result, input delays affect the resulting system, which must be handled carefully [327]. Hence, there is a trade-off between large models that can handle large-scale networks (the first mentioned challenge) and the required time for their inference. In general, the inference time required by AI and ML models will be a non-trivial function of the resulting closed-loop system in which they are embedded. For RL, delays due to model inference can be explicitly included in the modeling, resulting in the notion of real-time MDPs and real-time RL algorithms [328]. Beyond cyber-physical closed-loop systems, model inference delay impacts user experience when prompt LLMs, IoT or VR services are run via edge computing networks [329]. In other words, in these cases, the system loop is closed via human feedback, where unstable behavior will eventually result in performance loss.
• Energy efficiency: ML models often require significant computational resources, which can be challenging in resource-constrained network environments. As the current trend points towards ever-increasing model scales, energy efficiency might become an even more important aspect in even more situations.
• Privacy and security: Networks can contain sensitive and private data, which requires ML algorithms to be developed with strong privacy and security safeguards. ML algorithms for networks must maintain data privacy while providing accurate predictions.
• Network complexity: Computer Networks can be highly complex and dynamic, with large numbers of interconnected nodes, an interplay of various different protocols and changing operation conditions. This makes it challenging to develop accurate ML models, since formulating ML problems for complex application domains or sub-problems where suitable training data is available often requires several simplifying and/or narrowing assumptions at the start [330]. Leaving out such assumptions one by one brings ML systems closer to deployment in real-world scenarios, but often is a non-trivial task that brings unexpected challenges in every step along the way.

On the other hand, the challenges related to Networks for ML include:
• Resource constraints: ML algorithms often require significant computational resources, including processing power, memory, and storage. Moreover, the training of ML models requires large amounts of data, and transferring this data across networks can be time-consuming and resource-intensive. This can be a challenge in resource-constrained networks, such as those in IoT devices and edge computing environments, or when specialized networking hardware disallows certain compute operations. In addition, storing data in a centralized location can create a bottleneck and security issues.
• Latency: Network latency can affect the performance of ML algorithms, particularly in real-time applications where decisions must be made quickly. High latency can lead to delays in data transmission and processing, which can negatively impact the accuracy and effectiveness of the algorithm [331].
• Bandwidth: ML algorithms often require large amounts of bandwidth to transfer data, and this can be a challenge in networks with limited bandwidth. High bandwidth requirements can also lead to increased costs for network infrastructure in a real-world deployment.
• Network topology: The topology of a network can impact the performance of ML algorithms. For example, networks with high levels of congestion or interference may not be suitable for real-time applications.
• Privacy and security: ML algorithms require access to data, which can create potential privacy and security risks, increasing the risk of data breaches and cyber-attacks during transmission over the network or remote processing of user data.
• Heterogeneous resources: The computing and communication resources in devices used for processing ML algorithms over the network may vary widely, leading to unstable training processes. Furthermore, this can lead to the presence of slower devices (stragglers) that slow down the training of a global model and affect the model’s efficiency.

As mentioned earlier in Section VI, some of these challenges may overlap, such as privacy and security. Overall, ML for Networks and Networks for ML are
rapidly growing fields with many challenges and opportunities for future research. Addressing these challenges will require collaboration between researchers from different disciplines. In the following sections, we will discuss some of the trending applications that focus on these challenges.

A. A NEW PARADIGM FOR NEXT-GENERATION WIRELESS NETWORKS
The rapid advancement of AI and ML technologies has also opened up new vistas for next-generation wireless networks like 5G Advanced and 6G. These next-generation networks essentially serve two purposes: data transport and service delivery. They comprise various types of devices, from User Equipments (UEs), base stations, switches, and routers to servers in a data center. With the integration of SDN and Network Function Virtualization (NFV), all devices can now constantly adapt to new situations, such as changing traffic patterns, better function placements, or new service demands, and incorporate AI and ML [332]. These technologies promise to revolutionize the way we design and manage wireless networks, leading to the emergence of AI-native networks and AI-native air interfaces.

On the one hand, AI-native networks are networks designed with AI integration at their core, rather than as an afterthought or add-on. Hence, AI (partially) replaces human-defined rules, models, and algorithms, which may not be optimal or scalable for the complex and dynamic wireless scenarios, so that these networks can learn, adapt, and optimize themselves autonomously and intelligently.

On the other hand, an AI-native air interface is an air interface that uses AI and ML to define and configure its physical and medium access control layer parameters, such as waveforms, constellations, pilots, coding, modulation, synchronization, channel estimation, equalization, detection, decoding, and access schemes [333].

One of the main challenges here is the complexity and heterogeneity of wireless networks. This complexity makes it difficult to collect, process, and analyze data in real-time [334]. However, this can be mitigated by using distributed AI engines, which can process data closer to the source and reduce latency. Another challenge is the lack of standardized frameworks and architectures for implementing AI in networks. To address this challenge, industry and academia collaborate to develop standardized AI frameworks and tools that can be used across different networks [335], [336]. There are four aspects to address this challenge [337]:

1) Data infrastructure
A distributed data infrastructure that can handle massive amounts of varied, distributed, and dynamic data, and enable data ingestion, processing, and exposure across layers and domains.

2) Intelligence everywhere
A comprehensive and automated management of AI models, from training to deployment to monitoring, and the ability to handle model drift, retraining, and versioning. This would take place for every network layer and on every network device.

3) Zero touch
A high degree of automation and autonomy for the management of AI and data, and the ability to express and supervise high-level goals rather than low-level actions.

4) AI as a service
The exposure of AI and data services to external parties, such as service providers or customers, and the creation of a platform for innovation and collaboration.

For further readings on the evaluation metrics of such networks, we refer to [338]. The authors in [339] also provide a road-map with potential frameworks to build such networks.

B. DEEP NEURAL NETWORKS MODEL COMPLEXITY AND ENERGY CONSUMPTION
The increasing complexity of DNNs has direct implications on energy consumption, a critical factor in both environmental sustainability and practical deployment [340]. The complexity of DNNs is largely driven by the depth and breadth of the network architecture. As DNNs grow deeper (with more layers) and wider (with more neurons in each layer), they can capture more intricate patterns in data. This increased capacity, while potentially beneficial for model accuracy, leads to a higher number of computations during both the training and inference phases [282]. Each computation requires a certain amount of energy, and thus, as models grow more complex, their energy requirements escalate.

The energy consumption of DNNs is a multifaceted issue. Training DNNs is an energy-intensive process that requires substantial computational resources [341]. This phase often necessitates the use of high-performance GPUs or even clusters of GPUs, which are power-hungry devices [342]. The electricity consumption during this phase is considerable, contributing to the overall energy footprint of developing DNNs. The inference phase, where DNNs make predictions on new data, also demands a considerable amount of energy [343]. This phase is critical in real-world applications where continuous or on-demand operation of DNNs is required, such as in autonomous systems or real-time analysis applications.

The substantial energy consumption of DNNs poses a significant challenge for environmental sustainability. As these networks become more prevalent across various
sectors, the need for energy-efficient neural network architectures and training methods becomes increasingly important [344]. In energy-constrained environments (e.g., with battery-operated devices) the energy demands of DNNs are a crucial consideration. This has led to a focus on balancing model complexity with energy efficiency, driving innovation in optimization techniques and the development of specialized hardware to run these models more efficiently [345]. Moreover, different models and benchmarks are used to estimate and plan the energy consumption of DNNs [346]–[348].

C. TINY MACHINE LEARNING
Tiny Machine Learning (TinyML) is an emerging field that combines ML with ultra-low power computing, typically found in microcontrollers and small IoT devices [349]. Its goal is to deploy efficient ML models that can operate in environments with limited memory, processing power, and energy. This is particularly relevant for applications where traditional ML models would be impractical due to their size and energy requirements.

The primary motivation for TinyML is the need for localized data processing, rather than transmitting the data to a centralized server or cloud, especially in situations where privacy, speed, and power efficiency are critical [350]. This applies to many applications, spanning from smart home devices and wearable technology to healthcare monitoring and environmental sensors [283].

The core implementation of TinyML relies on ML model quantization, which reduces the model's numerical precision and size. Hence, implementing TinyML in environments with limited resources presents several ongoing challenges: the low computational capabilities and storage capacities of smaller devices restrict the complexity of the models that can be deployed [351]. This constraint can adversely affect the efficacy and precision of TinyML-based applications. To address this, some research suggests the integration of cooperative ML (Section VI) and TinyML approaches [343], [352]. This strategy would enable devices with constrained resources to work collaboratively on ML tasks. Moreover, progress in hardware development, particularly in creating more efficient microcontrollers and sensors, is expected to broaden the range of possible applications for TinyML. For a recent survey of TinyML applications and techniques, we refer to [353].

IX. CONCLUSION
The aim of this paper is to provide interested but inexperienced readers with an inspiring and practical jumpstart for research in the intersection of ML and computer networking. This encompasses not only the creation of novel ML-powered solutions for covered networking scenarios but also leveraging established networking technology to enhance existing ML approaches.

Compared to the aforementioned surveys and tutorials (Section VII), we are the first to provide a comprehensive bidirectional overview of ML and XAI techniques across different networking fields, and vice versa (see footnote 43). Furthermore, in addition to an overview of the current state of the art, our work provides practical guidance for aspiring researchers to shortcut their way into meaningful research:
• Many of the mentioned related papers do not consider datasets and/or starting points to reproduce the results or even to just start experimenting. In contrast, we refer to publicly available datasets as well as methods and tools to generate synthetic datasets (Section IV) and design ML models suitable for the respective task.
• We categorize existing approaches as ML serving networks (ML4N) and networks serving ML (N4ML) based on the used metrics, which helps to identify research gaps and possible future directions of research.

43 We do not aim for a comprehensive review of state-of-the-art research in ML or its sub-disciplines, as there are numerous survey and tutorial resources that provide an excellent ML-focused overview. Rather, we view ML techniques solely in relation to networking, either as facilitators (ML for Networks) or beneficiaries (Networks for ML).

We introduced the most popular ML techniques, model types, and tools as well as several practical aspects to consider when practicing ML such as obtaining high-quality data for the learning algorithm, or the incorporation of inductive biases (more specifically for networking data and network topologies) into ML models in order to reduce resource requirements. Secondly, we introduced the most common computer networking problem domains and pointed to existing tools and datasets to accelerate and facilitate ML research on networking problems.

Thirdly, we introduced how XAI methods can improve the transparency of ML models’ decisions and thus push their acceptance in the computer networks research domain and their suitability in productive environments. We also elaborated on how networking techniques can boost the performance of existing ML setups and workflows, e.g. through several approaches for distributed learning.

Lastly, we provided a large number of pointers for further reading, such as surveys on more specific ML/networking domains, example research works for some of the problems introduced in this paper or links to many of the mentioned datasets or tools.

Despite our comprehensive coverage of established tools, approaches, and recent breakthroughs, it is important to acknowledge the dynamic nature of ML research. The field is characterized by the emergence of new algorithms, the potential availability of additional tools and features in the future, and the hopeful prospect
of more open-sourced datasets. While this evolution is happening at an unprecedented pace, this paper still serves as a valuable starting point for researchers and newcomers alike and provides a timely and relevant contribution to the intersection of the fields of ML and computer networking.

REFERENCES
[1] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J. Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay, and J. J. Collins, “A Deep Learning Approach to Antibiotic Discovery,” Cell, vol. 180, no. 4, pp. 688–702.e13, Feb. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0092867420301021
[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis With Latent Diffusion Models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.html
[3] A. Davies, P. Veličković, L. Buesing, S. Blackwell, D. Zheng, N. Tomašev, R. Tanburn, P. Battaglia, C. Blundell, A. Juhász, M. Lackenby, G. Williamson, D. Hassabis, and P. Kohli, “Advancing mathematics by guiding human intuition with AI,” Nature, vol. 600, no. 7887, pp. 70–74, Dec. 2021. [Online]. Available: https://www.nature.com/articles/s41586-021-04086-x
[4] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Artificial neural networks-based machine learning for wireless networks: A tutorial,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3039–3071, 2019.
[5] S. J. Russell, Artificial intelligence a modern approach. Pearson Education, Inc., 2010.
[6] M. F. Ahmad Fauzi, R. Nordin, N. F. Abdullah, and H. A. H. Alobaidy, “Mobile network coverage prediction based on supervised machine learning algorithms,” IEEE Access, vol. 10, pp. 55 782–55 793, 2022.
[7] C. Ioannou and V. Vassiliou, “Classifying security attacks

svm,” in 2016 IEEE international congress on big data (BigData Congress). IEEE, 2016, pp. 402–409.
[16] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and computing, vol. 14, pp. 199–222, 2004.
[17] C.-Y. Hsu, P.-Y. Chen, S. Lu, S. Liu, and C.-M. Yu, “Adversarial examples can be effective data augmentation for unsupervised machine learning,” 2021.
[18] D. Kim and J. Choi, “Self-supervised learning for binary networks by joint classifier training,” CoRR, vol. abs/2110.08851, 2021. [Online]. Available: https://arxiv.org/abs/2110.08851
[19] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Dbscan revisited, revisited: why and how you should (still) use dbscan,” ACM Transactions on Database Systems (TODS), vol. 42, no. 3, pp. 1–21, 2017.
[20] J. Li, H. Izakian, W. Pedrycz, and I. Jamal, “Clustering-based anomaly detection in multivariate time series data,” Applied Soft Computing, vol. 100, p. 106919, 2021.
[21] I. Ullah and H. Y. Youn, “Task classification and scheduling based on k-means clustering for edge computing,” Wireless Personal Communications, vol. 113, pp. 2611–2624, 2020.
[22] Z. Fan and R. Liu, “Investigation of machine learning based network traffic classification,” in 2017 International Symposium on Wireless Communication Systems (ISWCS). IEEE, 2017, pp. 1–6.
[23] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
[24] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433–459, 2010.
[25] C. Fefferman, S. Mitter, and H. Narayanan, “Testing the manifold hypothesis,” Journal of the American Mathematical Society, vol. 29, no. 4, pp. 983–1049, 2016.
[26] U. Narayanan, A. Unnikrishnan, V. Paul, and S. Joseph, “A survey on various supervised classification algorithms,” in 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), 2017, pp. 2118–2124.
[27] J. E. van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Mach Learn, vol. 109, no. 2, pp. 373–440, Feb. 2020. [Online]. Available: https://doi.org/10.1007/s10994-019-05855-6
[28] M. A. Alsheikh, S. Lin, D. Niyato, and H.-P. Tan, “Machine learning in wireless sensor networks: Algorithms, strategies,
in iot networks using supervised learning,” in 2019 15th In- and applications,” IEEE Communications Surveys & Tuto-
ternational Conference on Distributed Computing in Sensor rials, vol. 16, no. 4, pp. 1996–2018, 2014.
Systems (DCOSS), 2019, pp. 652–658. [29] M. Usama, J. Qadir, A. Raza, H. Arif, K.-L. A. Yau,
[8] W. Hu, Y. Liao, and V. R. Vemuri, “Robust anomaly Y. Elkhatib, A. Hussain, and A. Al-Fuqaha, “Unsupervised
detection using support vector machines,” in Proceedings of machine learning for networking: Techniques, applications
the international conference on machine learning. Morgan and research challenges,” IEEE access, vol. 7, pp. 65 579–
Kaufmann Publishers, 2003, pp. 282–289. 65 615, 2019.
[9] B. Mohammed, I. Awan, H. Ugail, and M. Younas, “Fail- [30] Z. Ghahramani, “Probabilistic machine learning and
ure prediction using machine learning in a virtualised hpc artificial intelligence,” Nature, vol. 521, no. 7553, pp. 452–
system and application,” Cluster Computing, vol. 22, pp. 459, May 2015. [Online]. Available: https://www.nature.
471–485, 2019. com/articles/nature14541
[10] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and [31] K. P. Murphy, Probabilistic Machine Learning: An
B. Scholkopf, “Support vector machines,” IEEE Intelligent introduction. MIT Press, 2022. [Online]. Available:
Systems and their applications, vol. 13, no. 4, pp. 18–28, https://probml.github.io/pml-book/book1.html
1998. [32] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang,
[11] S. B. Kotsiantis, “Decision trees: a recent overview,” Artifi- and J. Tang, “Self-Supervised Learning: Generative or Con-
cial Intelligence Review, vol. 39, pp. 261–283, 2013. trastive,” IEEE Transactions on Knowledge and Data Engi-
[12] L. Breiman, “Random forests,” Machine learning, vol. 45, neering, vol. 35, no. 1, pp. 857–876, Jan. 2023.
pp. 5–32, 2001. [33] F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-Supervised
[13] G. Shakhnarovich, T. Darrell, and P. Indyk, “Nearest- Visual Planning with Temporal Skip Connections,” in Con-
neighbor methods in learning and vision,” IEEE Trans. ference on Robot Learning (CoRL), 2017.
Neural Networks, vol. 19, no. 2, p. 377, 2008. [34] S. Meyn, Control systems and reinforcement learning. Cam-
[14] M. Nasri and M. Hamdi, “Lte qos parameters prediction bridge University Press, 2022.
using multivariate linear regression algorithm,” in 2019 22nd [35] Y. Xu, G. Gui, H. Gacanin, and F. Adachi, “A survey on
conference on innovation in clouds, internet and networks resource allocation for 5g heterogeneous networks: Current
and workshops (ICIN). IEEE, 2019, pp. 145–150. research, future trends, and challenges,” IEEE Communica-
[15] A. Y. Nikravesh, S. A. Ajila, C.-H. Lung, and W. Ding, tions Surveys & Tutorials, vol. 23, no. 2, pp. 668–695, 2021.
“Mobile network traffic prediction using mlp, mlpwd, and

[36] M. M. Sadeeq, N. M. Abdulkareem, S. R. Zeebaree, D. M. Ahmed, A. S. Sami, and R. R. Zebari, “Iot and cloud computing issues, challenges and opportunities: A review,” Qubahan Academic Journal, vol. 1, no. 2, pp. 1–7, 2021.
[37] P. Kumar and R. Kumar, “Issues and challenges of load balancing techniques in cloud computing: A survey,” ACM Computing Surveys (CSUR), vol. 51, no. 6, pp. 1–35, 2019.
[38] A. Alwarafy, M. Abdallah, B. S. Ciftler, A. Al-Fuqaha, and M. Hamdi, “Deep reinforcement learning for radio resource allocation and management in next generation heterogeneous wireless networks: A survey,” arXiv preprint arXiv:2106.00574, 2021.
[39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2018.
[40] D. Bertsekas, Dynamic programming and optimal control: Volume II. Athena scientific, 2012, vol. 2.
[41] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996, vol. 5.
[42] T. M. Moerland, J. Broekens, A. Plaat, C. M. Jonker et al., “Model-based reinforcement learning: A survey,” Foundations and Trends® in Machine Learning, vol. 16, no. 1, pp. 1–118, 2023.
[43] Y.-P. Hsu, E. Modiano, and L. Duan, “Age of information: Design and analysis of optimal scheduling algorithms,” in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 561–565.
[44] Q. Sykora, M. Ren, and R. Urtasun, “Multi-agent routing value iteration network,” in International Conference on Machine Learning. PMLR, 2020, pp. 9300–9310.
[45] S. S. Mwanje, L. C. Schmelz, and A. Mitschele-Thiel, “Cognitive cellular networks: A q-learning framework for self-organizing networks,” IEEE Transactions on Network and Service Management, vol. 13, no. 1, pp. 85–98, 2016.
[46] Y. Kim, S. Kim, and H. Lim, “Reinforcement learning based resource management for network slicing,” Applied Sciences, vol. 9, no. 11, p. 2361, 2019.
[47] H. Afifi and H. Karl, “Reinforcement learning for virtual network embedding in wireless sensor networks,” in 2020 16th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob). IEEE, 2020, pp. 123–128.
[48] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, J. P. How et al., “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Foundations and Trends® in Machine Learning, vol. 6, no. 4, pp. 375–451, 2013.
[49] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in neural information processing systems, vol. 12, 1999.
[50] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Reinforcement learning, pp. 5–32, 1992.
[51] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
[52] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1928–1937.
[53] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proceedings of the 15th ACM workshop on hot topics in networks, 2016, pp. 50–56.
[54] C. Zhong, Z. Lu, M. C. Gursoy, and S. Velipasalar, “A deep actor-critic reinforcement learning framework for dynamic multichannel access,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 4, pp. 1125–1139, 2019.
[55] S. Tuli, S. Ilager, K. Ramamohanarao, and R. Buyya, “Dynamic scheduling for stochastic edge-cloud computing environments using a3c learning and residual recurrent neural networks,” IEEE transactions on mobile computing, vol. 21, no. 3, pp. 940–954, 2020.
[56] M. Chen, T. Wang, K. Ota, M. Dong, M. Zhao, and A. Liu, “Intelligent resource allocation management for vehicles network: An a3c learning approach,” Computer Communications, vol. 151, pp. 485–494, 2020.
[57] S. Still and D. Precup, “An information-theoretic approach to curiosity-driven reinforcement learning,” Theory in Biosciences, vol. 131, pp. 139–148, 2012.
[58] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random network distillation,” arXiv preprint arXiv:1810.12894, 2018.
[59] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 157–163.
[60] T. Gabel, “Multi-agent reinforcement learning approaches for distributed job shop scheduling problems,” Ph.D. dissertation, Osnabrück, Univ., Diss., 2009, 2009.
[61] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, and S. Spanò, “Multi-agent reinforcement learning: A review of challenges and applications,” Applied Sciences, vol. 11, no. 11, p. 4948, 2021.
[62] T. Li, K. Zhu, N. C. Luong, D. Niyato, Q. Wu, Y. Zhang, and B. Chen, “Applications of multi-agent reinforcement learning in future internet: A comprehensive survey,” IEEE Communications Surveys & Tutorials, 2022.
[63] E. Altman, Constrained Markov decision processes. CRC press, 1999, vol. 7.
[64] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll, “A review of safe reinforcement learning: Methods, theory and applications,” arXiv preprint arXiv:2205.10330, 2022.
[65] A. Avranas, M. Kountouris, and P. Ciblat, “Deep reinforcement learning for wireless scheduling with multiclass services,” CoRR, vol. abs/2011.13634, 2020. [Online]. Available: https://arxiv.org/abs/2011.13634
[66] S. Khairy, P. Balaprakash, L. X. Cai, and Y. Cheng, “Constrained deep reinforcement learning for energy sustainable multi-uav based random access iot networks with NOMA,” CoRR, vol. abs/2002.00073, 2020. [Online]. Available: https://arxiv.org/abs/2002.00073
[67] C. Sun, C. She, and C. Yang, “Optimizing ultra-reliable and low-latency communication systems with unsupervised learning,” CoRR, vol. abs/2006.01641, 2020. [Online]. Available: https://arxiv.org/abs/2006.01641
[68] Constrained Unsupervised Learning for Wireless Network Optimization. Cambridge University Press, 2022, p. 182–211.
[69] D. Wu, L. Deng, Z. Liu, Y. Zhang, and Y. S. Han, “Reinforcement learning random access for delay-constrained heterogeneous wireless networks: A two-user case,” in 2021 IEEE Globecom Workshops (GC Wkshps), 2021, pp. 1–7.
[70] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[71] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017. [Online]. Available: https://dl.acm.org/doi/10.1145/3065386
[72] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[73] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics, vol. 5, no. 4, p. 115–133, Dec 1943.

[74] S. Sharma, S. Sharma, and A. Athaiya, “Activation functions in neural networks,” Towards Data Sci, vol. 6, no. 12, pp. 310–316, 2017.
[75] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, Jan. 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0893608089900208
[76] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural networks for perception. Elsevier, 1992, pp. 65–93.
[77] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[78] G. Lan, First-order and stochastic optimization methods for machine learning. Springer, 2020, vol. 1.
[79] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges,” May 2021. [Online]. Available: http://arxiv.org/abs/2104.13478
[80] R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in Conference on learning theory. PMLR, 2016, pp. 907–940.
[81] T. Gruber, S. Cammerer, J. Hoydis, and S. Ten Brink, “On deep learning-based channel decoding,” in 2017 51st annual conference on information sciences and systems (CISS). IEEE, 2017, pp. 1–6.
[82] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for wireless resource management,” in 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2017, pp. 1–6.
[83] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, “Deep learning approach for network intrusion detection in software defined networking,” in 2016 international conference on wireless networks and mobile communications (WINCOM). IEEE, 2016, pp. 258–263.
[84] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[85] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1724–1734. [Online]. Available: http://aclweb.org/anthology/D14-1179
[86] P. Veličković, “Everything is Connected: Graph Neural Networks,” Jan. 2023. [Online]. Available: http://arxiv.org/abs/2301.08210
[87] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds., vol. 27. Curran Associates, Inc., 2014. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
[88] C. Han, H. Hayashi, L. Rundo, R. Araki, W. Shimoda, S. Muramatsu, Y. Furukawa, G. Mauri, and H. Nakayama, “Gan-based synthetic brain mr image generation,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 734–738.
[89] Y. Chen, Y. Pan, T. Yao, X. Tian, and T. Mei, “Mocycle-gan: Unpaired video-to-video translation,” in Proceedings of the 27th ACM international conference on multimedia, 2019, pp. 647–655.
[90] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 022–17 033, 2020.
[91] A. Cheng, “Pac-gan: Packet generation of network traffic using generative adversarial networks,” in 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2019, pp. 0728–0734.
[92] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[93] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[94] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[95] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[96] C. Joshi, “Transformers are graph neural networks,” The Gradient, vol. 12, 2020.
[97] D. K. Kholgh and P. Kostakos, “Pac-gpt: A novel approach to generating synthetic network traffic with gpt-3,” IEEE Access, 2023.
[98] N. Ziems, G. Liu, J. Flanagan, and M. Jiang, “Explaining tree model decisions in natural language for network intrusion detection,” arXiv preprint arXiv:2310.19658, 2023.
[99] T. Ali and P. Kostakos, “Huntgpt: Integrating machine learning-based anomaly detection and explainable ai with large language models (LLMs),” arXiv preprint arXiv:2309.16021, 2023.
[100] S. K. Mani, Y. Zhou, K. Hsieh, S. Segarra, T. Eberl, E. Azulai, I. Frizler, R. Chandra, and S. Kandula, “Enhancing network management using code generated by large language models,” in Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, 2023, pp. 196–204.
[101] Y. Huang, H. Du, X. Zhang, D. Niyato, J. Kang, Z. Xiong, S. Wang, and T. Huang, “Large language models for networking: Applications, enabling techniques, and challenges,” arXiv preprint arXiv:2311.17474, 2023.
[102] J. Sun, Q. V. Liao, M. Muller, M. Agarwal, S. Houde, K. Talamadupula, and J. D. Weisz, “Investigating explainability of generative ai for code through scenario-based design,” in 27th International Conference on Intelligent User Interfaces, ser. IUI ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 212–228. [Online]. Available: https://doi.org/10.1145/3490099.3511119
[103] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2024.
[104] Cisco, Cisco Unveils Next-Gen Solutions that Empower Security and Productivity with Generative AI, 2023. [Online]. Available: https://newsroom.cisco.com/c/r/newsroom/en/us/a/y2023/m06/cisco-unveils-next-gen-solutions-that-empower-security-and-productivity-with-generative-ai.html
[105] Juniper, AI for IT Operations (AIOps), 2023. [Online]. Available: https://www.juniper.net/us/en/solutions/artificial-intelligence-for-it-operations-aiops.html
[106] O. Santos. (2023) Securing ai: Navigating the complex landscape of models, fine-tuning, and rag. [Online]. Available: https://blogs.cisco.com/security/securing-ai-navigating-the-complex-landscape-of-models-fine-tuning-and-rag
[107] I. Cisco Systems. (2023) Cisco ai assistant - cisco. [Online]. Available: https://www.cisco.com/site/us/en/solutions/artificial-intelligence/ai-assistant/index.html
[108] S. Thrun and A. Schwartz, “Issues in using function approximation for reinforcement learning,” in Proceedings of the Fourth Connectionist Models Summer School, vol. 255. Hillsdale, NJ, 1993, p. 263.

[109] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
[110] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[111] A. Ramaswamy, S. Bhatnagar, and N. Saxena, “A framework for provably stable and consistent training of deep feedforward networks,” arXiv preprint arXiv:2305.12125, 2023.
[112] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS one, vol. 12, no. 4, p. e0172395, 2017.
[113] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Advances in neural information processing systems, vol. 30, 2017.
[114] A. Redder, A. Ramaswamy, and H. Karl, “3dpg: Distributed deep deterministic policy gradient algorithms for networked multi-agent systems,” arXiv preprint arXiv:2201.00570v2, 2022.
[115] C. Qiu, H. Yao, F. R. Yu, F. Xu, and C. Zhao, “Deep q-learning aided networking, caching, and computing resources allocation in software-defined satellite-terrestrial networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 6, pp. 5871–5883, 2019.
[116] S. Schneider, R. Khalili, A. Manzoor, H. Qarawlus, R. Schellenberg, H. Karl, and A. Hecker, “Self-learning multi-objective service coordination using deep reinforcement learning,” IEEE Transactions on Network and Service Management, vol. 18, no. 3, pp. 3829–3842, 2021.
[117] A. Redder, A. Ramaswamy, and D. E. Quevedo, “Deep reinforcement learning for scheduling in large-scale networked control systems,” IFAC-PapersOnLine, vol. 52, no. 20, pp. 333–338, 2019.
[118] H. Afifi, A. Ramaswamy, and H. Karl, “Reinforcement learning for autonomous vehicle movements in wireless sensor networks,” in ICC 2021-IEEE International Conference on Communications. IEEE, 2021, pp. 1–6.
[119] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, “Q-learning algorithms: A comprehensive classification and applications,” IEEE access, vol. 7, pp. 133 653–133 667, 2019.
[120] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
[121] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[122] H. Riiser, P. Vigmostad, C. Griwodz, and P. Halvorsen, “Commute path bandwidth traces from 3g networks: analysis and applications,” in Proceedings of the 4th ACM Multimedia Systems Conference, 2013, pp. 114–118.
[123] X. Zuo, J. Yang, M. Wang, and Y. Cui, “Adaptive bitrate with user-level qoe preference for video streaming,” in Proceedings of the IEEE International Conference on Computer Communications (INFOCOM). IEEE, 2022, pp. 1279–1288.
[124] J. Van Der Hooft, S. Petrangeli, T. Wauters, R. Huysegems, P. R. Alface, T. Bostoen, and F. De Turck, “Http/2-based adaptive streaming of hevc video over 4g/lte networks,” IEEE Communications Letters, vol. 20, no. 11, pp. 2177–2180, 2016.
[125] L. Zhang, Y. Zhang, X. Wu, F. Wang, L. Cui, Z. Wang, and J. Liu, “Batch adaptative streaming for video analytics,” in Proceedings of the IEEE International Conference on Computer Communications (INFOCOM). IEEE, 2022, pp. 2158–2167.
[126] A. Alhilal, T. Braud, B. Han, and P. Hui, “Nebula: Reliable low-latency video transmission for mobile cloud gaming,” in Proceedings of the ACM Web Conference 2022, 2022, pp. 3407–3417.
[127] D. Raca, J. J. Quinlan, A. H. Zahran, and C. J. Sreenan, “Beyond throughput: A 4g lte dataset with channel and context metrics,” in Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 460–465.
[128] S. Farthofer, M. Herlich, C. Maier, S. Pochaba, J. Lackner, and P. Dorfinger, “An open mobile communications drive test data set and its use for machine learning,” IEEE Open Journal of the Communications Society, vol. 3, pp. 1688–1701, 2022.
[129] J. Wu, L. Wang, Q. Pei, X. Cui, F. Liu, and T. Yang, “Hitdl: High-throughput deep learning inference at the hybrid mobile edge,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4499–4514, 2022.
[130] D. Raca, D. Leahy, C. J. Sreenan, and J. J. Quinlan, “Beyond throughput, the next generation: a 5g dataset with channel and context metrics,” in Proceedings of the 11th ACM Multimedia Systems Conference, 2020, pp. 303–308.
[131] 3rd Party, “Geant/abilene network topology data and traffic traces,” 2020. [Online]. Available: ^1^
[132] [Online]. Available: https://www.geni.net/
[133] “Caida data - completed datasets,” Nov 2020. [Online]. Available: https://www.caida.org/catalog/datasets/completed-datasets/
[134] N. Spring, R. Mahajan, D. Wetherall, and T. Anderson, “Measuring isp topologies with rocketfuel,” IEEE/ACM Transactions on Networking, vol. 12, no. 1, pp. 2–16, 2004.
[135] “The internet topology zoo.” [Online]. Available: http://www.topology-zoo.org/dataset.html
[136] M. Roughan, “A case study of the accuracy of snmp measurements,” Journal of Electrical and Computer Engineering, vol. 2010, p. 812979, May 2010. [Online]. Available: https://doi.org/10.1155/2010/812979
[137] J. Kua, G. Armitage, and P. Branch, “A survey of rate adaptation techniques for dynamic adaptive streaming over http,” IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1842–1866, 2017.
[138] G. Zhou, R. Wu, M. Hu, Y. Zhou, T. Z. Fu, and D. Wu, “Vibra: Neural adaptive streaming of vbr-encoded videos,” in Proceedings of the 31st ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, 2021, pp. 1–8.
[139] Y. Yuan, W. Wang, Y. Wang, S. S. Adhatarao, B. Ren, K. Zheng, and X. Fu, “Vsim: Improving qoe fairness for video streaming in mobile environments,” in Proceedings of the IEEE International Conference on Computer Communications (INFOCOM). IEEE, 2022, pp. 1309–1318.
[140] S. Lederer, C. Müller, and C. Timmerer, “Dynamic adaptive streaming over http dataset,” in Proceedings of the 3rd Multimedia Systems Conference, 2012, pp. 89–94.
[141] S. Lederer, C. Mueller, C. Timmerer, C. Concolato, J. Le Feuvre, and K. Fliegel, “Distributed dash dataset,” in Proceedings of the 4th ACM multimedia systems conference, 2013, pp. 131–135.
[142] J. Le Feuvre, J.-M. Thiesse, M. Parmentier, M. Raulet, and C. Daguet, “Ultra high definition hevc dash data set,” in Proceedings of the 5th ACM Multimedia Systems Conference, 2014, pp. 7–12.
[143] A. Zabrovskiy, C. Feldmann, and C. Timmerer, “Multi-codec dash dataset,” in Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 438–443.
[144] A. Chandramohan, M. Poel, B. Meijerink, and G. Heijenk, “Machine learning for cooperative driving in a multi-lane highway environment,” in 2019 Wireless Days (WD), 2019, pp. 1–4.
[145] L. N. Alegre, T. Ziemke, and A. L. C. Bazzan, “Using reinforcement learning to control traffic signals in a real-world scenario: An approach based on linear function approximation,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 9126–9135, 2022.
[146] C. Liu, Y. Zhang, W. Chen, F. Wang, H. Li, and Y.-D. Shen, “Adaptive matching strategy for multi-target multi-camera tracking,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2934–2938.
traffic classification,” in Proc. Int. Conf. Eng. Telecommun.,
2021, pp. 1–5.
[147] M. Maciejewski, “A comparison of microscopic traffic flow [163] G. Aceto, D. Ciuonzo, A. Montieri, V. Persico, and
simulation systems for an urban area,” Transport Problems, A. Pescapé, “Mirage: Mobile-app traffic capture and ground-
vol. 5, no. 4, pp. 29–40, 2010. truth creation,” in 2019 4th International Conference
[148] F. K. Karnadi, Z. H. Mo, and K.-c. Lan, “Rapid generation of on Computing, Communications and Security (ICCCS).
realistic mobility models for vanet,” in 2007 IEEE Wireless IEEE, 2019, pp. 1–8.
Communications and Networking Conference, 2007, pp. [164] C. Wang, A. Finamore, L. Yang, K. Fauvel, and D. Rossi,
2506–2511. “Appclassnet: A commercial-grade dataset for application
[149] M. Tsao, D. Milojevic, C. Ruch, M. Salazar, E. Frazzoli, identification research,” ACM SIGCOMM Computer Com-
and M. Pavone, “Model predictive control of ride-sharing munication Review, vol. 52, no. 3, pp. 19–27, 2022.
autonomous mobility-on-demand systems,” in 2019 Inter- [165] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and
national Conference on Robotics and Automation (ICRA), A. Hotho, “A survey of network-based intrusion detection
2019, pp. 6665–6671. data sets,” Computers & Security, vol. 86, pp. 147–167, 2019.
[150] C. M. Moyano, J. F. Ortega, and D. E. Mogrovejo, “Efficiency [166] “Datasets,” 2023. [Online]. Available: https://www.unb.ca/cic/datasets/
analysis during calibration of traffic microsimulation
models in conflicting intersections near universidad del [167] A. Dvir, Y. Zion, J. Muehlstein, O. Pele, C. Hajaj, and
azuay, using aimsun 8.1,” in MOVICI-MOYCOT 2018: Joint R. Dubin, “Robust machine learning for encrypted traffic
Conference for Urban Mobility in the Smart City, 2018, pp. classification,” arXiv preprint arXiv:1603.04865, 2016.
1–6. [168] R. Poorzare and O. P. Waldhorst, “Toward the implemen-
[151] L. Yang and W. Lan, “On secondary development of ptv- tation of mptcp over mmwave 5g and beyond: Analysis,
vissim for traffic optimization,” in 2018 13th International challenges, and solutions,” IEEE Access, feb 2023.
Conference on Computer Science & Education (ICCSE), [169] R. Poorzare and A. Calveras Augé, “Challenges on the way
2018, pp. 1–5. of implementing tcp over 5g networks,” IEEE Access, sep
[152] L. Lu, T. Yun, L. Li, Y. Su, and D. Yao, “A comparison of 2020.
phase transitions produced by paramics, transmodeler, and [170] T. Henderson, M. Lacage, G. Riley, C. Dowell, and
vissim,” IEEE Intelligent Transportation Systems Magazine, J. Kopena, “Network simulations with the ns-3 simulator,”
vol. 2, no. 3, pp. 19–24, 2010. SIGCOMM Demonstration, vol. 14, p. 527, 2008.
[153] Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, [171] M. Mezzavilla, M. Zhang, M. Polese, R. Ford, S. Dutta,
S. Wang, R. Kumar, D. Anastasiu, and J.-N. Hwang, S. Rangan, and Z. M, “End-to-end simulation of 5g mmwave
“Cityflow: A city-scale benchmark for multi-target multi- networks,” IEEE Communications Surveys & Tutorials,
camera vehicle tracking and re-identification,” in 2019 vol. 20, no. 3, pp. 2237–2263, 2018.
IEEE/CVF Conference on Computer Vision and Pattern [172] P. Gawłowicz and Zubow, “ns3-gym: Extending ope-
Recognition (CVPR), 2019, pp. 8789–8798. nai gym for networking research,” [Online]. Available:
[154] Z. Wang, B. Li, and B. Liang, “Quick: Quality-of-service im- arXiv:1810.03943.
provement with cooperative relaying and network coding,” [173] H. Yin, P. Liu, K. Liu, L. Cao, L. Zhang, Y. Gao, and X. Hei,
in 2010 IEEE International Conference on Communications. “Ns3-ai: Fostering artificial intelligence algorithms for
IEEE, 2010, pp. 1–5. networking research,” in Proceedings of the 2020 Workshop
[155] T. Mangla, E. Halepovic, M. Ammar, and E. Zegura, on Ns-3, ser. WNS3 ’20. New York, NY, USA: Association
“emimic: Estimating http-based video qoe metrics from for Computing Machinery, 2020, p. 57–64. [Online].
encrypted network traffic,” in 2018 Network Traffic Mea- Available: https://doi.org/10.1145/3389400.3389404
surement and Analysis Conference (TMA). IEEE, 2018, [174] M. Schettler, D. S. Buse, A. Zubow, and F. Dressler, “How to
pp. 1–8. train your its? integrating machine learning with vehicular
[156] C. Gutterman, K. Guo, S. Arora, X. Wang, L. Wu, E. Katz- network simulation,” in 2020 IEEE Vehicular Networking
Bassett, and G. Zussman, “Requet: Real-time qoe detection Conference (VNC), 2020, pp. 1–4.
for encrypted youtube traffic,” in Proceedings of the 10th [175] F. Ruffy, M. Przystupa, and I. Beschastnikh, “Iroko: A
ACM Multimedia Systems Conference, 2019, pp. 48–59. framework to prototype reinforcement learning for data
[157] M. Seufert, P. Casas, N. Wehner, L. Gang, and K. Li, center traffic control,” CoRR, vol. abs/1812.09975, 2018.
“Stream-based machine learning for real-time qoe analysis [Online]. Available: http://arxiv.org/abs/1812.09975
of encrypted video streaming traffic,” in 2019 22nd Con- [176] J. Charlier, A. Singh, G. Ormazabal, R. State, and
ference on innovation in clouds, internet and networks and H. Schulzrinne, “Syngan: Towards generating synthetic net-
workshops (ICIN). IEEE, 2019, pp. 76–81. work attacks using gans,” CoRR, vol. abs/1908.09899, 2019.
[158] N. Wehner, M. Ring, J. Schüler, A. Hotho, T. Hoßfeld, [177] M. Ring, D. Schlör, D. Landes, and A. Hotho, “Flow-based
and M. Seufert, “On learning hierarchical embeddings from network traffic generation using generative adversarial
encrypted network traffic,” in NOMS 2022-2022 IEEE/IFIP networks,” Comput. Secur., vol. 82, no. C, p. 156–172,
Network Operations and Management Symposium. IEEE, may 2019. [Online]. Available: https://doi.org/10.1016/j.
2022, pp. 1–7. cose.2018.12.012
[159] K. Dietz, M. Mühlhauser, M. Seufert, N. Gray, T. Hoßfeld, [178] A. Mozo, Á. González-Prieto, A. Pastor, S. Gómez-Canaval,
and D. Herrmann, “Browser fingerprinting: How to protect and E. Talavera, “Synthetic flow-based cryptomining
machine learning models and data with differential privacy?” attack generation through generative adversarial networks,”
Electronic Communications of the EASST, vol. 80, 2021. Scientific Reports, vol. 12, no. 1, p. 2091, Feb 2022. [Online].
[160] N. Wehner, M. Seufert, J. Schüler, P. Casas, and T. Hoßfeld, Available: https://doi.org/10.1038/s41598-022-06057-2
“How are your apps doing? qoe inference and analysis in [179] Y. Guo, G. Xiong, Z. Li, J. Shi, M. Cui, and G. Gou, “Com-
mobile devices,” in 2021 17th International Conference on bating imbalance in network traffic classification using gan
Network and Service Management (CNSM). IEEE, 2021, based oversampling,” in 2021 IFIP Networking Conference
pp. 49–55. (IFIP Networking), 2021, pp. 1–9.
[161] A. Azab, M. Khasawneh, S. Alrabaee, K.-K. R. Choo, [180] T. J. Anande and M. S. Leeson, “Generative adversarial
and M. Sarsour, “Network traffic classification: Techniques, networks (gans): A survey on network traffic generation,”
datasets, and challenges,” Digital Communications and Net- International Journal of Machine Learning and Computing,
works, 2022. vol. 12, no. 6, 2022.
[162] D. Shamsimukhametov, M. Liubogoshchev, E. Khorov, and [181] M. Rigaki and S. Garcia, “Bringing a gan to a knife-fight:
I. Akyildiz, “Youtube netflix web dataset for encrypted Adapting malware communication to avoid detection,” in
2018 IEEE Security and Privacy Workshops (SPW), 2018,
pp. 70–75.

[182] C. Zhang, X. Ouyang, and P. Patras, “Zipnet-gan: Inferring fine-grained mobile traffic patterns via a generative adversarial neural network,” CoRR, vol. abs/1711.02413, 2017. [Online]. Available: http://arxiv.org/abs/1711.02413
[183] B. Dowoo, Y. Jung, and C. Choi, “Pcapgan: Packet capture file generator by style-based generative adversarial networks,” in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019, pp. 1149–1154.
[184] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry, “Implementation matters in deep rl: A case study on ppo and trpo,” in International Conference on Learning Representations, 2020.
[185] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 12 348–12 355, 2021.
[186] S. Huang, R. F. J. Dossa, C. Ye, J. Braga, D. Chakraborty, K. Mehta, and J. G. Araújo, “Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms,” Journal of Machine Learning Research, vol. 23, no. 274, pp. 1–18, 2022.
[187] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, “Rllib: Abstractions for distributed reinforcement learning,” 2017. [Online]. Available: https://arxiv.org/abs/1712.09381
[188] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in neural information processing systems, vol. 30, 2017.
[189] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[190] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
[191] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International conference on machine learning. PMLR, 2018, pp. 1587–1596.
[192] H. Mania, A. Guy, and B. Recht, “Simple random search provides a competitive approach to reinforcement learning,” arXiv preprint arXiv:1803.07055, 2018.
[193] W. Dabney, M. Rowland, M. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[194] S. Huang, R. F. J. Dossa, A. Raffin, A. Kanervisto, and W. Wang, “The 37 implementation details of proximal policy optimization,” The ICLR Blog Track 2023, 2022.
[195] A. Kuznetsov, P. Shvechikov, A. Grishin, and D. Vetrov, “Controlling overestimation bias with truncated mixture of continuous distributional quantile critics,” in International Conference on Machine Learning. PMLR, 2020, pp. 5556–5566.
[196] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897.
[197] S. Huang and S. Ontañón, “A closer look at invalid action masking in policy gradient algorithms,” arXiv preprint arXiv:2006.14171, 2020.
[198] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International conference on machine learning. PMLR, 2017, pp. 449–458.
[199] K. W. Cobbe, J. Hilton, O. Klimov, and J. Schulman, “Phasic policy gradient,” in International Conference on Machine Learning. PMLR, 2021, pp. 2020–2027.
[200] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” arXiv preprint arXiv:1712.01815, 2017.
[201] Q. Wang, J. Xiong, L. Han, H. Liu, T. Zhang et al., “Exponentially weighted imitation learning for batched historical data,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[202] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.
[203] Z. Wang, A. Novikov, K. Zolna, J. S. Merel, J. T. Springenberg, S. E. Reed, B. Shahriari, N. Siegel, C. Gulcehre, N. Heess et al., “Critic regularized regression,” Advances in Neural Information Processing Systems, vol. 33, pp. 7768–7778, 2020.
[204] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” arXiv preprint arXiv:1912.01603, 2019.
[205] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., “Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures,” in International conference on machine learning. PMLR, 2018, pp. 1407–1416.
[206] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in International conference on learning representations, 2019.
[207] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
[208] E. Ie, V. Jain, J. Wang, S. Narvekar, R. Agarwal, R. Wu, H.-T. Cheng, T. Chandra, and C. Boutilier, “Slateq: a tractable decomposition for reinforcement learning with recommendation sets,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 2592–2599.
[209] E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, “Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames,” arXiv preprint arXiv:1911.00357, 2019.
[210] TensorFlow, “Tensorboard: A unified platform for visualizing live, rich data for tensorflow models,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016.
[211] L. Biewald, “Experiment tracking with weights and biases,” 2020, software available from wandb.com. [Online]. Available: https://www.wandb.com/
[212] Comet.ml, “Comet.ml: Machine learning operations platform,” https://www.comet.ml/, 2018.
[213] A. Chen, A. Chow, A. Davidson, A. DCunha, A. Ghodsi, S. A. Hong, A. Konwinski, C. Mewald, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, A. Singh, F. Xie, M. Zaharia, R. Zang, J. Zheng, and C. Zumar, “Developments in mlflow: A system to accelerate the machine learning lifecycle,” in Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, ser. DEEM’20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3399579.3399867
[214] I. Habibie, M. Kleinsorge, Z. Al-Ars, J. Schneider, W. Kessler, and T. Kuhlen, “Visdom: A Tool for Visualization and Monitoring of Machine Learning Experiments,” in ArXiv e-prints, Mar. 2017.
[215] Microsoft, “Tensorwatch,” https://github.com/microsoft/tensorwatch, 2021.
[216] M. R. Asia, “Nni (neural network intelligence): An open-source automl toolkit for neural architecture search and hyper-parameter tuning,” GitHub repository, 2021. [Online]. Available: https://github.com/microsoft/nni
[217] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

[218] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, “Tune: A research platform for distributed model selection and training,” 2018. [Online]. Available: https://arxiv.org/abs/1807.05118
[219] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu, “Population based training of neural networks,” 2017. [Online]. Available: https://arxiv.org/abs/1711.09846
[220] J. Bergstra, D. Yamins, and D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 1. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 115–123. [Online]. Available: https://proceedings.mlr.press/v28/bergstra13.html
[221] R. Ostrovskiy and A. Gordon, “Keras tuner,” GitHub repository, 2020. [Online]. Available: https://github.com/keras-team/keras-tuner
[222] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems, 2013, pp. 2546–2554. [Online]. Available: http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization
[223] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in Neural Information Processing Systems, 2015, pp. 2962–2970. [Online]. Available: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning
[224] M. Merenda, C. Porcaro, and D. Iero, “Edge machine learning for ai-enabled iot devices: A review,” Sensors, vol. 20, no. 9, p. 2533, 2020.
[225] A.-S. Tonneau, N. Mitton, and J. Vandaele, “A survey on (mobile) wireless sensor network experimentation testbeds,” in 2014 IEEE International Conference on Distributed Computing in Sensor Systems. IEEE, 2014, pp. 263–268.
[226] M. Chernyshev, Z. Baig, O. Bello, and S. Zeadally, “Internet of things (iot): Research, simulators, and testbeds,” IEEE Internet of Things Journal, vol. 5, no. 3, pp. 1637–1647, 2017.
[227] S. Zhu, S. Yang, X. Gou, Y. Xu, T. Zhang, and Y. Wan, “Survey of testing methods and testbed development concerning internet of things,” Wireless Personal Communications, pp. 1–30, 2022.
[228] R. Lim, F. Ferrari, M. Zimmerling, C. Walser, P. Sommer, and J. Beutel, “Flocklab: A testbed for distributed, synchronized tracing and profiling of wireless embedded systems,” in Proceedings of the 12th international conference on Information processing in sensor networks, 2013, pp. 153–166.
[229] R. Trüb, R. Da Forno, L. Daschinger, A. Biri, J. Beutel, and L. Thiele, “Non-intrusive distributed tracing of wireless iot devices with the flocklab 2 testbed,” ACM Transactions on Internet of Things, vol. 3, no. 1, pp. 1–31, 2021.
[230] C. Adjih, E. Baccelli, E. Fleury, G. Harter, N. Mitton, T. Noel, R. Pissard-Gibollet, F. Saint-Marcel, G. Schreiner, J. Vandaele et al., “Fit iot-lab: A large scale open experimental iot testbed,” in 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT). IEEE, 2015, pp. 459–464.
[231] M. Schuß, C. A. Boano, M. Weber, and K. Römer, “A competition to push the dependability of low-power wireless protocols to the edge,” in Proceedings of the 14th International Conference on Embedded Wireless Systems and Networks (EWSN). Junction Publishing, Feb. 2017, pp. 54–65.
[232] D. Molteni, G. P. Picco, M. Trobinger, and D. Vecchia, “Cloves: A large-scale ultra-wideband testbed,” in Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, ser. SenSys ’22. New York, NY, USA: Association for Computing Machinery, 2023, p. 808–809. [Online]. Available: https://doi.org/10.1145/3560905.3568072
[233] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman, “Planetlab: An overlay testbed for broad-coverage services,” SIGCOMM Comput. Commun. Rev., vol. 33, no. 3, p. 3–12, jul 2003. [Online]. Available: https://doi.org/10.1145/956993.956995
[234] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar, “An integrated experimental environment for distributed systems and networks,” SIGOPS Oper. Syst. Rev., vol. 36, no. SI, p. 255–270, dec 2003. [Online]. Available: https://doi.org/10.1145/844128.844152
[235] M. Berman, J. S. Chase, L. H. Landweber, A. Nakao, M. Ott, D. Raychaudhuri, R. Ricci, and I. Seskar, “Geni: A federated testbed for innovative network experiments,” Comput. Networks, vol. 61, pp. 5–23, 2014.
[236] L. Yang, F. Wen, J. Cao, and Z. Wang, “Edgetb: A hybrid testbed for distributed machine learning at the edge with high fidelity,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 10, pp. 2540–2553, 2022.
[237] F. Hussain, R. Hussain, and E. Hossain, “Explainable artificial intelligence (xai): An engineering perspective,” arXiv preprint arXiv:2101.03613, 2021.
[238] S. Mukherjee, J. Rupe, and J. Zhu, “Xai for communication networks,” in 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, 2022, pp. 359–364.
[239] C. Liaskos, S. Nie, A. Tsioliaridou, A. Pitsillides, S. Ioannidis, and I. Akyildiz, “End-to-end wireless path deployment with intelligent surfaces using interpretable neural networks,” IEEE Transactions on Communications, vol. 68, no. 11, pp. 6792–6806, 2020.
[240] A.-D. Marcu, S. K. G. Peesapati, J. M. Cortes, S. Imtiaz, and J. Gross, “Explainable artificial intelligence for energy-efficient radio resource management,” in 2023 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2023, pp. 1–6.
[241] P. Barnard, I. Macaluso, N. Marchetti, and L. A. DaSilva, “Resource reservation in sliced networks: an explainable artificial intelligence (xai) approach,” in ICC 2022-IEEE International Conference on Communications. IEEE, 2022, pp. 1530–1535.
[242] A. Palaios, C. L. Vielhaus, D. F. Külzer, C. Watermann, R. Hernangomez, S. Partani, P. Geuer, A. Krause, R. Sattiraju, M. Kasparick, G. Fettweis, F. H. P. Fitzek, H. D. Schotten, and S. Stanczak, “The story of qos prediction in vehicular communication: From radio environment statistics to network-access throughput prediction,” 2023.
[243] S. Hariharan, A. Velicheti, A. Anagha, C. Thomas, and N. Balakrishnan, “Explainable artificial intelligence in cybersecurity: A brief review,” in 2021 4th International Conference on Security and Privacy (ISEA-ISAP). IEEE, 2021, pp. 1–12.
[244] N. Capuano, G. Fenza, V. Loia, and C. Stanzione, “Explainable artificial intelligence in cybersecurity: A survey,” IEEE Access, vol. 10, pp. 93 575–93 600, 2022.
[245] C. Molnar, Interpretable Machine Learning, 2nd ed., 2022. [Online]. Available: https://christophm.github.io/interpretable-ml-book
[246] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins et al., “Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai,” Information fusion, 2020.
[247] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
[248] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature machine intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[249] T. Shapira and Y. Shavitt, “Flowpic: Encrypted internet traffic classification is as easy as image recognition,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2019, pp. 680–687.
[250] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
[251] M. T. Ribeiro, S. Singh, and C. Guestrin, “"why should i trust you?" explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144.
[252] H. Nori, S. Jenkins, P. Koch, and R. Caruana, “Interpretml: A unified framework for machine learning interpretability,” arXiv preprint arXiv:1909.09223, 2019.
[253] R. Agarwal, L. Melnick, N. Frosst, X. Zhang, B. Lengerich, R. Caruana, and G. E. Hinton, “Neural additive models: Interpretable machine learning with neural nets,” Advances in Neural Information Processing Systems, vol. 34, pp. 4699–4711, 2021.
[254] N. Wehner, A. Seufert, T. Hoßfeld, and M. Seufert, “Explainable Data-Driven QoE Modelling with XAI,” in 2023 15th International Conference on Quality of Multimedia Experience (QoMEX), 2023, pp. 1–6.
[255] K. Brunnström, S. A. Beker, K. De Moor, A. Dooms, S. Egger, M.-N. Garcia, T. Hossfeld, S. Jumisko-Pyykkö, C. Keimel, M.-C. Larabi et al., “Qualinet white paper on definitions of quality of experience,” 2013.
[256] N. Wehner, M. Seufert, J. Schuler, S. Wassermann, P. Casas, and T. Hossfeld, “Improving web qoe monitoring for encrypted network traffic through time series modeling,” ACM SIGMETRICS Performance Evaluation Review, vol. 48, no. 4, pp. 37–40, 2021.
[257] E. Hüllermeier and W. Waegeman, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,” Machine Learning, vol. 110, pp. 457–506, 2021.
[258] A. F. Psaros, X. Meng, Z. Zou, L. Guo, and G. E. Karniadakis, “Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons,” Journal of Computational Physics, vol. 477, p. 111902, 2023.
[259] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Advances in neural information processing systems, vol. 30, 2017.
[260] V. Dignum, Responsible artificial intelligence: how to develop and use AI in a responsible way. Springer, 2019, vol. 2156.
[261] Y. Siriwardhana, P. Porambage, M. Liyanage, and M. Ylianttila, “Ai and 6g security: Opportunities and challenges,” in 2021 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit). IEEE, 2021, pp. 616–621.
[262] Q. Lu, L. Zhu, X. Xu, J. Whittle, D. Zowghi, and A. Jacquet, “Responsible ai pattern catalogue: A collection of best practices for ai governance and engineering,” ACM Comput. Surv., oct 2023, just Accepted. [Online]. Available: https://doi.org/10.1145/3626234
[263] W. Yang, H. Le, S. Savarese, and S. C. Hoi, “Omnixai: A library for explainable ai,” arXiv preprint arXiv:2206.01612, 2022.
[264] V. Arya, R. K. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović et al., “Ai explainability 360 toolkit,” in Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD), 2021, pp. 376–379.
[265] J. Klaise, A. Van Looveren, G. Vacanti, and A. Coca, “Alibi explain: Algorithms for explaining machine learning models,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 8194–8200, 2021.
[266] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International conference on machine learning. PMLR, 2017, pp. 3319–3328.
[267] N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan et al., “Captum: A unified and generic model interpretability library for pytorch,” arXiv preprint arXiv:2009.07896, 2020.
[268] “TorchRay,” 2019. [Online]. Available: https://github.com/facebookresearch/TorchRay
[269] “TF-Explain,” 2019. [Online]. Available: https://tf-explain.readthedocs.io/en/latest/index.html
[270] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2224–2287, 2019.
[271] H. Hellström, J. M. B. d. Silva Jr, V. Fodor, and C. Fischione, “Wireless for machine learning,” arXiv preprint arXiv:2008.13492, 2020.
[272] D. Jin, Z. Yu, P. Jiao, S. Pan, D. He, J. Wu, P. Yu, and W. Zhang, “A survey of community detection approaches: From statistical modeling to deep learning,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[273] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” CoRR, vol. abs/1812.08434, 2018. [Online]. Available: http://arxiv.org/abs/1812.08434
[274] M. A. Ridwan, N. A. M. Radzi, F. Abdullah, and Y. E. Jalil, “Applications of machine learning in networking: A survey of current issues and future challenges,” IEEE Access, vol. 9, pp. 52 523–52 556, 2021.
[275] F. Tang, B. Mao, N. Kato, and G. Gui, “Comprehensive survey on machine learning in vehicular network: Technology, applications and challenges,” IEEE Communications Surveys & Tutorials, vol. 23, no. 3, pp. 2027–2057, 2021.
[276] E. García-Martín, C. F. Rodrigues, G. Riley, and H. Grahn, “Estimation of energy consumption in machine learning,” Journal of Parallel and Distributed Computing, vol. 134, pp. 75–88, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731518308773
[277] L. Song, X. Hu, G. Zhang, P. Spachos, K. N. Plataniotis, and H. Wu, “Networking systems of ai: On the convergence of computing and communications,” IEEE Internet of Things Journal, vol. 9, no. 20, pp. 20 352–20 381, 2022.
[278] G. Drainakis, K. V. Katsaros, P. Pantazopoulos, V. Sourlas, and A. Amditis, “Federated vs. centralized machine learning under privacy-elastic users: A comparative analysis,” in 2020 IEEE 19th International Symposium on Network Computing and Applications (NCA). IEEE, 2020, pp. 1–8.
[279] I. A. Majeed, S. Kaushik, A. Bardhan, V. S. K. Tadi, H.-K. Min, K. Kumaraguru, and R. D. Muni, “Comparative assessment of federated and centralized machine learning,” arXiv preprint arXiv:2202.01529, 2022.
[280] W. Hassan, T.-S. Chou, O. Tamer, J. Pickard, P. Appiah-Kubi, and L. Pagliari, “Cloud computing survey on services, enhancements and challenges in the era of machine learning and data science,” International Journal of Informatics and Communication Technology (IJ-ICT), vol. 9, no. 2, pp. 117–139, 2020.
[281] Y. Ko, K. Choi, H. Jei, D. Lee, and S.-W. Kim, “Aladdin: Asymmetric centralized training for distributed deep learning,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 863–872.
[282] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
[283] F. Samie, L. Bauer, and J. Henkel, “From cloud down to things: An overview of machine learning in internet of things,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4921–4934, 2019.
[284] W. Toussaint and A. Y. Ding, “Machine learning systems in the iot: Trustworthiness trade-offs for edge intelligence,” in 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI). IEEE, 2020, pp. 177–184.
[285] A. Smola and S. Narayanamurthy, “An architecture for parallel topic models,” Proceedings VLDB Endowment, vol. 3, no. 1-2, pp. 703–710, Sep. 2010.

[286] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” 2014.
[287] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” Advances in neural information processing systems, vol. 25, 2012.
[288] J. Jiang, B. Cui, C. Zhang, and L. Yu, “Heterogeneity-aware distributed parameter servers,” in Proceedings of the 2017 ACM International Conference on Management of Data, ser. SIGMOD ’17. New York, NY, USA: Association for Computing Machinery, May 2017, pp. 463–478.
[289] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
[290] Y. Liu, Y. Kang, C. Xing, T. Chen, and Q. Yang, “A secure federated transfer learning framework,” IEEE Intelligent Systems, vol. 35, no. 4, pp. 70–82, 2020.
[291] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” arXiv preprint arXiv:1802.07876, 2018.
[292] A. Gibiansky, “Bringing HPC techniques to deep learning,” Baidu Research, Tech. Rep., 2017.
[293] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” pp. 117–124, 2009.
[294] H. Zhao and J. Canny, “Butterfly mixing: Accelerating Incremental-Update algorithms on clusters,” in Proceedings of the 2013 SIAM International Conference on Data Mining (SDM), ser. Proceedings. Society for Industrial and Applied Mathematics, May 2013, pp. 785–793.
[295] X. Wan, H. Zhang, H. Wang, S. Hu, J. Zhang, and K. Chen, “Rat-resilient allreduce tree for distributed machine learning,” in 4th Asia-Pacific workshop on networking, 2020, pp. 52–57.
[296] O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” Journal of Network and Computer Applications, vol. 116, pp. 1–8, 2018.
[297] E. Samikwa, A. Di Maio, and T. Braun, “Ares: Adaptive resource-aware split learning for internet of things,” Computer Networks, vol. 218, p. 109380, 2022.
[298] V. Turina, Z. Zhang, F. Esposito, and I. Matta, “Federated or split? a performance and privacy analysis of hybrid split and federated learning architectures,” in 2021 IEEE 14th International Conference on Cloud Computing (CLOUD). IEEE, 2021, pp. 250–260.
[299] Y. Gao, M. Kim, C. Thapa, S. Abuadbba, Z. Zhang, S. Camtepe, H. Kim, and S. Nepal, “Evaluation and optimization of distributed machine learning techniques for internet of things,” IEEE Transactions on Computers, 2021.
[300] E. Samikwa, A. Di Maio, and T. Braun, “Adaptive early exit of computation for energy-efficient and low-latency machine learning over iot networks,” in 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC). IEEE, 2022, pp. 200–206.
[301] Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Computing Surveys, vol. 55, no. 5, pp. 1–30, 2022.
[302] Y. Matsubara, R. Yang, M. Levorato, and S. Mandt, “Sc2: Supervised compression for split computing,” arXiv preprint arXiv:2203.08875, 2022.
[303] M. G. S. Murshed, C. Murphy, D. Hou, N. Khan, G. Ananthanarayanan, and F. Hussain, “Machine learning at the network edge: A survey,” ACM Comput. Surv., vol. 54, no. 8, oct 2021. [Online]. Available: https://doi.org/10.1145/3469029
[304] J. Shuja, K. Bilal, W. Alasmary, H. Sinky, and E. Alanazi, “Applying machine learning techniques for caching in next-generation edge networks: A comprehensive survey,” Journal of Network and Computer Applications, vol. 181, p. 103005, 2021.
[305] J. Xie, F. R. Yu, T. Huang, R. Xie, J. Liu, C. Wang, and Y. Liu, “A survey of machine learning techniques applied to software defined networking (sdn): Research issues and challenges,” IEEE Communications Surveys & Tutorials, vol. 21, no. 1, pp. 393–430, 2018.
[306] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and O. M. Caicedo, “A comprehensive survey on machine learning for networking: evolution, applications and research opportunities,” Journal of Internet Services and Applications, vol. 9, no. 1, pp. 1–99, 2018.
[307] Y. Liu, F. R. Yu, X. Li, H. Ji, and V. C. M. Leung, “Blockchain and machine learning for communications and networking systems,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 1392–1431, 2020.
[308] O. Nassef, W. Sun, H. Purmehdi, M. Tatipamula, and T. Mahmoodi, “A survey: Distributed machine learning for 5g and beyond,” Computer Networks, vol. 207, p. 108820, 2022.
[309] M. A. Ridwan, N. A. M. Radzi, F. Abdullah, and Y. Jalil, “Applications of machine learning in networking: a survey of current issues and future challenges,” IEEE Access, vol. 9, pp. 52 523–52 556, 2021.
[310] O. A. Wahab, A. Mourad, H. Otrok, and T. Taleb, “Federated machine learning: Survey, multi-level classification, desirable criteria and future directions in communication and networking systems,” IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1342–1397, 2021.
[311] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
[312] A. Imteaj, K. Mamun Ahmed, U. Thakker, S. Wang, J. Li, and M. H. Amini, “Federated learning for resource-constrained iot devices: Panoramas and state of the art,” Federated and Transfer Learning, pp. 7–27, 2022.
[313] M. Abbasi, A. Shahraki, and A. Taherkordi, “Deep learning for network traffic monitoring and analysis (ntma): A survey,” Computer Communications, vol. 170, pp. 19–41, 2021.
[314] F. Hussain, S. A. Hassan, R. Hussain, and E. Hossain, “Machine learning for resource management in cellular and iot networks: Potentials, current solutions, and open challenges,” IEEE communications surveys & tutorials, vol. 22, no. 2, pp. 1251–1275, 2020.
[315] A. Talpur and M. Gurusamy, “Machine learning for security in vehicular networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 24, no. 1, pp. 346–379, 2021.
[316] M. A. Hossain, R. M. Noor, K.-L. A. Yau, S. R. Azzuhri, M. R. Z’aba, and I. Ahmedy, “Comprehensive survey of machine learning approaches in cognitive radio-based vehicular ad hoc networks,” IEEE Access, vol. 8, pp. 78 054–78 108, 2020.
[317] W. Guo, “Explainable artificial intelligence for 6g: Improving trust between human and machine,” IEEE Communications Magazine, vol. 58, no. 6, pp. 39–45, 2020.
[318] Y. Zheng, Z. Liu, X. You, Y. Xu, and J. Jiang, “Demystifying deep learning in networking,” in Proceedings of the 2nd Asia-Pacific Workshop on Networking, 2018, pp. 1–7.
[319] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (xai),” IEEE Access, vol. 6, pp. 52 138–52 160, 2018.
[320] A. Heuillet, F. Couthouis, and N. Díaz-Rodríguez, “Explainability in deep reinforcement learning,” Knowledge-Based Systems, vol. 214, p. 106685, 2021.
[321] A. Paleyes, R.-G. Urma, and N. D. Lawrence, “Challenges in deploying machine learning: A survey of case studies,” ACM Comput. Surv., vol. 55, no. 6, p. 114, 2022.
[322] S. Mehdizadeh and A. Shirmarz, “A comprehensive survey on machine learning using in software defined networks (sdn),” Human-Centric Intelligent Systems, vol. 3, no. 4, pp. 312–343, 2023.


[323] M. A. Ridwan, N. A. M. Radzi, F. Abdullah, and Y. E. Jalil, “Applications of machine learning in networking: A survey of current issues and future challenges,” IEEE Access, vol. 9, pp. 52 523–52 556, 2021.
[324] S. Sezer, S. Scott-Hayward, P. K. Chouhan, B. Fraser, D. Lake, J. Finnegan, N. Viljoen, M. Miller, and N. Rao, “Are we ready for sdn? implementation challenges for software-defined networks,” IEEE Communications Magazine, vol. 51, no. 7, pp. 36–43, 2013.
[325] C. Lu, A. Saifullah, B. Li, M. Sha, H. Gonzalez, D. Gunatilaka, C. Wu, L. Nie, and Y. Chen, “Real-time wireless sensor-actuator networks for industrial cyber-physical systems,” Proceedings of the IEEE, vol. 104, no. 5, pp. 1013–1024, 2015.
[326] H. Kopetz and W. Steiner, “Real-time communication,” in Real-time systems: Design principles for distributed embedded applications. Springer, 2022, pp. 177–200.
[327] D. Zhang, P. Shi, Q.-G. Wang, and L. Yu, “Analysis and synthesis of networked control systems: A survey of recent advances and challenges,” ISA transactions, vol. 66, pp. 376–392, 2017.
[328] S. Ramstedt and C. Pal, “Real-time reinforcement learning,” Advances in neural information processing systems, vol. 32, 2019.
[329] J. Mendez, K. Bierzynski, M. Cuéllar, and D. P. Morales, “Edge intelligence: concepts, architectures, applications, and future directions,” ACM Transactions on Embedded Computing Systems (TECS), vol. 21, no. 5, pp. 1–41, 2022.
[330] L. E. Lwakatare, A. Raj, I. Crnkovic, J. Bosch, and H. H. Olsson, “Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions,” Information and software technology, vol. 127, p. 106368, 2020.
[331] A. Redder, A. Ramaswamy, and H. Karl, “Stability and convergence of distributed stochastic approximations with large unbounded stochastic information delays,” arXiv preprint arXiv:2305.07091, 2023.
[332] R. Bless, B. Bloessl, M. Hollick, M. Corici, H. Karl, D. Krummacker, D. Lindenschmitt, H. D. Schotten, and L. Wimmer, “Dynamic network (re-) configuration across time, scope, and structure,” in 2022 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit). IEEE, 2022, pp. 547–552.
[333] J. Hoydis, F. A. Aoudia, A. Valcarce, and H. Viswanathan, “Toward a 6g ai-native air interface,” IEEE Communications Magazine, vol. 59, no. 5, pp. 76–81, 2021.
[334] ——, “Toward a 6g ai-native air interface,” 2021.
[335] F. Ait Aoudia, J. Hoydis, A. Valcarce, and H. Viswanathan. (2021) Toward a 6g ai-native air interface. [Online]. Available: https://www.bell-labs.com/institute/white-papers/toward-6g-ai-native-air-interface/
[336] R. . Schwarz. (2023) Enabling an ai-native air interface for 6g: Rohde & schwarz showcases ai/ml-based neural receiver with optimized modulation at brooklyn 6g summit, in collaboration with nvidia. [Online]. Available: https://www.rohde-schwarz.com/se/about/news-press/all-news/enabling-an-ai-native-air-interface-for-6g-rohde-schwarz-showcases-ai-ml-based-neural-receiver-with-optimized-modulation-at-brooklyn-6g-summit-in-collaboration-with-nvidia-press-release-detailpage_229356-1425541.html
[337] Ericsson. (2021) Defining ai native: A key enabler for advanced intelligent telecom networks. [Online]. Available: https://www.ericsson.com/en/reports-and-papers/white-papers/ai-native
[338] C. Chaccour, W. Saad, M. Debbah, Z. Han, and H. V. Poor, “Less data, more knowledge: Building next generation semantic communication networks,” 2022.
[339] C. K. Thomas, C. Chaccour, W. Saad, M. Debbah, and C. S. Hong, “Causal reasoning: Charting a revolutionary course for next-generation ai-native wireless networks,” 2023.
[340] A. Mughees, M. Tahir, M. A. Sheikh, and A. Ahad, “Towards energy efficient 5g networks using machine learning: Taxonomy, research challenges, and future research directions,” IEEE Access, vol. 8, pp. 187 498–187 522, 2020.
[341] D. Li, X. Chen, M. Becchi, and Z. Zong, “Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus,” in 2016 IEEE international conferences on big data and cloud computing (BDCloud), social computing and networking (SocialCom), sustainable computing and communications (SustainCom)(BDCloud-SocialCom-SustainCom). IEEE, 2016, pp. 477–484.
[342] M. Svedin, S. W. Chien, G. Chikafa, N. Jansson, and A. Podobas, “Benchmarking the nvidia gpu lineage: From early k80 to modern a100 with asynchronous memory transfers,” in Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, 2021, pp. 1–6.
[343] E. Samikwa, A. Di Maio, and T. Braun, “Disnet: Distributed micro-split deep learning in heterogeneous dynamic iot,” IEEE internet of things journal, 2023.
[344] Y. Chen, T.-J. Yang, J. Emer, and V. Sze, “Understanding the limitations of existing energy-efficient design approaches for deep neural networks,” Energy, vol. 2, no. L1, p. L3, 2018.
[345] E. Hossain and F. Fredj, “Editorial energy efficiency of machine-learning-based designs for future wireless systems and networks,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 3, pp. 1005–1010, 2021.
[346] R. Desislavov, F. Martínez-Plumed, and J. Hernández-Orallo, “Trends in ai inference energy consumption: Beyond the performance-vs-parameter laws of deep learning,” Sustainable Computing: Informatics and Systems, vol. 38, p. 100857, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2210537923000124
[347] T.-J. Yang, Y.-H. Chen, J. Emer, and V. Sze, “A method to estimate the energy consumption of deep neural networks,” in 2017 51st Asilomar Conference on Signals, Systems, and Computers, 2017, pp. 1916–1920.
[348] T.-J. Yang, Y.-H. Chen, and V. Sze, “Deep neural network energy estimation tool,” 1, 2017, accessed: 2024-01-25.
[349] J. Lin, W.-M. Chen, Y. Lin, C. Gan, S. Han et al., “Mcunet: Tiny deep learning on iot devices,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 711–11 722, 2020.
[350] H. Cai, C. Gan, L. Zhu, and S. Han, “Tinytl: Reduce activations, not trainable parameters for efficient on-device learning,” arXiv preprint arXiv:2007.11622, 2020.
[351] L. Heim, A. Biri, Z. Qu, and L. Thiele, “Measuring what really matters: Optimizing neural networks for tinyml,” arXiv preprint arXiv:2104.10645, 2021.
[352] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[353] Y. Abadade, A. Temouden, H. Bamoumen, N. Benamar, Y. Chtouki, and A. S. Hafid, “A comprehensive survey on tinyml,” IEEE Access, vol. 11, pp. 96 892–96 922, 2023.

HAITHAM AFIFI obtained his B.Sc. degree in Information Engineering and Technology in 2014 and his M.Sc. in Communication Engineering in 2015 from the German University in Cairo, and received his PhD at the Hasso Plattner Institute in 2023. His research interests include wireless network virtualization, reinforcement learning, and network optimization. He also has industry experience as a network engineer at Orange Business Services and as an IT consultant for integrating generative AI in network operations.


SABRINA POCHABA completed her Master’s degree in mathematics at Ruprecht-Karls-University in Heidelberg, Germany. Since 2021, she has been working as a data scientist at Salzburg Research Forschungsgesellschaft and pursuing her doctoral studies at Paris-Lodron University Salzburg. There, she works on different machine learning methods, focusing on networks and communication.

ANDREAS BOLTRES received the B.S. and M.S. degrees in Informatics from Karlsruhe Institute of Technology (KIT), Germany, in 2017 and 2021. He is currently pursuing the Ph.D. degree with the Autonomous Learning Robots Lab at KIT. His research interests include multi-agent and swarm reinforcement learning, and their applications to robotics and computer networking, in particular routing optimization and traffic engineering.

DOMINIC LANIEWSKI received the B.S. degree in information systems from the University of Münster in 2017, and the M.S. in computer science from the Osnabrück University in 2019. Currently, he is pursuing the Ph.D. degree with the Distributed Systems Group at the Osnabrück University. His research interests include machine learning for networks, QoE of streaming applications, video and point cloud streaming, and robot communications.

JANEK HABERER received the B.Sc. and M.Sc. degrees in computer science from Kiel University, Germany, in 2019 and 2021. He is currently pursuing the Ph.D. degree with the Distributed Systems Group at Kiel University. His research interests include distributed machine learning and its applications in the Internet of Things, particularly edge computing and split learning.

LEONARD PAELEKE received the B.Eng. degree in mechanical engineering in 2018 from the Berliner Hochschule für Technik (BHT) and the M.S. degree in computational engineering from the TU Berlin in 2022. Since April 2022, he has been a doctoral researcher in the Internet-Technologies and Softwarization group at the Hasso-Plattner-Institute (HPI). His research interests include machine learning for networks, networks for machine learning, distributed machine learning, and their application in mobile telecommunication networks such as 6G.

REZA POORZARE (Member, IEEE) received the B.S. and M.S. degrees in computer engineering from the Azad University of Iran, in 2010 and 2014, respectively, and the Ph.D. degree in network engineering from Universitat Politècnica de Catalunya, Barcelona, Spain, in 2022. Currently, he is a Postdoctoral Researcher with the Data-Centric Software Systems (DSS) Research Group, Institute of Applied Research, Karlsruhe University of Applied Sciences, Karlsruhe, Germany. His research interests include 5G, mmWave, wireless mobile networks, TCP, MPTCP, congestion control, and artificial intelligence.

DANIEL STOLPMANN received the B.Sc. and M.Sc. degrees in computer science and engineering from Hamburg University of Technology (TUHH), Germany, in 2017 and 2019. During his master's studies, he started working at the Institute of Communication Networks (ComNets) as a student assistant and became a research fellow after his graduation. His research interests include machine learning for communication networks, congestion control, active queue management, network coding and network emulation.

NIKOLAS WEHNER studied computer science at the University of Würzburg, Germany, where he received a Master’s degree. In 2018, he started to work as a Research Engineer at the Center for Technology Experience at the AIT Austrian Institute of Technology in Vienna, Austria. Since October 2019, he has been a doctoral researcher at the Chair of Communication Networks of the University of Würzburg. His interests are QoE of Internet applications, machine learning for networks, and user-centric communication networks.

ADRIAN REDDER received a Master of Science in Electrical Engineering from Paderborn University (UPB), Germany, in 2019, with a major in control and information theory. After his graduate studies, he was a member of the computer networks group at UPB, where he pursued a Ph.D. in computer science from 2020 to 2023 on Distributed Stochastic Approximation Algorithms to be submitted in 12/2023. Since 10/2023, he has been a member of the Automatic Control Group at UPB, for which he will serve as an academic councilor.

ERIC SAMIKWA received the M.Sc. degree in computer science and engineering from the Royal Institute of Technology (KTH), Stockholm, Sweden, in 2020. He is currently pursuing the Ph.D. degree with the Communication and Distributed Systems Group, Institute of Computer Science, University of Bern, Bern, Switzerland. His research interests are in the areas of distributed machine learning, federated learning, split learning, edge computing, and Internet of Things.

MICHAEL SEUFERT is a Full Professor at the University of Augsburg, Germany, heading the Chair of Networked Embedded Systems and Communication Systems. He received the Bachelor’s degree in economathematics and the Diploma, PhD, and Habilitation degrees in computer science from the University of Würzburg, Germany, and holds the First State Examination degree in mathematics, computer science, and education for teaching in secondary schools. His research focuses on user-centric communication networks, including QoE of Internet applications, AI/ML for QoE-aware network management, as well as performance modeling and evaluation of communication systems.
