Contents

1 Chapter 1: Introduction
  1.1 Network Management
  1.2 The Role of Machine Learning
  1.3 Why Now?
  1.4 What You Will Learn in This Book
  7.1 Background
  7.2 Large Language Models in Networking
11 Appendix: Activities
  11.1 Packet Capture
  11.2 Security
  11.3 Network Performance
  11.4 Data Acquisition
  11.5 Feature Extraction
  11.6 Training a Model
  11.7 Machine Learning Pipeline
  11.8 Naive Bayes
  11.9 Linear Regression
  11.10 Logistic Regression
  11.11 Trees and Ensembles
  11.12 Deep Learning
  11.13 Dimensionality Reduction
  11.14 Clustering
  11.15 Automation
CHAPTER 1: INTRODUCTION
The Internet is indisputably part of our critical infrastructure. In spite of this, it is prone to failures, mis-
configurations, and attacks that can lead to disruptions of connectivity and service, the impact of which can
be serious. The Internet is not a self-managing system: as problems arise, network operators must detect,
diagnose, and fix these problems so that users can enjoy good performance and uninterrupted service. This
process of network management involves monitoring the network for aberrations and fixing problems as (or
before) they arise, so that the network can continue to run well.
To maintain good performance for the applications they support, networks must continually adapt to chang-
ing conditions. For example, when a component of the network fails, either due to misconfiguration or a
hardware failure, the network must be able to detect the failure and reroute traffic around the failed compo-
nent. Similarly, when the network is congested or traffic demands change, the network must reroute traffic
around such congestion, and additional capacity may need to be provisioned. Finally, when a new application
is deployed, the network must be configured to support the application’s requirements. These tasks are often
referred to as network management.
In 2020, the COVID-19 pandemic highlighted the importance of network management—and the fact that net-
works don’t just automatically adapt to these changes (yet). Human network engineers and operators needed
to respond to the changing conditions that resulted from sudden and dramatic changes to traffic patterns.
Specifically, as much of society shifted to working from home, Internet traffic patterns changed dramatically.
Traffic that previously was exchanged on enterprise, corporate, and university networks suddenly shifted to
home networks; video conference traffic from home networks skyrocketed, placing unprecedented load and
strain on home networks and local access networks. And yet, despite predictions of its demise, the Internet
actually handled these dramatic shifts quite well.
The process of adapting to these changes, however, was not automatic. Network management tasks on both
short and long timescales ensured that the Internet continued to operate well in spite of these dramatic shifts.
In the short term, network operators needed to detect and respond to various performance issues as sudden
shifts in traffic placed unprecedented strain on parts of the network. For example, as more people used
video conferencing, traffic volumes increased between access Internet service providers and the cloud service
providers hosting Zoom, WebEx, Teams, and other video conference services, which (at least temporarily)
introduced additional congestion on the network. Similarly, as entertainment stayed home, traffic volumes to
video-on-demand services increased beyond normal levels.
In general, communications networks need to be continually maintained by humans, who are typically re-
ferred to as network operators. Network management tasks can be short-term or long-term. Short-term
tasks involve correcting equipment or infrastructure failures, responding to sudden and unexpected changes
in traffic load, and countering cyberattacks. Long-term tasks involve provisioning capacity, establishing and
satisfying service-level agreements, and making other more substantial changes to infrastructure. All of
these tasks typically require a careful process, sometimes involving simulations or models to help predict
how a particular change to network configuration or infrastructure might affect how traffic flows through the
network and, ultimately, impact user experience.
For a long time, these network management tasks have typically relied on extensive domain knowl-
edge—and a lot of trial and error. These tasks have also become more difficult and error-prone as the
systems themselves have become more complex. The need to predict the outcomes of these tasks—and
automate them—has become particularly apparent. While the networking industry had some success in cre-
ating closed-form models and simulations of protocols, applications, and systems several decades ago, these
systems today are far too complex for closed-form analysis.
The complexity of network management has increased in parallel with two other trends: (1) the ability to
collect larger amounts of data from (and about) networks; and (2) the emergence of practical machine learning
models and algorithms that can facilitate predictions about network behavior.
Before understanding how machine learning can improve the operation of communications networks, we
must first understand what machine learning is and how it might be used. Early definitions of machine
learning date to the 1950s. Arthur Samuel defined machine learning as “the ability to learn without being
explicitly programmed”. In the 1990s, Tom Mitchell refined this definition: “A computer program is said to
learn from experience E, with respect to some task T and some performance measure P, if its performance
on T, as measured by P, improves with experience E.” In this book, we will explore a wide array of machine
learning algorithms, but some of the more common themes include recognizing patterns in data, making
predictions based on past data or observations, inferring properties that cannot be directly measured, and
automatically generating synthetic data.
Figure 1.1 illustrates the potential role of machine learning in the network ecosystem. The network produces
much information for a network operator to make sense of, including raw traffic flow records, summary
statistics, and even user feedback. Measurement, which we will discuss in detail in Chapter 3, is the process
by which these raw data can be recorded and converted into useful formats. Machine learning models can
subsequently use these data to make predictions or otherwise infer information about the network, such as
the presence of an attack, the performance of a particular application, or a user’s experience.
Given inferences or predictions from one or more models, a network operator can then make decisions about
possible changes to a network. Those changes might be implemented directly (and manually) by a network
operator, or in some automated fashion, such as through a program that automatically updates the network
configuration. Recent trends in networking have called such closed-loop control a “self-driving network,”
but in fact most updates to network configuration still involve a human in the loop.
The application of machine learning to networking is also complementary to another trend in networking
called software-defined networking, which seeks to add programmatic control to network configuration and
actions applied to network traffic, such as how traffic is forwarded. Knowing what control actions a software-
defined network should take in response to various network situations requires the ability to infer, forecast,
and predict the outcomes of possible options—all of which can be provided by machine learning.
Machine learning has been applied to networking since the mid-1990s, dating back to the design of early
email spam filters, as well as early anomaly and malware detection systems on closed datasets (e.g., the
1999 KDD Cup challenge, which was based on DARPA intrusion detection data). In the mid-2000s, machine
learning applications to networking experienced
a renaissance, with particular advancements in network security and performance. In the security realm,
researchers devised techniques to use network traffic as inputs for machine learning models that detect “bot-
nets”, collections of compromised hosts that launch large-scale Internet attacks. Others found that applying
machine learning to network traffic could be a far more robust technique for detecting email spam than analyz-
ing the contents of email messages. Modeling network traffic also enabled detection of other new, previously
unseen attacks on networks and networked systems. Applications of machine learning in these domains ulti-
mately formed the basis of a wide array of commercial products in the network security space, from startups
to Fortune 500 companies.
A few years later, machine learning emerged as a feasible approach for predicting the performance of net-
worked systems that were difficult to model in closed form. For example, researchers developed machine
learning models for predicting how network configuration changes would ultimately affect the performance
of a web application, such as web search response time. Machine learning began to show significant promise
in this era, but most of the applications of machine learning remained focused on offline detection and pre-
diction.
Now, nearly 20 years later, machine learning is experiencing yet another rebirth in networking. The “de-
mocratization” of machine learning through widely available software libraries and the automation of many
machine learning pipelines make machine learning more widely accessible to those who wish to apply it
to existing datasets. To complement these developments, the emergence of programmable network teleme-
try has made it possible to gather a wide variety of network traffic data—and different representations of
that traffic—from the network, often in real-time. The combination of these two developments has created
vast possibilities for how machine learning can be applied to networking, from long-term prediction and
forecasting, to short-term diagnosis, to real-time adaptation to changing conditions.
This book offers a practical introduction to network measurement, machine learning methods and models,
and applications of machine learning to network management.
We will explore the details of network measurement, including the different types of network data that are
available from current network devices, from routers to middleboxes, effective techniques for gathering and
acquiring network data, and what is possible to learn and infer from network data of varying types and
sources. We will also discuss new directions in network measurement and data representation, including
advances in network telemetry and ways of representing network data for input to machine learning models.
Many of the concepts in this book will be introduced through examples and exercises. The Appendix
includes an activity for performing simple analysis of a packet capture. This example will give you a
chance to get familiar with basic network data and how to load this type of data into a Jupyter notebook.
We will provide more details about network data in subsequent chapters.
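As a preview of what that activity involves, the sketch below loads a capture into a pandas DataFrame for
analysis. It assumes a local file named example.pcap and the scapy library; both stand in for whatever
capture and tooling you use in the activity itself.

    # A minimal sketch of loading a packet capture into a pandas DataFrame,
    # assuming a local file named "example.pcap" and the scapy library.
    import pandas as pd
    from scapy.all import rdpcap, IP

    packets = rdpcap("example.pcap")       # read the entire capture into memory
    rows = []
    for pkt in packets:
        if IP in pkt:                      # keep only IP packets
            rows.append({
                "time": float(pkt.time),   # capture timestamp (seconds)
                "src": pkt[IP].src,
                "dst": pkt[IP].dst,
                "length": len(pkt),        # total packet length in bytes
            })

    df = pd.DataFrame(rows)
    print(df.head())
    print("Total packets:", len(df), "Total bytes:", df["length"].sum())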
Equipped with a better understanding of how various data is acquired from the network, we will then turn to
an overview of machine learning methods. This book is not designed to provide the detailed mathematical
foundations behind each of the methods—there are plenty of books out there for that already. Instead, this
book is meant to give you exposure to a variety of machine learning methods, provide an intuitive under-
standing of how these methods work, and show examples of how these methods can be applied in practice,
including when particular models may or may not be appropriate to apply to a particular dataset. Because
a large focus of the book is on how to apply models, we will focus significant attention on data acquisition
and representation. As we will see, the choice of how to represent the data to a model is often as important
as the choice of the model itself.
This practical view of machine learning for networking necessitates a focus on the entire machine learning
pipeline, from data acquisition to model selection, deployment, and maintenance. Whereas most machine
learning courses and textbooks focus on the modeling aspects of machine learning in isolation, many factors
contribute to the overall effectiveness of using machine learning to solve networking problems. As such, we
will explore all aspects of machine learning pipelines in the networking context with example applications
to network management tasks.
While this book will introduce you to many basic concepts in machine learning, our focus is on the application
of machine learning to networking problems. For a fuller introduction to machine learning, we recommend
the book below by Hastie et al.
Further Reading
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning: Data Min-
ing, Inference, and Prediction. 2nd ed. Springer, 2009.
CHAPTER 2
Various tasks in computer networking rely on drawing inferences from data gathered from networks. Gen-
erally speaking, these networking tasks fall into one of the following categories:
• Security. Attack detection seeks to develop methods to detect and identify a wide class of attacks.
Common examples of attack detection include spam filtering, botnet detection, denial of service de-
tection, and malicious website detection. Anomaly detection (also known as novelty detection)
seeks to identify unusual traffic, network events, or activity based on deviations from “normal” ob-
servations. Device and application identification aims to identify or classify network applications or
devices based on observed measurements. Common fingerprinting applications include application
identification (i.e., identifying applications from observed traffic that is typically encrypted), website
fingerprinting (i.e., identifying the web page or website that a user is visiting from encrypted traffic)
and device identification (i.e., identifying the hardware, operating system, or other system details
about a device that is connected to the network).
• Performance Inference. Performance diagnosis seeks to infer the performance properties of a network
device, appliance, or application. Performance diagnosis can be short-term (e.g., detection of a sudden
degradation in application quality) or long-term (e.g., determination of the need to provision additional
capacity in the network). Performance prediction aims to determine or predict the effects of the network
(or its applications) in response to a particular change in infrastructure and environment. Such “what
if” predictions could be concerned with predicting the possible outcomes from a failure (e.g., what
happens to traffic load or application performance if a particular component fails) or from changes to
configuration (e.g., what happens to application response time or overall reliability if more capacity is
added at a particular data center).
• Resource Allocation. Many networking applications need to make decisions about how to allocate
resources, often in real time. One common resource allocation problem is congestion control, the
process by which the operating system or application at an end host determines a sending rate that
results in both high network utilization and fairness to competing flows. Another common resource
allocation problem concerns video encoding, whereby video streaming applications (e.g., video on
demand, video conferencing) need to determine the appropriate encoding rate given observed network
conditions.
The rest of this chapter explores each of these classes of application in more detail and also introduces
methods that are commonly used to address problems in each area.
2.1 Security
One of the earliest applications of machine learning to networking was in the area of network security. Specif-
ically, a technique known as Naive Bayes classification was used to detect email spam based on the contents
of the email messages. Since that first application, machine learning techniques have been applied to detect a
variety of attacks, including denial of service, phishing, botnets, and others. In many cases, machine learning
has been applied to detect a network-based attack post hoc, but in some cases machine learning has even been
used to predict attacks before they happen, as discussed below.
Some aspects of attack detection and anomaly detection overlap. A distinction between attack detection and
anomaly detection is that attack detection tends to involve supervised learning techniques and labeled data,
whereas anomaly detection often draws on unsupervised techniques and unlabeled data. Anomaly detection
applications can also be broader than attack detection—for example, anomaly detection could entail detecting
a misconfiguration or component failure not related to a malicious attack.
Activity: Security
The Appendix provides an opportunity to explore a security application in the context of network traffic
analysis, exploring how features of attack traffic differ from normal web browsing activities.
One of the earliest forms of messaging on the Internet was electronic mail (“email”). The design of the
Internet made it possible for anyone to send a message to anyone else on the Internet as long as that recipient’s
email address was known. Because the marginal cost of sending an email is so low, an ecosystem of spammers
developed; email spam dates back to the 1990s and at one point constituted more than 90% of all email traffic
on the Internet. To combat the growing tide of email spam, system administrators developed spam filters to
automatically distinguish spam from legitimate email based on the words that were included in the contents
of the mail. The intuition behind these initial spam filters is relatively straightforward: spam email is more
likely to contain certain words that do not appear in legitimate email. Based on this simple observation, it
was relatively straightforward to train simple Naive Bayes classifiers based on existing corpora of email that
had already been labeled as spam or “ham” (not spam). Such a Naive Bayes classifier could then take a new
email and classify it based on the frequencies of different words in the email.
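A minimal version of this idea can be expressed in a few lines. The sketch below assumes scikit-learn and
uses a tiny, made-up corpus in place of a real labeled email dataset; it is meant only to illustrate the
word-frequency formulation, not a production filter.

    # A sketch of a word-frequency spam classifier, assuming scikit-learn and a
    # tiny hand-labeled toy corpus (the messages below are made up).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    messages = [
        "cheap pills limited offer click now",      # spam
        "win a free prize claim your reward",       # spam
        "meeting moved to 3pm see agenda",          # ham
        "lunch tomorrow? let me know",              # ham
    ]
    labels = ["spam", "spam", "ham", "ham"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)                     # learn per-class word frequencies

    print(model.predict(["claim your free prize now"]))       # likely 'spam'
    print(model.predict(["agenda for tomorrow's meeting"]))   # likely 'ham'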
Over time, however, spammers adapted to content-based spam filtering and began to create messages that
confused and evaded these filters. Common approaches included creating emails with “hidden” words (e.g.,
white text on white background) that would increase the likelihood that a spam message would be classified
as legitimate. One of the more significant challenges with content-based spam filters is that they are relatively
easy to evade in this fashion. Over time, the security industry worked to develop more robust filters based
on features that are more difficult to evade. One such development was the use of network-level features
in spam classifiers, such as the Internet service provider or autonomous system (AS) from which a spam
message originated, the distance the message traveled, the density of mail servers in the IP address space of
the email sender and so forth. These features ultimately proved to be more robust, particularly when used in
conjunction with content-based features.
Ultimately, similar classes of features became useful for detecting other types of network attacks, including
botnets. A pioneering use of machine learning in network security, in particular, was the use of domain name
system (DNS) traffic to detect large networks of compromised hosts, or so-called botnets. Various behaviors
characteristic of botnets make them particularly amenable to detection via machine learning-based classifiers.
Notably, many botnets rely on some form of “command and control” (C&C) infrastructure from which they
receive commands to launch different forms of attack (e.g., denial of service, spam, phishing). Coordinating
such activity from a C&C—which is often at least logically centralized—typically results in DNS lookups to
the C&C server that do not follow the patterns of normal DNS lookup activity. Such lookup characteristics
can thus be used to distinguish “normal” Internet hosts from those that may be members of a botnet.
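The sketch below illustrates the kind of per-client lookup features such a detector might compute, assuming
DNS query logs have been loaded into a pandas DataFrame with hypothetical columns client, qname, and ts;
the specific features and values are illustrative rather than a recipe.

    # A sketch of simple per-client DNS lookup features. The log below is made up:
    # one row per lookup, with the client IP, the queried name, and a timestamp.
    import pandas as pd

    dns = pd.DataFrame({
        "client": ["10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9"],
        "qname":  ["example.com", "x9f3kq0vz81mmt7a2pqlb0.example.net",
                   "update.example.org", "example.com"],
        "ts":     [0.0, 0.4, 0.9, 12.0],
    })

    grouped = dns.groupby("client")
    features = pd.DataFrame({
        "queries":        grouped.size(),               # total lookups per client
        "unique_names":   grouped["qname"].nunique(),   # distinct names queried
        "long_name_frac": grouped["qname"].apply(       # fraction of unusually long
            lambda s: (s.str.len() > 30).mean()),       # names (a crude DGA-style signal)
    })
    span = (grouped["ts"].max() - grouped["ts"].min()).clip(lower=1.0)
    features["query_rate"] = features["queries"] / span  # lookups per active second

    print(features)   # one feature vector per client, ready for a classifier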
Applying machine learning to DNS traffic to detect malware and botnets has become an active area of research
and innovation over the past 15 years. Because DNS traffic has been unencrypted and visible to Internet
service providers (or anyone operating a recursive DNS resolver), and because DNS traffic has been such
a good predictor of common forms of Internet attack, it has been appealing and relatively straightforward
to develop machine learning-based attack detection methods based on DNS traffic. Companies could sell
security appliances to Internet service providers, enterprise networks, or other institutions that would simply
“listen” to DNS traffic and detect unusual behavior. However, more recent trends—notably the evolution of
DNS to rely on encrypted transport and application protocols (i.e., DNS-over-TLS and DNS-over-HTTPS)
and the deployment of these technologies in more operating systems and browsers—mean that DNS traffic is
becoming more difficult for certain parties to observe and (e.g., in the case of Oblivious DNS) more difficult
to attribute a DNS lookup to the IP address that performed the lookup. Developing new machine learning
techniques to perform attack detection that are less reliant on specific network protocols or unencrypted
communications is an active area of research.
Some methods have applied machine learning to take attack detection a step further—predicting an attack
before it occurs. This approach recognizes that many network attacks require the establishment of infrastruc-
ture, including possibly registering one or more domain names, setting up a website, and so forth. Unusual
network attachment points (e.g., the Internet service provider that provides connectivity for a scam site) can
also be indicative of attacks or other unwanted network activity. For example, it is well understood that
the Russian Business Network, a cyber-criminal organization that provides web hosting, routinely changes
its upstream ISP to maintain connectivity in the face of shutdown. Other network information useful for
predicting attacks includes DNS features (e.g., lexical properties of the domain name, when and where the
domain was registered, the authoritative name server for the domain, etc.) and features related to the routing
infrastructure (e.g., the entities providing upstream connectivity to a group of IP addresses and how those
connectivity properties have evolved over time). The ability to detect unusual behavior by observing changes
in network infrastructure is useful because these activities typically occur before an attack takes place. Thus, a
machine learning algorithm that could detect unusual infrastructure activity could provide advanced warning
and allow network operators to prevent an attack before it occurs.
Another area where machine learning is commonly applied is the detection of compromised accounts. In
particular, attackers may attempt to compromise a user account with stolen credentials (e.g., a password
or authentication keys), and system administrators may wish to detect that such unauthorized access has
occurred. To do so, system administrators may use a variety of features including the geographic location of
the user who is logging into the account (and whether that has changed over time), the type of device being
used to log into the account, the number of failed login attempts, various telemetry-related features (e.g.,
keystroke patterns) and so forth.
Past work in this area has used machine learning techniques to detect anomalous behavior, such as a sudden
change in login behavior or location. For example, Microsoft’s Azure security team uses machine learning
to adapt to adversaries as these adversaries attempt to gain unauthorized access to accounts and systems.
The scale of the problem is significant, involving hundreds of terabytes of authentication logs, more than a
billion monthly active users, and more than 25 billion authentications per day. Machine learning has proved
effective in detecting account compromise in this environment, adapting to adversaries with limited human
intervention, as well as capturing non-linear correlations between features and labels (in this case, predicting
whether an attack has taken place). These models must take into account a variety of inputs, including
authentication logs, threat data, user login behavior, and so forth. Input features to these models include
aspects such as whether the login occurs at a normal or abnormal time of day, whether the login is coming
from a known device, application, IP address, and country, whether the IP address of the login has been
blacklisted, etc. An interesting characteristic of the application of machine learning in this particular context
is the importance of domain knowledge, both as it pertains to assessing data quality as well as in the process
of engineering and representing features to models.
If network topology, traffic, and protocols are sufficiently simple, it is possible to model the behavior of
protocols as a function of network conditions (e.g., available bandwidth and packet loss rate) in terms of
closed-form equations. The TCP throughput equations are a prime example. In realistic network scenarios,
however, it is difficult to model the behavior of a network protocol or application, let alone the user’s ex-
perience with the application, in closed form. The rise of encrypted traffic also makes it difficult to perform
certain types of analysis directly, necessitating machine learning models that can infer properties of transport
protocols or applications indirectly from various features of the network traffic.
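As a point of reference, one widely cited closed-form result of this kind is the approximation due to Mathis
et al., which relates steady-state TCP throughput to segment size, round-trip time, and loss rate.

    # The Mathis et al. approximation for steady-state TCP throughput:
    #     throughput ~ (MSS / RTT) * (C / sqrt(p)),   with C ~ sqrt(3/2),
    # where MSS is the segment size, RTT the round-trip time, and p the loss rate.
    import math

    def mathis_throughput_bps(mss_bytes=1460, rtt_s=0.050, loss_rate=0.001):
        c = math.sqrt(3.0 / 2.0)
        return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

    # 1460-byte segments, 50 ms RTT, 0.1% loss: roughly 9 Mb/s of steady-state throughput
    print(f"{mathis_throughput_bps() / 1e6:.1f} Mb/s")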
Activity: Performance
The Appendix provides an opportunity to explore a network performance application in the context of
network traffic analysis, exploring how application performance is evident in network traffic.
An increasingly active area for machine learning is application performance inference and the related area
of quality of experience (QoE) estimation. As its name suggests, application performance inference seeks
to develop models that can infer various properties of application performance based on directly observable
features. For example, within the context of on-demand video streaming, a machine learning model might
seek to infer characteristics such as video resolution or startup delay (i.e., how long it takes for the video
to start playing) from observable features like packets per second, bytes per second, and so forth. Beyond
inferring application performance, network operators may also want to predict aspects of user engagement
(e.g., how long a user has been watching a particular video) from other characteristics of observed network
traffic (e.g., packet loss, latency).
Quality of experience (QoE) and user engagement estimation can be particularly challenging given the in-
creasing extent to which network traffic is encrypted. For example, a video stream typically includes client
requests for particular video segments, which can often directly indicate the resolution of the segments; how-
ever, if the request itself is obfuscated through encryption, traffic traces will not contain direct indicators of
video quality. As a result, it is often necessary to develop new models and algorithms to infer the quality of
these streams from features in encrypted traffic. These models typically rely on features such as the rate of
arrival of individual video segments, the size of these segments, and so forth.
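The sketch below shows session-level features of the kind these models consume, assuming the arrival time
and size of each detected segment download are already available for one video session; the numbers are
illustrative only.

    # A sketch of features derived from detected video segment downloads, assuming
    # (arrival time in seconds, size in bytes) pairs for one session. Values are made up.
    import statistics

    segments = [(0.0, 1.8e6), (4.1, 2.0e6), (8.0, 1.1e6), (12.2, 2.2e6)]

    times = [t for t, _ in segments]
    sizes = [s for _, s in segments]
    interarrivals = [b - a for a, b in zip(times, times[1:])]

    features = {
        "segment_count": len(segments),
        "mean_segment_bytes": statistics.mean(sizes),
        "mean_interarrival_s": statistics.mean(interarrivals),
        "avg_throughput_bps": 8 * sum(sizes) / (times[-1] - times[0]),
    }
    print(features)   # inputs to a resolution or startup-delay model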
Machine learning models are sometimes used to identify specific network elements and applications—a
process sometimes referred to as “fingerprinting”. This process can be used to identify applications, devices,
websites, and other services. Website fingerprinting is a common application of machine learning whereby a
model takes features from network traffic as input and attempts to identify the website—or even the specific
web page—that a user is visiting. Because individual websites tend to have a unique number of objects, and
each of those objects tends to have a unique size, the combination of the number and size of objects being
transferred can often uniquely identify a website or web page. Other work has observed that sets of queries
from the Internet’s domain name system (DNS) can also uniquely identify websites.
Similar sets of features can also be used to identify network endpoints (e.g., the type of device that is con-
necting to the network), the type of service (e.g., video, web, or gaming), and even the specific service (e.g.,
Netflix or YouTube) without the benefits of direct traffic inspection. As with QoE estimation, fingerprinting
tasks have become more challenging as a result of increasing levels of traffic encryption. Yet, the features
that are available as metadata in network traffic, such as packet sizes and interarrival times, the number of
traffic flows, the duration of each flow, and so forth, have made it possible to uniquely identify applications,
devices, websites, and services even if the underlying traffic itself is encrypted.
Machine learning can also apply to scenarios where it is necessary to predict the performance of a networked
system. Predicting and evaluating the performance of networked systems with closed-form models can often
prove inaccurate, due to complex interdependencies that exist in these systems. One such area where machine
learning has proved to be useful is “what if” scenario evaluation. A network operator may want to make a
configuration change (e.g., a planned maintenance event or upgrade) but is unsure how that change may
ultimately affect the performance of the system. For example, an operator or a web search engine, commerce
site, or other online service may want to know how deploying or re-provisioning a new front-end web proxy
might ultimately affect the response time of the service.
In these cases, the complex interaction between different system components can make closed-form analysis
challenging. Thus, the ability to model service response time can make it possible to both analyze and predict
a complex result based on various input features. The challenge in such a case is ensuring that all relevant
features that could affect the prediction are captured as input features to the model. Another challenge involves
ensuring that the models are provided enough training data (typically packet traces or features derived from
packet traces) to ensure accurate predictions across sufficiently diverse network scenarios.
Machine learning models can also be used to perform traffic volume forecasting, helping operators ensure
that networks have sufficient capacity in accordance with growing traffic demands. These planning models
have been used to help network operators adequately provision capacity based on predictions that are fairly
far (e.g., several months) into the future. Cellular networks use prediction models to determine traffic volume
at individual cellular towers, as well as across the network. Similarly, fixed-line Internet service providers
use forecasting models to predict how traffic demands will grow among a particular group of subscribers or
households and thus when it may be necessary to perform additional provisioning (e.g., a “node split”).
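A deliberately simple version of such a forecasting model is sketched below: it fits a linear trend to two
years of synthetic monthly peak volumes and extrapolates six months ahead. Real capacity-planning models
account for seasonality, subscriber growth, and uncertainty, but the basic shape is the same.

    # A minimal forecasting sketch: fit a linear trend to monthly peak traffic volumes
    # (the synthetic values below stand in for real measurements) and extrapolate.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    months = np.arange(24).reshape(-1, 1)              # two years of history
    volume_gbps = 10 + 0.4 * months.ravel() + np.random.default_rng(0).normal(0, 0.5, 24)

    model = LinearRegression().fit(months, volume_gbps)

    future = np.arange(24, 30).reshape(-1, 1)          # forecast six months ahead
    print(model.predict(future))                       # projected peak demand (Gbps)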
Machine learning has also begun to be used in contexts of network resource allocation and optimization.
Resource allocation—the problem of deciding how many resources should be devoted to a particular traffic
flow, application, or process—can be performed over both short and long timescales.
• Over longer timescales, machine learning can be used to predict and forecast how demand might
change, allowing network operators to better provision network resources.
• Over shorter timescales, machine learning can be used to determine how fast an application should
send, how data should be compressed and encoded, and even the paths that traffic should take through
the network.
In these areas in particular, specific types of machine learning, including both deep learning and reinforce-
ment learning, are especially applicable. Below, we discuss some of these applications, and how machine
learning is being applied in these contexts.
Large service provider networks routinely face questions about when, how, and where to provision capacity
in their networks. These problems typically entail gathering data about past and present levels of traffic
on the network and making predictions about how traffic patterns will evolve over time. Based on those
predictions, network operators may then make changes to network configuration that result in additional
network capacity, traffic taking different routes through the network, and so forth.
In this part of the process, where operators make changes to affect provisioning and topology, machine
learning can also play a role, helping network operators answer “what if” questions about how changes to
network configuration ultimately affect other facets, such as traffic volumes along different paths, and even
application performance. For example, operators of large content distribution networks such as Google have
used machine learning to help predict how response time might be affected by the deployment of additional
web caches.
An active area in networked systems and applications today is the development of data encoding schemes
that measure network conditions and adapt encoding in real time to deliver content efficiently. Network
conditions, such as packet loss and latency, can vary over time for many reasons, such as variable traffic
loads that may introduce congestion. In such scenarios, when conditions change, the senders and receivers
of traffic may want to adapt how data is coded in order to maximize the likelihood that the data is delivered
with low latency at the highest possible quality.
Common applications where this problem arises include both on-demand and real-time video streaming. In
both cases, users want to receive a high quality video. Yet, the network may delay or lose packets, and the
receiving client must then either decode the video with the data it receives or wait for the lost data to be
retransmitted (which may not be an option if retransmission takes too long). In the past, common encoding
schemes would measure network conditions such as loss rate and compute a level of redundancy to include
with the transmission in advance. Yet, if that encoding was done ahead of time and the network conditions
change again, then inefficiencies can arise—with either excess data being sent (if the loss rate is less than
planned) or the client receiving a poor quality stream (if the loss rate is higher than planned). Modern video-
on-demand systems pre-encode at a range of rates and allow the client to select among those rates based on
network conditions, but this leaves open the questions of what the optimal encoding rate is and how the
client should decide which rate to request.
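For concreteness, the sketch below shows the kind of simple throughput-based selection rule a streaming
client might use to pick among pre-encoded rates; the bitrate ladder and safety factor are illustrative. The
reinforcement learning approaches discussed next aim to learn such a policy rather than hard-coding it.

    # A sketch of a simple client-side rate-selection rule: pick the highest
    # pre-encoded bitrate that fits within a conservative estimate of recent
    # throughput. The ladder values and safety factor are illustrative.
    BITRATE_LADDER_KBPS = [235, 750, 1750, 4300, 8000]   # available encodings

    def select_bitrate(recent_throughputs_kbps, safety_factor=0.8):
        estimate = min(recent_throughputs_kbps) * safety_factor   # be conservative
        feasible = [b for b in BITRATE_LADDER_KBPS if b <= estimate]
        return feasible[-1] if feasible else BITRATE_LADDER_KBPS[0]

    print(select_bitrate([5200, 4800, 6100]))   # -> 1750 (conservative estimate 3840 kb/s)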
Emerging machine-learning approaches use techniques like reinforcement learning to determine how to en-
code a video stream so that the client can receive a high-quality stream with low latency, even as network
conditions change. In addition to changing encoding rates, a service can also adapt its sending rate in
response to changing network conditions (a process known as congestion control). Research has shown that,
in some circumstances, a sender can dynamically adjust its sending rate in ways that optimize a pre-defined
objective function. While some early incarnations have applied game theory to achieve this optimization,
there are fundamental questions about the learnability of congestion control algorithms, and to what extent
they can out-perform manually designed ones.
CHAPTER 3
Managing networks requires gathering information that can be used to drive decisions about how to adapt
to changing conditions. Doing so requires gathering information that can be used as input to the decision-
making process, which includes models that perform prediction and inference about the state of the network
and the applications running on it. This process, often referred to as data acquisition in machine learning,
has a longstanding tradition in networking, and is often referred to as simply network measurement. There
are many ways to gather information about the network, which we will discuss in this chapter. Ultimately,
gathering data with the intent of providing input to machine learning models can be an involved process,
since the data must be collected efficiently and in a way that results in effective training for the models.
There are many reasons to measure networks and many ways of gathering these measurements. As we
discussed in the previous chapter, networks can be measured for performance diagnosis and troubleshooting,
security, forecasting, and many other use cases. Before we can talk about models, we first need to discuss
the different ways that network data can be collected. Internet measurement itself is rapidly evolving, as
new tools and capabilities emerge. In this chapter, we will explore both conventional techniques for Internet
measurement, as well as emerging techniques to measure and gather data from networks. We will also
explore how emerging technologies are both expanding the types of measurements that can be gathered from
networks and making it possible to gather new types of measurements. We will also talk about the process
of transforming the raw data into a form that can be used for training machine learning models—in other
words, going from raw measurement data to aggregate statistics and ultimately to features in a model.
We will explore network measurement in the context of two different types of measurement: (1) active
measurement, which introduces new traffic into the network to measure various network properties (from
performance to topology); and (2) passive measurement, which captures existing traffic on the network to
infer information about traffic that is already being sent across the network by existing applications and
services. Each of these types of measurement has various classes of tools to measure certain properties,
as well as various considerations. A simple way to think about these two classes of measurement is that
active measurements can directly measure certain properties of a specific network path (e.g., its capacity,
latency, application properties in some cases, or even the path itself) but cannot provide direct insight into
the circumstances that a particular user or application may be experiencing; passive measurements, by
contrast, observe the traffic that users and applications actually send, but only where that traffic can be
captured.
We’ll then explore these two types of measurement in more detail, explaining how each technique works, and
what it can (and cannot) measure. Along the way, we’ll dive into the details with a few examples and hands-
on activities, introducing techniques for processing raw network data into aggregate statistics, as well as
software libraries that can perform some of these transformations. Before getting into specific measurement
techniques, let’s talk a little bit about what can be measured in the first place.
There is an entire annual conference (SIGCOMM’s Internet Measurement Conference) dedicated to the topic
of measurement. One of the foundational papers on Internet Measurement is by Vern Paxson.
Further Reading
V. Paxson. Strategies for Sound Internet Measurement. SIGCOMM IMC, October 2004.
There are a wide variety of metrics that can be obtained from network data. We highlight some of the
common metrics below. Typically, these metrics can be obtained directly, either by injecting traffic into the
network and observing behavior (active measurement) or by collecting existing network traffic and computing
statistics about what is being observed (passive measurement).
Of course, in machine learning, the goal is inference, which means that the metrics that can be collected
directly, which we discuss below, are often used as inputs (or “features”) to a machine learning model to pre-
dict or infer higher-level phenomena or events, from application quality of experience to attacks. Below, we
highlight some of these metrics, and discuss which of these metrics can be obtained by active measurement,
passive measurement, or both.
• Throughput concerns the rate at which data can be transferred, that is, how much data can be moved
in a given amount of time. It is typically measured in bits per second (or some multiple of bits per
second, such as gigabits per second).
Measuring throughput is important because it indicates how much data a network is capable of mov-
ing between two network endpoints within a fixed time window. For much of the Internet’s history,
throughput was the predominant determining factor for a user’s experience (and remains so in many
environments, particularly mobile wireless networks). As a result, it is the dimension of performance
that is most commonly referred to and benchmarked against when considering the performance of con-
sumer Internet access. For example, you might buy a “1 Gigabit” service plan from your ISP, which
typically indicates that you can expect a downstream (i.e., from a server somewhere on the Internet to
your home) throughput of approximately 1 gigabit per second (1 Gbps).
• Latency concerns the amount of time it takes a unit of data (e.g., a packet) to reach an Internet des-
tination. Latency is measured in units of time, typically milliseconds. The speed of light is a fundamental
cause of latency, but because Internet traffic is multiplexed in queues, when traffic congestion occurs,
time that packets sit in buffers can also introduce additional latency. Latency is important for many
applications, and can be a differentiator in performance between networks, particularly in the presence
of network equipment with large buffers that can introduce delays when those devices become bot-
tlenecks. The metric is increasingly important and relevant in the context of interactive applications,
such as video conferencing, gaming, and even browsing the web. A related important metric is latency
under load, which is the latency experienced by packets when the network is carrying other traffic that
saturates the capacity of a link, even if briefly (e.g., starting a streaming video, uploading a photo,
sending an email, browsing the web).
• Jitter reflects the change in latency over time—it is effectively the first derivative of latency, or the
change in packet inter-arrival time. As with latency, it is measured in seconds (or milliseconds). Many
applications, especially interactive ones such as voice, video, and gaming, expect network traffic to
arrive at regular intervals. When traffic does not arrive at such intervals, either the user experience
is degraded, or the application must introduce a larger buffer to smooth out variation in packet inter-
arrival (thus converting jitter to latency).
• Packet loss concerns the fraction of packets that fail to reach their intended destination. Packet loss is
generally detrimental to performance. Transport protocols such as the Transmission Control Protocol
(TCP) interpret loss as congestion and slow down as a result of observing packet loss. Lost packets are
retransmitted leading to increased delay. Packet loss is also likely to be detrimental to user experience,
as lost packets represent lost data—which could correspond to frames in a video, words in a voice call,
and so forth.
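To make these definitions concrete, the sketch below computes latency, jitter, and loss from a hypothetical
set of probe measurements, where each probe is a (send time, receive time) pair and a missing reply marks
a lost packet.

    # A sketch of computing latency, jitter, and loss from probe measurements,
    # assuming hypothetical (send_time, recv_time) pairs in seconds, with a
    # recv_time of None marking a lost probe. Values below are illustrative.
    probes = [(0.0, 0.021), (1.0, 1.024), (2.0, None), (3.0, 3.030), (4.0, 4.019)]

    rtts = [recv - send for send, recv in probes if recv is not None]
    loss_rate = sum(1 for _, recv in probes if recv is None) / len(probes)
    jitter = sum(abs(b - a) for a, b in zip(rtts, rtts[1:])) / (len(rtts) - 1)

    print(f"mean RTT   {1000 * sum(rtts) / len(rtts):.1f} ms")
    print(f"jitter     {1000 * jitter:.1f} ms")
    print(f"loss rate  {100 * loss_rate:.0f}%")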
All machine learning pipelines begin with the collection of relevant data. Data can come from many different
sources and can vary wildly in complexity. Most importantly, the data you gather, and how you represent
it through collections of features, should accurately represent the underlying phenomena you are trying to
model. Machine learning algorithms learn primarily through inductive reasoning, i.e., they learn general
rules by observing specific instances. If the specific instances in your data are not adequately representative
of the general phenomenon you are trying to model, the ML algorithm will not be able to learn the right
general rules and any predictions based on the model are less likely to be correct. Thus, decisions about
what data to acquire and how to acquire it are often critical to whether a model is accurate in practice, as
well as whether the model is efficient and robust enough to be deployed.
In this respect, domain knowledge is particularly important in the context of machine learning. For example,
knowing how the patterns of malicious web requests sent as part of a denial of service attack or vulnerability
scan differ from those of legitimate web requests is essential for thinking about the types
of features that may be useful in a model. More generally, when designing machine learning pipelines for
specific tasks, domain knowledge about the problem you are trying to model can increase the likelihood that
the model you design will work well in practice.
When beginning a ML pipeline, it is common to ask, “how much data is necessary to achieve high-quality
results?” Unfortunately, answering this question is not straightforward, as it depends on the phenomenon
you are trying to model, the machine learning algorithm you select, and the quality of the data you are able
to gather. Conceptually, you will need to collect enough data such that all of the relevant variability in the
underlying phenomenon is represented in the data.
From the perspective of model accuracy, identifying features of the traffic that are likely to result in accurate
predictions often requires knowledge of the underlying phenomena; from the perspective of practical de-
ployment, considerations for data acquisition can also go beyond accuracy, because some features are more
costly or cumbersome than others. For example, packet traces may contain significantly more information
than statistical summaries of network traffic, yet packet traces are orders of magnitude larger, resulting in
higher capture overhead, larger memory and storage costs, longer model training time, and so forth. On the
other hand, summary statistics are more efficient to collect and store, and could reduce training time, yet they
may obscure certain features or characteristics which could ultimately degrade model accuracy. Determining
which feature subsets strike the appropriate balance between model accuracy and systems-related modeling
costs is a general, practical challenge, and choosing the best ways to optimize for this tradeoff remains an
open problem.
As we noted above, there are two primary ways to measure networks: active measurement and passive mea-
surement. We will now discuss the two types of measurement in more detail, and explore the different ways
that each type of measurement can be used to gather data.
One way to measure the characteristics of a network is to send traffic across the network towards a particular
endpoint and observe the characteristics of those measurements in terms of the metrics that we discussed
in the previous section. Specifically, one can measure throughput, latency, jitter, packet loss, and other
properties of a specific Internet path by sending traffic across the network and observing these metrics of the
traffic along that path.
Active measurement is a broad topic. It refers to any type of measurement that involves sending traffic into
the network—sometimes referred to as “probes”—and observing how different network endpoints respond
to these probes. As far as gathering network traffic data is concerned, active measurements involve various
tradeoffs. On one hand, conducting active measurements does not typically involve privacy considerations,
since it does not involve capturing user traffic that may include sensitive information (e.g., the websites that
a user is visiting). Additionally, because active measurements can be run from any network endpoint, they
are typically easier to gather than passive measurements (which can often require the installation of specific
passive network traffic capture infrastructure). On the other hand, active measurements are themselves a load
on the network, and so may affect the performance characteristics we are trying to measure.
There are many different types of active measurements one could perform—too many types to cover in this
chapter. We will, however, have a brief look at three types of active Internet measurements because they are
so common:
1. Internet “speed tests”
2. Domain name system (DNS) queries and responses;
3. Probing and scanning.
These three classes of active measurements are not exhaustive, but they do capture many cases of active
measurement research and practice. We give a brief overview of each in this chapter and offer pointers to
more detailed treatment of this material elsewhere.
Types of Active Measurement
Internet speed tests. One of the canonical application performance measurements is the Internet “speed
test”, which typically refers to a test of application throughput. Users commonly measure throughput with a
client-based “speed test”; a common example is speedtest.net, operated by Ookla. While the detailed designs
of these tests may vary, the general principle of these tests is conceptually simple: they typically try to send
(or receive) as much data as possible to (or from) some Internet destination and measure the time
taken to transfer a particular amount of data over some window. The tests themselves often involve sending a
fixed amount of data and measuring the transfer time for that data. Measuring the time it takes for a client to
send data to some Internet server corresponds to “upload speed”; measuring the time for a client to receive
data from that server corresponds to the “download speed”. Naturally, one critical design choice for an active
measurement like a speed test is where to measure to or from; this design consideration is outside the scope
of this book, but it is important to note that the quality of the measurements can depend significantly on the
choice of endpoint to measure.
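Conceptually, the core of a download speed test fits in a few lines, as the sketch below shows; the test URL
is a placeholder rather than a real measurement server, and real tests add parallel connections, warm-up
periods, and careful server selection.

    # A minimal download-throughput sketch: time the transfer of a test object and
    # divide bytes by elapsed time. The URL is a placeholder, not a real test server.
    import time
    import urllib.request

    TEST_URL = "https://speedtest.example.net/100MB.bin"   # hypothetical endpoint

    start = time.monotonic()
    with urllib.request.urlopen(TEST_URL) as response:
        data = response.read()                             # download the whole object
    elapsed = time.monotonic() - start

    print(f"{8 * len(data) / elapsed / 1e6:.1f} Mb/s downstream")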
In addition to Ookla, many other organizations operate speed tests, including Measurement Lab’s Network
Diagnostic Test (NDT), Fast.com (operated by Netflix), and the Federal Communications Commission’s
Measuring Broadband America program (which operates a router-based speed test from SamKnows). The
design of Internet speed tests has been varied and contentious over the years; it is the topic of many research
papers and outside the scope of this book. We refer the interested reader to related research. Important
for the purposes of analysis and machine learning, however, is that these tests often produce two numbers,
corresponding to throughput in each direction along a network path, typically measured in megabits per
second (or gigabits per second). With the increasing acknowledgement that other metrics are important for
understanding network performance (as discussed above), Internet “speed tests” are evolving to include and
report on those metrics as well.
A common way to measure latency is using the ping tool, which sends a continuous stream of small packets
towards a particular Internet destination, waits for the response, and measures how much time that request-
response round trip required. Although ping is a very common tool, it uses a protocol called the Internet
control message protocol (ICMP), which can sometimes be blocked by security firewalls and is also some-
times treated completely differently by on-path Internet routers (e.g., prioritized or de-prioritized). Thus,
although active measurements are always susceptible to being non-representative of user traffic and expe-
rience, ICMP traffic runs this risk even more. As such, an alternative (and common) latency measurement
involves sending TCP SYN packets and waiting for the SYN-ACK reply. These transport-level probes al-
low for sending traffic on ports that are often not filtered by firewalls and more likely to be processed by
routers as normal traffic would be. Another commonly used active measurement tool is traceroute, which
sends TTL-limited probes to attempt to measure an end-to-end Internet path. As with ping, the nature of how
traceroute is implemented introduces various aspects of uncertainty: the probes or the replies may be blocked
by middleboxes, and the forward and reverse paths may not be symmetric, leading to various measurement
inaccuracies. We refer the reader to related research to understand the limitations of traceroute.
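The sketch below illustrates a transport-level latency probe of this kind: it times TCP connection
establishment (the SYN/SYN-ACK exchange) to a host and port, which approximates one round trip. The host
and port are placeholders.

    # A sketch of a TCP-level latency probe: time how long connection setup (the
    # SYN / SYN-ACK exchange) takes, which approximates one round trip.
    # The host and port below are placeholders.
    import socket
    import time

    def tcp_rtt_ms(host="www.example.com", port=443, timeout=2.0):
        addr = socket.getaddrinfo(host, port)[0][4]   # resolve first, so DNS time is excluded
        start = time.monotonic()
        with socket.create_connection(addr[:2], timeout=timeout):
            pass                                      # close right after connecting
        return 1000 * (time.monotonic() - start)

    print(f"{tcp_rtt_ms():.1f} ms")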
Domain Name System (DNS) queries and responses. The performance of many Internet applications ul-
timately depends on the performance of the Domain Name System (DNS), the system that maps Internet
domain names to IP addresses, among other tasks. Various tools have been developed to measure the per-
formance and characteristics of the DNS, including how DNS relates to application performance (e.g., web
page load time), and tests that can detect whether DNS responses have been manipulated (e.g., by an Internet
censor or other adversary). With the rise of encrypted DNS protocols (e.g., DNS over HTTPS), there has
been increasing attention into the performance of these protocols. As such, new tools have been developed
and released to measure both the performance characteristics and behavior of various DNS resolvers.
Internet scans. One of the more established (and general) ways of performing active Internet measurements
is to perform a so-called Internet “scan”, which typically involves sending a small amount of traffic (some-
times a single, specially crafted packet) to a single network endpoint and measuring the response. Internet
scans could include simple measurements like ping (to measure uptime, or as a simple latency measurement)
though commonly they can involve crafting more sophisticated probes that are designed to elicit a particular
response. A common type of probe is to send a TCP SYN packet (i.e., the first packet in a TCP connec-
tion) to a list of “target” IP addresses and observe how these endpoints respond, in order to perform some
kind of conclusion or inference. Such techniques and variants have been performed to conduct Internet-wide
characterizations of Internet endpoints (e.g., uptime of network endpoints, operating systems of endpoints).
Sophisticated target scans can also be used as active DNS measurements: the Internet contains so-called
“open DNS resolvers” that respond to remote active probes, and sending such probes and measuring how
resolvers respond can also be useful in characterizing certain aspects of DNS across the Internet. While the
most common tools for scanning the Internet are “ping” and “traceroute”, a wide array of scanning tools has
been developed, including nmap and more recently zmap.
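As a concrete (if minimal) sketch of such a probe, the following uses scapy to send a single TCP SYN and time the SYN-ACK reply. The target address and port are illustrative placeholders; crafting packets this way generally requires elevated privileges, and you should only probe endpoints you are authorized to measure.

import time

from scapy.all import IP, TCP, sr1

target = "192.0.2.1"   # illustrative address (TEST-NET-1); replace with an authorized target

start = time.time()
reply = sr1(IP(dst=target) / TCP(dport=443, flags="S"), timeout=2, verbose=False)
rtt = time.time() - start

if reply is not None and reply.haslayer(TCP) and reply[TCP].flags.S and reply[TCP].flags.A:
    print("SYN-ACK received from {} in {:.1f} ms".format(target, rtt * 1000))
else:
    print("No SYN-ACK (probe filtered, port closed, or timed out)")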
Applications of Active Measurement
Active measurements can yield information about network endpoints, or about paths between endpoints.
As previously mentioned, one common approach to gathering data through active network measurements is
through a scan, whereby a single (or small number) of machines on the network send traffic to other network
endpoints in an attempt to elicit a response from those endpoints. The response typically corresponds to the protocol traffic that was sent to the device (e.g., a reply to an ICMP ping, a response to a TCP SYN packet, a response to an HTTP(S) request, a response to a TLS handshake request). The measurement device that initiated the active measurement collects these responses, and the existence (or lack thereof), format, structure, or content of those responses can then be used as features in supervised or unsupervised machine learning algorithms for a variety of tasks.
A common application of machine learning to network data gathered from active measurements is finger-
printing. Just about anything in the network can be fingerprinted, or identified—examples include network
infrastructure (e.g., routers, middleboxes), device types (e.g., laptop, desktop, mobile), operating system,
TLS version, and so forth. One classical network fingerprinting tool is called nmap; the tool scans another
network endpoint and, based on the format of reply packets, can then determine with reasonable accuracy
the operating system (and OS version) of the endpoint. The conventional version of nmap used a static set
of rules to determine the likely operating system using features that were discovered and encoded by hand
(e.g., TCP options, congestion control parameters). More recently, research has demonstrated that, given
operating system labels and TCP SYN-ACK packets (i.e., responses to the initial part of the TCP three-way
handshake), it is possible to train machine learning models that can automatically discover the operating
system of a network endpoint, to a much greater degree of accuracy than nmap.
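To make the idea concrete, here is a minimal sketch (not nmap's implementation, and not any published research model) of how SYN-ACK replies in a capture might be turned into header-derived features for such a classifier; the field choices and file name are illustrative.

import pandas as pd
from scapy.all import IP, TCP, rdpcap

def synack_features(pkt):
    # Turn one SYN-ACK reply into a small set of header-derived features.
    opts = {name: val for name, val in pkt[TCP].options}
    return {
        "initial_ttl": pkt[IP].ttl,              # initial TTL tends to vary by operating system
        "window_size": pkt[TCP].window,          # advertised receive window
        "mss": opts.get("MSS", 0),
        "sack_permitted": int("SAckOK" in opts),
        "timestamps": int("Timestamp" in opts),
        "window_scale": opts.get("WScale", -1),
    }

packets = rdpcap("synack_replies.pcap")          # illustrative capture of SYN-ACK replies
rows = [synack_features(p) for p in packets
        if IP in p and TCP in p and p[TCP].flags.S and p[TCP].flags.A]
X = pd.DataFrame(rows)

Paired with ground-truth operating system labels for the probed endpoints, a feature table like this can be used to train any of the classifiers discussed later in the book.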
Other forms of active measurement could ultimately be amenable to machine learning, although these areas are somewhat less explored. One particularly promising application area is anomaly detection: specifically, unsupervised learning techniques to detect unusual, actionable changes
in network performance. For example, network operators commonly perform active measurements to mea-
sure network characteristics such as availability (i.e., uptime), round-trip latency, and so forth. Due to the
packet-switched nature of the Internet, some amount of latency variation and packet loss is normal, and to be
expected. However, more unusual or severe latency aberrations might warrant operator attention. To date, it
is less well-understood how to automatically examine longitudinal latency performance measurements and
1. reliably determine when an aberration requires operator attention vs. being part of normal network
operations or conditions (e.g., buffering);
2. determine the underlying cause of the latency problem (and, in particular, whether the source of that
latency aberration is due to a problem within the ISP network, the user’s home network, an endpoint
device, or some other cause).
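As a starting point for the first of these questions, one simple unsupervised approach (a sketch only, with an illustrative window size and threshold) is to flag measurements that deviate sharply from a rolling baseline:

import pandas as pd

def flag_latency_aberrations(rtt, window="1h", threshold=4.0):
    """Flag RTT samples that deviate sharply from a rolling baseline.

    rtt is assumed to be a pandas Series of round-trip times indexed by timestamp;
    the window and threshold would need tuning for any real deployment.
    """
    baseline = rtt.rolling(window).median()
    spread = rtt.rolling(window).std()
    score = (rtt - baseline) / spread
    return rtt[score.abs() > threshold]   # candidate aberrations for operator review

A sketch like this addresses only the detection step; deciding whether an aberration is actionable, and localizing its cause, remain the harder open questions noted above.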
In contrast to active measurement, which introduces new traffic into the network to measure performance,
passive measurement monitors existing traffic in the network and infers properties about the network from
the measured traffic. Passive measurement has the characteristic (and, in many ways, the advantage) that the resulting data corresponds to actual traffic traversing the network (e.g., a user's actual web browsing or streaming video session), and thus the captured traffic may more accurately reflect the properties, behavior, and performance of applications and the experience of users.
Types of Passive Measurement
Packet-level monitoring. The most universal form of passive measurement is packet capture, sometimes also
called pcap for short. Packet capture is essentially the equivalent of “listening in” on a network conversation,
akin to a wiretap of sorts. A packet capture will (as the name suggests) capture all bytes that traverse a link in the network. Common programs that perform packet capture are tcpdump and wireshark. These programs not only capture the raw bytes as they traverse the network, but also parse the bytes into a
format that humans (or programs) can read. For example, these programs are equipped with libraries that
can parse protocols like Ethernet, TCP, IP, and a variety of application protocols, making it possible to read
and understand traffic as it traverses the network.
Packet capture does introduce a few tradeoffs. One disadvantage of pcaps is that they can be quite voluminous,
particularly on high-speed links. The volume of packet captures can make it difficult not only to capture all of
the packets as they traverse the network, but also to store, transfer, and later process the data. For this reason,
packet captures are sometimes truncated—capturing only the packet headers, as opposed to the complete
packet with its payload.
Flow-level monitoring. Other aggregated representations of packet captures are often used to reduce volume,
storage, or processing overhead. One common example of aggregation is IPFIX (commonly also known as
NetFlow, which is the Cisco-proprietary format on which the IPFIX standard was based). IPFIX reports only
aggregate statistics for each flow in the network, where a flow is defined by a unique five-tuple—source and
destination IP address, source and destination port, and protocol—as well as a time window of activity (e.g.,
continuous transfer without an interruption for more than a certain time interval, as well as a maximum time
length). For each flow, an IPFIX record contains salient features, such as the start and end time of the flow,
the five-tuple of the flow, the number of packets and bytes in the flow, any TCP flags that may have been
observed over the duration of the flow, and the router interface on which the traffic was observed. IPFIX has
the advantage of being more compact than packet captures, but the format does not represent certain aspects
of network traffic that can sometimes be useful for statistical inference and machine learning. For example,
IPFIX does not represent individual packet interarrival times, the sizes of individual packets, short-timescale
traffic bursts, and so forth.
For many years, packet captures were prohibitively expensive, and even complete IPFIX records were often
too voluminous, leading to “sampled IPFIX” or “sampled NetFlow”, probabilistic samples of traffic that
traversed a router interface. On high-speed links in transit networks, these representations are still common.
Yet, increasingly, and especially at the network edge, packet capture at very high rates is now possible.
Technologies such as eBPF, PF_RING, and other libraries now make it possible to capture packet traces
at speeds of more than 40 Gbps, for example. These new technologies are causing us to reconsider how
network traffic might be captured—and represented—for machine learning models. For decades, machine
learning models for networking tasks, from performance to security, had to rely in large part on IPFIX-based
representations as input. Now, these new capture and monitoring technologies are making it possible to
capture raw data at very high rates—and causing us to rethink what the “best” representations might be for
a wide array of networking tasks that might rely on machine learning.
Programmatic Interfaces. Analyzing packet captures programmatically can require sophisticated libraries and APIs. While many of these programmatic options are typically restricted to running on general-purpose computing devices (e.g., servers, middleboxes), as opposed to network-specific hardware, software-based options provide significant flexibility in manipulating and analyzing packet captures, including generating features that can be used for machine learning models.
A good place to become familiar with low-level software packet capture libraries, particularly if you are
new to the area, is Python’s scapy library. scapy will get you going quickly, but its performance may leave
something to be desired if you wish to conduct real-time measurements, parsing, and analysis. Other options
have a steeper learning curve; these include eBPF, dpkt, and PF_RING. We have built on these libraries in
a tool and accompanying Python library, netml, which we will use to conduct many of the exercises in this
book.
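As a small illustration of what getting started with scapy looks like, the following sketch reads a capture file and prints a few header fields from the first several packets; the file path is illustrative.

from scapy.all import IP, TCP, rdpcap

packets = rdpcap("http.pcap")        # illustrative path to a packet capture
for pkt in packets[:5]:              # inspect the first five packets
    if IP in pkt and TCP in pkt:
        print(pkt[IP].src, pkt[TCP].sport, "->", pkt[IP].dst, pkt[TCP].dport, len(pkt))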
Applications of Passive Measurement
Beyond the metrics that we outlined above, there is also a desire to measure what is known as quality of
experience, or QoE. QoE has several connotations depending on the application—it can be used to refer to
application quality metrics such as video resolution, startup delay, or re-buffering in the case of streaming
video, for example. More generally, it can also refer to a user's overall satisfaction when interacting with a particular application; a common metric for this type of QoE is a mean opinion score (MOS), which has its roots in the telephony world. Sometimes, we use the term “application QoE” to differentiate application quality metrics from general user experience.
Tradeoffs: Active vs. Passive Measurement
Active measurements may be convenient to implement but have their own drawbacks:
1. they require measuring to a particular network endpoint, which in certain cases (e.g., a speed test,
an active measurement of network performance) requires setting up a server on the other end of the
measurement;
2. they involve introducing additional traffic into the network, which may change the performance of the
network and in some cases may be prohibitive;
3. because the measurements are synthetic, the measurements may not be representative of what a user
actually experiences.
As noted above, there is also an increasing desire to measure the quality of specific applications, such as
web browsing, video streaming, real-time video conferencing, and so forth. One way of assessing the quality
of these applications is using active measurements that emulate the behavior of these specific applications.
For example, the Federal Communications Commission (FCC) incorporates active measurements for web
performance and video streaming performance for specific services into their Measuring Broadband America
(MBA) test suite.
There are several challenges with using application-specific measurements to assess application performance. First, a custom probe must be designed for each application, and the probe must mimic that application's behavior closely enough to measure its performance accurately. Furthermore, accurate performance measurement of each application also requires appropriate measurement infrastructure, such as servers, located in the appropriate locations (e.g., content delivery networks). As a result of these complications, a preferred
method for measuring application performance is to passively measure the traffic of the respective applica-
tions and infer the application quality. This endeavor turns out to be both a measurement problem and a
machine learning problem! We will discuss applications of machine learning to passively collected network
traffic in subsequent sections.
Now that we understand the different forms that network data (and, in particular, network traffic measure-
ments) can take, we turn to understanding some of the tools that enable the capture of network traffic data,
as well as how we can use the data captured using those tools to perform analysis and prediction.
We defer specific discussion of machine learning pipelines to the next chapter, since the application of ma-
chine learning models to data presents its own set of challenges. Rather, we will focus on common ways
to analyze network traffic using familiar software tools. In this book, we will focus on the use of Python to
perform this type of data analysis. Although similar analysis is possible in other frameworks and languages,
many of the tools that we will explore in this book are primarily available through Python APIs, so Python
is a good place to start.
A common tool to analyze passive traffic measurements is Wireshark. Wireshark is an application that can
be installed on many common operating systems, including Mac, Windows, and Linux. Underlying the
Wireshark tool itself is a library called libpcap, which provides an interface between the network interface
card and the software applications that read data from packets into memory and process them.
Figure 3.1 shows an example analysis from Wireshark. Each row in the application represents a single packet,
and each column represents a particular field. Clicking on any particular row will yield additional information
in the bottom half of the split screen. The information is arranged according to the packet’s “layers”, with
the Ethernet information at the top, followed by the IP information, the TCP information, and finally any
information that might be contained in the payload. Because much modern Internet traffic is encrypted, most
of the information that we will use (and apply) with machine learning examples in this book will come from
the header information.
Taking a closer look at the fields in each packet, certain information turns out to be frequently useful for a wide variety of machine learning tasks. Relevant features include, for example, the length of each packet and the time at which that packet was observed. Even before diving into further detail, observe that these two pieces of
information alone allow for a wide range of processing options for generating features. For example, packet
lengths can be represented individually, as sequences, as summary statistics, as distributions, and so forth.
Combined with timing information, it is possible to compute various traffic rates (e.g., packets per second,
bytes per second over various time intervals).
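As a sketch of this kind of feature generation, suppose pdf is a dataframe of packets with datetime and length columns (like the one produced by netml's pcap2pandas, shown later in this section); per-second traffic rates can then be computed with pandas.

import pandas as pd

# pdf is assumed to have one row per packet, with 'datetime' and 'length' columns.
pdf["datetime"] = pd.to_datetime(pdf["datetime"])
per_second = pdf.set_index("datetime").resample("1s")["length"].agg(["count", "sum"])
per_second.columns = ["packets_per_second", "bytes_per_second"]
print(per_second.head())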
The Appendix provides an activity that allows you to explore the information that can be derived from
passive measurement of network traffic, specifically packets captured from the network. You will load
the data into a Pandas dataframe, and then explore the data to answer a number of high-level questions
about the data.
Combining these basic statistics with additional information from the packet header makes it possible to generate the same kinds of features for individual flows. Specifically, individual traffic flows are typically defined by five values in the packet header, sometimes called a 5-tuple: source and destination IP address, source and destination port, and protocol. Deriving the aforementioned statistics over the packets that belong to the same “flow” (i.e., that share a common 5-tuple) makes it possible to generate and analyze statistics across flows, or even groups of related flows. The tools that generate this type of flow statistics generally operate
on network devices, such as routers and switches. For example, on Cisco routers, the software that is used to
produce flow records is commonly referred to as NetFlow. As noted above, the standard flow record format
based on NetFlow is called IPFIX.
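A minimal sketch of this kind of flow-level aggregation with pandas, using the per-packet dataframe from above (grouping here on the address and port columns; a protocol column, if present, would normally be part of the key as well):

# pdf: the per-packet dataframe produced earlier in this chapter (assumed)
flow_keys = ["ip_src", "port_src", "ip_dst", "port_dst"]   # add the protocol column if available
flows = (
    pdf.groupby(flow_keys)["length"]
       .agg(packets="count", total_bytes="sum", mean_len="mean", std_len="std")
       .reset_index()
)
print(flows.head())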
Given network traffic gathered using packet capture tools, the next step is to convert the captured data into a
format that can be used for further analysis, including eventual input into machine learning models.
One common format for analyzing data is the Python library Pandas. Pandas provides a data structure called a dataframe, which is a two-dimensional structure that can be used to store data in a tabular format. The columns of the dataframe are analogous to the columns in a spreadsheet, and the rows are analogous to the rows in a spreadsheet. The data in each column is of the same type, although different columns may hold different types, so a single row may mix types across its columns.
The example below shows a Pandas dataframe that contains information about the packets observed in a particular packet capture (including, for example, their lengths and timestamps). We have used the pcap2pandas function in the netml library to extract various information from the packet capture and store it in a Pandas dataframe. The function produces a set of columns, some of which are shown below; we have printed the first ten rows of the dataframe, with some of these columns extracted, to illustrate basic use of the library.
import pandas as pd
from netml.pparser.parser import PCAP   # import path as given in the netml documentation

pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()
pdf = pcap.df        # the parsed packets as a Pandas dataframe
pdf.head(10)
After data acquisition, the next step is preparing the data for input to train models. Data preparation in-
volves organizing the data into observations, or examples; generating features and (in the case of supervised
learning) a label for each observation; and preparing the dataset by dividing the data into training and test
sets (typically with an additional sub-division of the training set to tune model parameters).
Many supervised machine learning problems represent the training examples as a two-dimensional matrix.
Each row of the matrix contains a single example. Each column of the matrix corresponds to a single feature
or attribute. This means that each example (row) is a vector of features (columns). This definition is quite flexible and works for many networking problems at different scales. For example, each example might correspond to an IPFIX flow record, with features corresponding to the number of packets in the flow, the start
and end times of the flow, the source and destination IP addresses, etc. In a different problem, each example
might correspond to a single user, with features representing average throughput and latency to their home
router in 1-hour intervals. This flexibility is a boon, because it allows developers of machine learning algo-
rithms to work with a consistent mathematical formulation—a matrix (or table) of examples and features. In
this book, we will generally refer to a matrix of training examples as X. Sometimes in code examples we will
also use features to refer to the matrix of training examples.
The representations of network traffic that are most effective for prediction and inference problems in networking are typically unclear a priori. Representation choices typically involve various statistics concerning traffic volumes, packet sampling rates, and aggregation intervals; yet, which of these transformations retain predictive information is usually unknown at the outset. One example challenge is choosing the granularity at which to aggregate a feature: for instance, if we decide to measure and represent the number of packets per second as a feature, then changes in traffic patterns at the millisecond level might go undetected; on the other hand, recording at a finer granularity produces higher-dimensional observations, which introduce challenges for both scale and accuracy. (The so-called curse of dimensionality in machine learning is the observation that the higher the dimension of the input, the harder it is to achieve good accuracy; in particular, the amount of training data required may grow exponentially with the input dimension.)
The process of feature extraction from raw network traffic affects the accuracy of a variety of models. There
are many possible ways to represent raw traffic traces (i.e., packets) as inputs. The data itself can be ag-
gregated on various time windows, elided, transformed into different bases, sampled, and so forth. In the
following example, we focus on representations of network traffic data that do not incorporate semantics from the headers, such as IP addresses, port numbers, sequence numbers, or other specific header values. Representations that are agnostic to specific header values lead to models that are trained to recognize novel behavior regardless of the specific source, destination, application, or location in the network. By aiming for generality, we can potentially achieve results that are applicable to a range of problems.
netml is a network anomaly detection tool and library written in Python. It consists of two sub-modules: (1) a pcap parser that produces flows using Scapy/dpkt and (2) a novelty detection module that applies novelty (anomaly) detection algorithms. The library takes network packet traces as input, transforms them into various data representations, and can be used as a Python library or invoked directly from the command line.
The Appendix provides an initial opportunity to use the netml library to extract features from a network
trace. The netml library is publicly available on GitHub as an open-source Python library. It has been
packaged on PyPi for easy installation.
The netml software package includes a Python library, the netml command-line utility, and supporting code for various utilities (e.g., testing). netml is a useful tool for converting network traffic captures into features that can be used for various network analysis tasks. It has been designed, implemented, and released as an open-source tool to help researchers and practitioners in the field of network security.
Below is an example of the netml library's capability for converting passive traffic measurements into 12 different features for each traffic flow. The example shows the features associated with the first three flows in the packet trace. Those features include the flow duration, the number of packets sent per second, the number of bytes per second, various statistics on packet sizes within each flow (mean, standard deviation, inter-quartile range, minimum, and maximum), the total number of packets, and the total number of bytes. The netml library supports a variety of different feature extraction methods.
import pandas as pd
from netml.pparser.parser import PCAP   # import path as given in the netml documentation

pcap = PCAP(
    '../examples/notebooks/data/http.pcap',
    flow_ptks_thres=2,
    random_state=42,
    verbose=10,
)
pcap.pcap2flows()
features = pd.DataFrame(pcap.features)
features.head(3)
pcap_file: ../examples/notebooks/data/http.pcap
ith_packets: 0
ith_packets: 10000
ith_packets: 20000
len(flows): 593
total number of flows: 593. Num of flows < 2 pkts: 300, and >=2 pkts: 293 without timeout splitting.
kept flows: 293. Each of them has at least 2 pkts after timeout splitting.
flow_durations.shape: (293, 1)
col_0
count 293.000
mean 11.629
std 15.820
min 0.000
25% 0.076
50% 0.455
75% 20.097
max 46.235
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_0 293 non-null float64
dtypes: float64(1)
memory usage: 2.4 KB
None
0th_flow: len(pkts): 4
After splitting flows, the number of subflows: 291 and each of them has at least 2 packets.
True
CHAPTER
FOUR
Now that we have a good understanding of how to gather different types of network data through measure-
ment, we can begin to think about how to transform this data for input to machine learning models. While we
often focus on the modeling process as the core of machine learning, in fact the process of machine learning
constitutes a larger pipeline that involves a sequence of the following steps:
• Data Engineering
• Model Training
• Model Evaluation
We refer to this process as the machine learning pipeline. In the last chapter, we discussed the mechanics
of the first part of this pipeline: data acquisition, in the context of network data—that is, different types
of network measurement. In this chapter, we will discuss the pipeline as a whole. Generally, the machine
learning pipeline applies to models that learn from labeled examples (supervised learning) as well as from
data that does not have corresponding labels (unsupervised learning). We will focus initially in this chapter
on how the pipeline applies to supervised learning, before turning to the unsupervised context.
Figure 4.1 illustrates the machine learning pipeline. Data engineering consists of taking data input and
generating a representation appropriate for model training, associating labels with the examples in the data,
and splitting the data into training and validation sets. Model training consists of establishing a model from
the training data. Model evaluation consists of evaluating the model on the validation data and, if necessary,
tuning the model to improve its performance. The output of the pipeline is a model that can be used to predict
labels for new examples.
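As a preview of the pipeline end to end, the sketch below uses scikit-learn; the feature matrix X and label vector y are assumed to have been produced by a data engineering step like the ones described in this chapter, and the model choice is illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data engineering has produced a feature matrix X and a label vector y (assumed here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)                          # model training
y_pred = model.predict(X_test)                       # model evaluation on held-out data
print("Accuracy:", accuracy_score(y_test, y_pred))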
The first step in training any machine learning model is gathering the data and preparing it for input to a
model. In the context of network data, the last chapter taught us how to gather different types of network data
using different types of network measurement, as well as how to transform this data into features that may
be relevant for training machine learning models, such as flows. In this chapter, we will take that process a
step further, applying those techniques to create a labeled dataset for model input.
In the context of machine learning, a dataset is a collection of examples that are used to train a model. Each
example is a collection of features that are used to predict a label. For example, in the context of network data,
a dataset might consist of a collection of flows, where each flow is represented by a collection of features,
and the label is the type of application that generated the flow.
The process of preparing a dataset for input to a model (i.e., for training) is sometimes called data engineering.
This process consists of a number of steps, including cleaning the data, associating labels with the examples,
and splitting the data into training, test, and validation sets. In this section, we will discuss each of these
steps.
Machine learning models are only as good as the data that we provide them with for training. If the data
itself contains errors, including incorrect or missing values, the resulting trained models will reflect those
errors. Understanding the quality of data gathered from network measurements, such as those discussed in
the last chapter, is an important part of the data cleaning process and is a relatively well-understood practice
within the context of Internet measurement. Vern Paxson’s “Strategies for Sound Internet Measurement”
provides a more extensive introduction to the topic of understanding and cleaning data gathered from network
measurements. Here, we offer a few brief pointers on cleaning network data, inspired by some of the insights
from Paxson’s paper, with additional insights as they pertain to machine learning.
An important step towards using data for machine learning is understanding the nature and characteristics of
the data. This step is important for a number of reasons. First, it helps us understand the quality of the data,
including critical aspects such as the size of the dataset and where the data was gathered. Second, a deeper
exploration can help us understand the characteristics of the data that may ultimately lead us to defining
features that improve classification accuracy.
Let's continue with the simple example from the last chapter, where we loaded a packet trace into a Pandas dataframe and generated features. Before we proceed with feature generation or machine learning, we should understand some basics about our dataset, including, for example, how many packets are in the dataset, the start and end times of the dataset, the minimum and maximum packet lengths, and so forth. We can use the describe method of the Pandas dataframe to get a summary of the dataset. The example below shows an example of the output of the describe method for the dataframe from the last chapter.
The output from this exercise is a helpful sanity check. The “shape” command shows us how many rows and
columns we have in our data (specifically, the number of rows, corresponding to the number of packets; and
the number of columns, corresponding to the number of features). The describe command shows us some
basic statistics about the data, including the minimum and maximum values for each feature, as well as the
mean and standard deviation for each feature. The head command shows us the first few rows of the dataset;
the tail command shows us the last few rows of the dataset; and the info command shows us the data types
of each column, all of which can be helpful for sanity checking the data.
For example, we can see the minimum and maximum values for the length of each packet: 60 bytes and 1514
bytes, respectively. These values should make sense, given our domain knowledge about network traffic and
packet maximum transmission unit (MTU). Similarly, we see that the minimum port number is 53 (a well-
known service port for DNS), and the maximum port is less than 65535 (the maximum allowable port number,
given that the port field is only 16 bits). These checks show that the data is not obviously unreasonable.
import pandas as pd
from netml.pparser.parser import PCAP   # import path as given in the netml documentation

pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()
pdf = pcap.df
pdf = pdf[['datetime', 'ip_src', 'port_src', 'ip_dst', 'port_dst', 'length']]
pdf.describe()
Understanding the distribution of features can also help us understand which features may be useful for a
particular classification problem. For example, in the example above, we might hypothesize that there is
a relationship between the length of a packet and the type of application that generated the packet, or the
direction of traffic. We can test this hypothesis by plotting the distribution of packet lengths for each port
number, as shown below.
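A sketch of such a plot, assuming the packet dataframe pdf from above and a couple of illustrative destination ports:

import matplotlib.pyplot as plt

# pdf: the per-packet dataframe loaded above (assumed)
for port in (53, 80):   # illustrative ports; substitute ports present in your trace
    lengths = pdf.loc[pdf["port_dst"] == port, "length"]
    lengths.plot(kind="hist", bins=50, alpha=0.5, label="dst port {}".format(port))

plt.xlabel("Packet length (bytes)")
plt.ylabel("Number of packets")
plt.legend()
plt.show()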
Inevitably, data gathered from real-world network measurements will often contain outliers. Outliers exist
in many datasets, and there are many possible ways to deal with them. The first step in dealing with outliers
is to identify them. One way to identify outliers is to use a box plot, a common way to visualize the distribution of data: the box represents the inter-quartile range, and the whiskers typically extend to the most extreme values within 1.5 times the inter-quartile range of the box; points beyond the whiskers are drawn individually and treated as outliers. The example below shows a box plot of the minimum and maximum packet lengths for each flow in the dataset we explored with netml in the last chapter.
Supervised machine learning requires training models on labeled data. In this section we’ll talk about why
data needs labels, as well as the process of labeling data for different types of prediction problems.
The goal of supervised machine learning is to use labeled data to train a model that can then predict the correct labels (sometimes referred to as targets or target predictions) for data points that the model did not see when it was being trained.
Supervised machine learning is (somewhat) analogous to how humans learn to solve a general problem by
seeing the “correct” answer to several specific instances and identifying more general patterns that are useful
for solving any instance of the task. For example, when driving, you know how to recognize a stop light and
that the correct action when seeing one is to bring your car to a halt, even though you may have never seen that
specific stop light at that location before. In the context of networking, then, many such examples exist. A
machine learning model might automatically recognize an attack or traffic from a specific application based
on “signature” properties, even though the model has never before seen that specific traffic trace.
Any use of supervised learning to solve a problem thus necessarily involves the collection of labeled training
data. Training data typically has two aspects: examples (i.e., multi-dimensional data points) and labels (i.e.,
the “correct answers”). The training examples are observations or data points. The training labels consist
of the “correct” categories or values that a trained machine learning algorithm should assign to each of the
training examples.
In addition to training examples, creating a supervised machine learning model also requires an associated
label for each set of features. For single-label learning tasks, there will be a single label for every example
in the data set. So if you have a matrix of training examples with 20 rows, you will need to have 20 labels
(one for each example). The training labels are thus typically represented as a vector with the same number
of elements as the number of training examples. For instance, each training label for a security task might
be a 1 if the corresponding example corresponds to malicious traffic, or a 0 if the corresponding example is
benign.
For multi-label learning tasks, the data may be associated with multiple label vectors or a combined label
matrix. For example, you might want to predict users’ responses to a quality of experience questionnaire.
For each question on the questionnaire, you would need a label corresponding to each user. For convenience,
the vector (or matrix, for multi-label tasks) of training labels is usually given the variable name y. The first dimension of the training labels has the same cardinality as the first dimension of the training examples.
The example below shows an example of labeling data, based on two loaded packet captures. The first trace
is a packet capture of benign traffic, and the second is a trace of HTTP scans for the log4j vulnerability. The
example shows the use of netml to load the packet captures and extract features and the subsequent process
of adding a label to each of the resulting data frames, before concatenating the two frames and separating
into X and y, for eventual input to training a machine learning model.
hds = hd.loc[:,:4]
hds.shape
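The fragment above references dataframes prepared earlier in the full example; a more complete sketch of the workflow described here might look like the following, where the file names are illustrative and the import path is the one given in the netml documentation.

import pandas as pd
from netml.pparser.parser import PCAP

def pcap_to_features(path):
    pcap = PCAP(path, flow_ptks_thres=2)
    pcap.pcap2flows()
    return pd.DataFrame(pcap.features)

benign = pcap_to_features("benign.pcap")        # illustrative file names
benign["label"] = 0                             # 0 = benign traffic
scans = pcap_to_features("log4j_scans.pcap")
scans["label"] = 1                              # 1 = log4j scan traffic

data = pd.concat([benign, scans], ignore_index=True)
y = data["label"]
X = data.drop(columns=["label"])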
The format of the training labels distinguishes regression problems from classification problems. Regression
problems involve labels that are continuous numbers. Classification problems involve discrete labels (e.g.,
integers or strings). Classification tasks can be further categorized into binary classification and multi-
class classification by the number of distinct labels. While some supervised machine learning algorithms
can be trained to perform either classification or regression tasks, others only work for one or the other.
When choosing a machine learning algorithm, you should pay attention to whether you are attempting a
classification or regression task. Our description of common machine learning models in the next chapter
clearly states whether each algorithm is applicable to classification, regression, or both.
• Regression. Training labels for regression problems are simply encoded as real-valued numbers, usually as a floating point type. Regression may be familiar if you have experience fitting lines or curves to existing data in order to interpolate or extrapolate missing values using statistical techniques. Regres-
sion problems can also be performed with integer labels, but be aware that most off-the-shelf regression
algorithms will return predictions as floating point types. If you want the predictions as integers, you
will need to round or truncate them after the fact.
• Binary Classification. Training labels for binary classification problems are encoded as 1s and 0s.
The developer decides which class corresponds to 1 and which class corresponds to 0, making sure to
keep this consistent throughout the ML training and deployment process.
• Multi-class Classification: Training labels for multi-class classification problems are encoded as dis-
crete values, with at least three different values represented in the vector of training labels y. Note
that multi-class classification (more than two discrete classes) is distinct from multi-label classifica-
tion (multiple independent classification tasks, each with their own set of discrete classes). There are
multiple options for encoding training labels in multi-class classification problems. Since most ma-
chine learning algorithms require training labels as numeric types (e.g., int, float, double), string class
names usually need to be encoded before they can be used for ML training (decision trees are a notable
exception to this requirement). For example, if you have labels indicating users’ quality of experience
on a “poor”, “acceptable”, “excellent” scale, you will need to convert each response into a number in
order to train an algorithm to predict the QoE of new users.
One way to perform encoding for multi-class classification is to assign each class to a unique integer (ordinal
encoding). In the example above, this mapping could convert “poor” –> 0, “acceptable” –> 1, and “excellent”
–> 2. The fully encoded training labels would then be a vector of integers with one element for each training
example. Another way to encode multi-class training labels is to perform one-hot encoding, which represents
each class as a binary vector. If there are N possible classes, the one-hot encoding of the ith class will be
an N-element binary vector with a 1 in element i and a 0 in all other elements. In the example above, this mapping could convert “poor” –> [1,0,0], “acceptable” –> [0,1,0], and “excellent” –> [0,0,1].
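A sketch of both encodings using scikit-learn; note that LabelEncoder assigns integers in alphabetical order, so an explicit mapping would be needed to preserve the poor < acceptable < excellent ordering.

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

responses = np.array(["poor", "acceptable", "excellent", "poor"])

# Ordinal encoding: each class name becomes a unique integer (alphabetical order here).
ordinal = LabelEncoder().fit_transform(responses)

# One-hot encoding: each class name becomes a binary indicator vector.
onehot = OneHotEncoder().fit_transform(responses.reshape(-1, 1)).toarray()

print(ordinal)   # e.g., [2 0 1 2]
print(onehot)

One-hot encoding avoids implying an artificial ordering among classes, at the cost of one column per class.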
Although it is useful to understand these subcategories of supervised learning, it is also wise not to assign
excessive importance to their distinctions. Many machine learning algorithms are effective for either clas-
sification or regression tasks with minimal modifications. Additionally, it is often possible to translate a
classification task into a regression task, or vice versa. Consider the example of predicting improvements in
last-mile network throughput for different possible infrastructure upgrades. While it is reasonable to express
throughput as a real-valued number in Mbps, you may not care about small distinctions of a few bits per
second. Instead, you could re-frame the problem as a classification task and categorize potential upgrades
into a few quality of experience classes.
Once the data is labeled, we can use the data to train a machine learning model. Before we can do so, we need
to divide the data into training and testing sets. The training set is used to train the model, and the testing set
is used to evaluate the model’s performance. Additionally, in the process of training, a subset of the training
data is often withheld from the model to perform validation (i.e., to evaluate the model’s performance during
training), to help tune a model’s parameters. We explore each of these concepts with examples below.
The goal of supervised learning is to train a model that takes observations (examples) and predicts labels
for these examples that are as close as possible to the actual labels. For instance, a model might take traffic
summary statistics (e.g., IPFIX records) as input features and predict quality of experience labels (or target
predictions), with the goal that the predicted labels are as “close” as possible to the actual quality experienced
by the customer.
However, if you don’t know the correct labels for new observations, how do you measure whether the model
is succeeding? This is an important question and one that must be considered at the beginning of the ML
process. Imagine that you’ve trained a machine learning algorithm with training examples and training labels,
then you deploy it in the wild. The algorithm starts seeing new data and producing predictions for that data.
How can you evaluate whether it is working?
The way to solve this problem is to test the performance of the trained algorithm on additional data that it
has never seen, but for which you already know the correct labels. This requires that you train the algorithm
using only a portion of the entire labeled dataset (the training set) and withhold the rest of the labeled data
(the test set) for testing how well the model generalizes to new information.
The examples that you reserve for the test set should be randomly selected to avoid bias in most cases.
However, some types of data require more care when choosing test set examples. If you have data for which
the order of the examples is particularly important, then you should select a random contiguous subset of that
data as your test set. That maintains the internal structure that is relevant for understanding the test examples.
This is particularly important for chronological data (e.g. packets from a network flow). If you do not select
a contiguous subset, you may end up with a nonsensical test set with examples from one day followed by
examples from a year later followed by examples from a few months after that.
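A sketch of both ways of holding out a test set with scikit-learn; X and y are the examples and labels, and for the chronological case they are assumed to already be sorted by time.

from sklearn.model_selection import train_test_split

# X and y: examples and labels from the data engineering step (assumed).
# Random split: appropriate when examples are independent of one another.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Chronological data: hold out one contiguous block (here, the most recent 20%).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]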
This brings us to the golden rule of supervised machine learning: never train on the test set! Nothing in
the final model or data pre-processing pipeline should be influenced by the examples in the test set. This
includes model parameters and hyperparameters (which are parameters of the learning process itself—more
on these terms below). The test set is only for testing and reporting your final performance. This allows the
performance evaluation from the test set to be as close an approximation as possible to the generalization
error of the model, or how well the model would perform on new data that is independent and identically distributed (i.i.d.) with respect to the test set. For example, when you report that your model performed with 95% accuracy on a test set, this means that you also expect 95% accuracy on new i.i.d. data in the wild. This caveat about
i.i.d. matters, because if your training data and/or test set is not representative of the real-world data seen by
the model, the model is unlikely to perform as well on the real-world data (imagine a trained Chess player
suddenly being asked to play in a Go tournament).
The best way to avoid breaking the golden rule is to program your training pipeline to ensure that once you
separate the test set from the training set, the test set is never used until the final test of the final version of
the model. This seems straightforward, but there are many ways to break this rule.
A common mistake involves performing a pre-processing transformation that is based on all labeled data
you've collected, including the test set. A common culprit is standardization, which involves rescaling each feature (column) of your dataset so that it has zero mean and unit variance. This is a very common pre-processing step, but if you compute standardization factors (the per-feature means and variances) based on the
combined training and test set, your test set accuracy will not be a true measure of the generalization error.
The correct approach is to compute standardization factors based only on the training set and then use these
factors to standardize the test set.
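A minimal sketch of the correct ordering using scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# X_train and X_test are assumed to come from the earlier train/test split.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # means and variances estimated from the training set only
X_test_std = scaler.transform(X_test)         # the same factors applied, unchanged, to the test set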
Another common mistake involves modifying some hyperparameter of the model after computing the test
error. Imagine you train your algorithm on the training set, you test it on the test set, and then you think,
“Wow, my test accuracy was terrible. Let me go back and tweak some of these parameters and see if I can
improve it.” That’s an admirable goal, but poor machine learning practice, because then you have tuned your
model to your particular test set, and its performance on the test set is no longer a good estimate of its error
on new data.
Validation Sets
The correct way to test generalization error and continue to iterate, while still being able to test the final
generalization error of the completed model, involves dividing your labeled data into three parts:
1) a training set, which you use to train the algorithm,
2) a test set, which is used to test the generalization error of the final model, and
3) a validation set, which is used to test the generalization error of intermediate versions of the model as
you tune the hyperparameters.
This way you can check the generalization performance of the model while still reserving some completely
new data for reporting the final performance on the test set.
If the final model performs well across the training, validation, and test sets, you can be quite confident that
it will perform well on other new data as well. Alternatively, your final model might perform well on the
training and validation sets but poorly on the test set. That would indicate that the model is overly tuned to the
specificities of the training and validation sets, causing it to generalize poorly to new data. This phenomenon
is called overfitting and is a pervasive problem for supervised machine learning. Many techniques have been
developed to avoid and mitigate overfitting, several of which we describe below.
While dividing your labeled data into training, validation, and test sets works well conceptually, it has a
practical drawback: the amount of data used to actually train the model (the training set) is significantly
reduced. For example, if you divide your data into 60% training, 20% validation, and 20% test, that leaves
only 60% of your data for actual training. This can reduce performance: the majority of ML models perform
better with more training data.
One solution to this is to use a process called cross validation, which allows you to combine the training
and validation sets through a series of sequential model trainings called folds. In each fold, you pull out a
subset of the training data for validation and train the model on the rest. You repeat this process with non-
overlapping subsets for each fold, such that every training example has been in a validation fold once. In a
5-fold cross-validation, each fold would use 20% of the non-test data as a validation set and the remaining 80% for training. This would result in 5 different validation performance measurements (one for each fold) that you can average together to obtain the average cross-validation performance.
Cross-validation provides a more robust measure of generalization performance than fixed training and val-
idation sets. It is so common that supervised machine learning results are often reported in terms of the
average performance on an N-fold cross-validation.
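A sketch of 5-fold cross-validation with scikit-learn, applied only to the non-test data; the model choice is illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X_train and y_train: the non-test portion of the labeled data (assumed).
scores = cross_val_score(RandomForestClassifier(random_state=0), X_train, y_train, cv=5)
print("Mean accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))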
So how many folds are enough? Generally, the more folds, the better estimation of model performance. In
the extreme case, called leave-one-out cross-validation, each fold uses a validation set with just one example,
allowing the rest to be used for training. Each fold tests whether the model correctly predicts this one example
when trained on all other examples. This provides the best estimate of generalization performance, because
you’re training with the most data possible. If you have labeled 100 examples in your non-test set, leave-one-
out cross-validation involves performing 100 folds and averaging the accuracy across all 100. You can
then repeat this process while tuning model hyperparameters to check the effects on generalization accuracy.
Unfortunately, leave-one-out cross-validation has its own drawbacks. Every cross-validation fold involves
training a model, and if the training process is resource intensive, leave-one-out cross-validation can take a
long time. Some models, especially neural network models, take a long time to train, so doing more than
a few folds becomes impractical. For such models, a smaller number of folds (often 5 or 10) are used in
practice.
Training a supervised machine learning model involves adjusting the model’s parameters to improve its ability
to predict the training labels from the training examples. This adjustment process takes place automatically,
typically using an iterative algorithm (such as gradient descent) to gradually optimize the model’s parameters.
If the training process works correctly, it will produce a model that performs as well as possible on the training
set. Whether or not this trained model will generalize to new data must be tested using a validation and/or
test set.
Model training.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0,
                            n_jobs=-1,
                            n_estimators=100,
                            class_weight='balanced')
rf.fit(X_train, y_train)   # X_train and y_train are assumed to come from an earlier train/test split
A few combinations of models and error functions have closed-form solutions for minimization, allowing you
to directly solve for the optimal model parameters. However, most combinations used in practice either don’t
have a closed-form solution, or the computational cost of the closed-form solution is infeasible. The common
solution is to apply an iterative process, usually a version of gradient descent, starting with arbitrarily chosen
initial parameter values and incrementally improving them until values close to the optima are found.
The Appendix provides an activity that involves training a simple machine learning model to distinguish two classes of traffic, based on features derived from netml. This activity can get you familiar with the basics of processing a packet capture into features using netml and training a model using scikit-learn.
Gradient descent is a mathematical representation of the idea that if you can’t see the solution to a problem,
just repeatedly take one step in the right direction. You can think of it as analogous to walking downhill
to reach the lowest point in a landscape. By computing or approximating the gradient of the error of a
model with respect to the model’s parameters, it is possible to update the parameters “down” the gradient
and improve the model’s performance. Computing or approximating a gradient is often easier than directly
solving for a minimum, making gradient descent a feasible way to train many types of complex models.
The size of the “step” that each parameter takes during each gradient descent iteration can be tuned using a hyperparameter called the learning rate. A high learning rate will update the parameters by a relatively large amount with every step, while a low learning rate will update your parameters by a small amount with every step.
There exist many algorithms for dynamically tuning the learning rate during the training process to improve
training speed while avoiding some common pitfalls of gradient descent (more on these below). In practice,
you will likely use one of these existing algorithms rather than controlling the learning rate manually.
The gradient descent process continues until the model parameters stop changing between iterations (or at least until they change by less than a predetermined small value). When this happens, the parameters define a model that is at a local minimum of the error function, which is hopefully close to the global optimum.
There are a few variations of gradient descent that gain even more computational efficiency. The first, batch gradient descent, uses the entire training data set to compute the gradient at each iteration. This provides an accurate gradient value, but may be computationally expensive. An alternative approach, stochastic gradient descent, uses only a single example from the training set to estimate the gradient at each iteration. After N
iterations, all N examples in the training set will have been used for a gradient estimate once. Stochastic
gradient descent allows for faster parameter update iterations than batch gradient descent, but the estimated
gradients will be less accurate. Whether or not this results in a faster overall training time depends on the
specifics of the model and the training set. Mini-batch gradient descent is a third option that allows for
a tradeoff between the extremes of batch and stochastic gradient descent. In mini-batch gradient descent,
a hyperparameter called the batch size determines how many training examples are used to estimate the
gradient during each iteration.
This provides a great deal of flexibility, but introduces the need for one more term frequently used when
discussing model training: the epoch. One epoch is one sequence of gradient descent iterations for which
each example in the training set is used to estimate exactly one gradient. In batch gradient descent, iterations
are equivalent to epochs, because all training data is used for the gradient calculation. In stochastic gradient
descent, each epoch consists of N iterations for a training set with N examples. The number of iterations per
epoch in mini-batch gradient descent depends on the batch size. With N training examples and a batch size
of B, there will be ceiling(N/B) iterations per epoch.
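The following sketch ties these terms together on a small synthetic linear regression problem: it runs mini-batch gradient descent with a fixed learning rate and batch size, performing ceiling(N/B) iterations per epoch. The data and hyperparameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                        # 1,000 synthetic examples, 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)                                       # arbitrarily chosen initial parameters
learning_rate, batch_size, n_epochs = 0.1, 32, 20
iters_per_epoch = int(np.ceil(len(X) / batch_size))   # ceiling(N / B) iterations per epoch

for epoch in range(n_epochs):
    order = rng.permutation(len(X))                   # shuffle the examples each epoch
    for i in range(iters_per_epoch):
        idx = order[i * batch_size:(i + 1) * batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)     # gradient of the mean squared error
        w -= learning_rate * grad                     # step "down" the gradient

print(np.round(w, 2))                                 # should be close to w_true

With 1,000 examples and a batch size of 32, each epoch here performs ceiling(1000/32) = 32 iterations.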
Gradient descent is at the heart of machine learning, but it is not without problems. Many ML models, especially deep learning models, have error surfaces that are non-convex, so rather than converging on the global minimum, gradient descent can get stuck in a local minimum or on a plateau and become unable to make further progress. Getting stuck on a plateau, where gradients become very small, is related to the vanishing gradient problem that especially plagues deep learning models. Fortunately, algorithms that dynamically modify the learning rate during training can help avoid these situations. For example, the simulated annealing approach starts with a large learning rate and gradually decreases the rate throughout the learning process, making it less likely that the process gets stuck far away from the global minimum.
Overfitting is a common problem for many supervised machine learning models. An “overfit” model has
been overly tuned to idiosyncrasies of the training set and doesn’t generalize well to new data.
Machine learning models can either be underfit or overfit. An underfit model is too restrictive and fails to
capture the relevant structure or details of the data being modeled. For example, fitting a linear function to a
dataset with nonlinear characteristics would result in an underfit model. Such a model will demonstrate bias: systematic errors in its predictions. On the other hand, an overfit model captures noise or random fluctuations in the data, which may not generalize well to new, unseen data. An overfit model will display high variance, as its parameters will change substantially with every new training data set. Hence we talk about the bias-variance tradeoff: the effort to find a model that is complex enough to fit the data without being overfit.
The balance between bias and variance is crucial in machine learning. If a model has high bias, it may result
in poor performance on both the training and test data. On the other hand, if a model has high variance, it
may overfit the training data and perform poorly on the test data, not to mention in production when it starts
to see new data that doesn’t conform to the overfit model.
It’s useful to be able to detect when your model is overfitting. One way is to compare the model’s training
error and validation error when trained on increasing fractions of the training set or for increasing number of
gradient descent steps. As long as the training error is close to the validation error, you can be fairly confident
that the model isn’t overfitting (and is likely underfitting, requiring more training data or training iterations
to improve its performance). If the validation error is significantly worse than the training error, the model
is likely overfitting.
There are several ways to deal with overfitting. The best approach is to collect more training data, such that your training set is more representative of any new data the model will see in the wild. Another approach, early stopping, involves stopping the training process when the training error and validation error start to diverge. Regularization is another approach that places a penalty on parameter complexity during the training process, helping to prevent the model from becoming overly tuned to the training examples.
Another approach is to use ensemble methods, such as bagging and boosting, which combine multiple models
to reduce variance and improve overall performance. Bagging involves training multiple instances of the same
model on different subsets of the training data, while boosting involves combining the predictions of multiple
weak models to create a stronger model. Additionally, cross-validation among different models can be used
to assess the performance of a model and select the best hyperparameters or model configurations.
Before we dive into learning about some specific machine learning algorithms, we need to define a few more
concepts to help us determine whether these algorithms are meeting their goals. First, we need to define
what we mean by “close to the real label.” This is another way of saying “how do we measure the success
(or accuracy) of the model?”
There are many performance metrics that we can use to evaluate a machine learning model. In general, these
metrics compare predictions that a trained model produces to the correct labels and produce a value or other
indication (e.g., a curve) that reflects the performance of the model.
For regression problems, you'll often use an error function such as mean squared error or mean absolute error. For example, given some data points and a fitted linear regression line, you can measure how far each point falls from the line and report the average of those errors (or of their squares) as the performance of your model. The training process attempts to minimize this error to produce a regression model whose predictions are, on average, as close as possible to all data points.
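For instance, a minimal sketch (assuming arrays y_test of true values and y_pred of model predictions, both hypothetical here) computes both metrics with scikit-learn:
# Compute mean squared error and mean absolute error for a regression model.
# Assumes y_test (true values) and y_pred (predictions) already exist.
from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("MSE: {}, MAE: {}".format(mse, mae))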
The Appendix provides an activity to train and evaluate a model end-to-end, including evaluating a
supervised learning model with various metrics.
For classification, we can’t use mean or absolute errors directly, because the class labels are not real-valued
and may not be ordinal.
• Accuracy. Accuracy is a popular performance metric for classification problems because it is intuitive. The accuracy of a model is the ratio of correct predictions to all predictions. For some problems, this works pretty well, but for others, reporting accuracy alone is deceptive.
Imagine you're training a machine learning algorithm to take in packets and classify them as "malicious" or "benign". If you just use the naive algorithm "always predict benign", the accuracy will be high, because most packets aren't malicious. So if the actual classes aren't evenly balanced among all of the possible classes, your accuracy score isn't going to be as intuitive as it should be. For example, you shouldn't necessarily assume that anything over 50% accuracy is great: 50% is only a meaningful baseline for binary classification where you expect even numbers of examples of both classes. In multi-class problems or tasks where certain classes are rare, accuracy isn't a great way to report your results.
Accuracy
# Compute accuracy (assumes y_test and y_pred from a trained classifier)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {}".format(accuracy))
Accuracy: 0.9891944990176817
Precision-recall curve for evaluating the results of a classifier on test data.
# Assuming you have already trained and tested a binary classification model,
# and have true labels y_true and predicted scores y_score (e.g., from predict_proba):
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Plot PR curve
plt.plot(recall, precision, color='blue', label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.savefig('pipeline_pr-curve.png')
plt.show()
One possible way to choose where to land on a curve such as this is to select the Pareto-optimal setting, where any improvement you make in one direction is outweighed by the tradeoff that you'd make in another direction. However, as noted above, this choice is also likely to depend on the application's tolerance for different classes of error.
• F1 Score. Sometimes you want to report a single number that says how well your algorithm performs instead of both precision and recall scores. The standard approach is to report an F1 score, which is the harmonic mean of precision and recall. The harmonic mean weights low values more heavily, so a high precision won't offset a particularly low recall, or vice versa.
F1 Score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
print("F1 Score: {}".format(f1))
F1 Score: 0.9942257217847769
• Receiver Operating Characteristic. The receiver operating characteristic (ROC) is another performance measure that uses the true positive rate and the false positive rate. The true positive rate is exactly the same as recall: the ratio between the correctly labeled members of the class and the total size of the class. The false positive rate is a little different from precision: it is the fraction of examples that are not in the class of interest but that we incorrectly labeled as the class of interest.
As with precision and recall, you can tune hyperparameters to affect the true positive rate and false
positive rate. You can then plot their relationship on an ROC curve. As you change the parameter to
increase the true positive rate, you may end up with more false positives, too. If you want a single metric
to use for the algorithm performance that incorporates the fact that you can tune these parameters, you
can use the area under the ROC curve. Since your algorithm would ideally be able to achieve a high
true positive rate with a very low false positive rate, you’d have a curve that is right along the edge of
the graph. The ideal area under the curve (AUC) is 1.
Receiver operating characteristic for evaluating the results of a classifier on test data.
# Plot the ROC curve (assumes fpr and tpr computed with sklearn.metrics.roc_curve; see below)
plt.plot(fpr, tpr, color='blue', label='ROC curve')
plt.savefig('pipeline_roc-curve.png')
plt.show()
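If you also want the scalar area under the curve, a minimal sketch along these lines (an assumption; it presumes true labels y_test and predicted scores y_score, e.g., from predict_proba) computes both the curve and the AUC:
# Compute the ROC curve and the area under it (AUC).
# Assumes y_test (true labels) and y_score (predicted scores) already exist.
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC: {}".format(roc_auc_score(y_test, y_score)))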
• Confusion Matrices. Confusion matrices go beyond a single number or a scalar metric for classifica-
tion performance and instead provide details about how your model is making mistakes. A confusion
matrix plots the frequency or the number of examples that were given a particular label versus the
correct label. If your model is doing perfectly, you will end up with a diagonal matrix where all of the
examples are in the correct class. If your model is sometimes making mistakes, there will be numbers
outside of this diagonal. You can then investigate these examples to see why they might be more dif-
ficult for the model and focus on these cases to improve your performance. Confusion matrices are
useful for debugging and iterative improvements.
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
[[ 60 8]
[ 3 947]]
CHAPTER FIVE: SUPERVISED LEARNING
In this chapter, we will discuss supervised learning, the process by which a machine learning model can
learn from labeled data (sometimes called labeled examples). Supervised learning requires having access
to one or more labeled datasets—data that has not only the features, but also an associated label for each
data point. For example, in the case of malware classification, the features might include metrics from the
network traffic (e.g., bytes per second, packets per second, number of IP addresses contacted), and the labels
could be whether the traffic is being generated by a malicious software program (“malware”).
In this chapter, we will describe a variety of supervised learning models, using examples from networking
as an illustrative guide. We do not assume you’ve seen these models before, and so readers who want to get
basic intuition behind different models and how they can be applied in different network settings should find
this chapter illuminating. Readers who are already familiar with these models may also find the discussion
helpful, as the examples in the chapter present cases where particular models or types of models are suited
to different classification problems, as well as cases in the networking domain where these models have been
successfully applied.
We organize the discussion of supervised learning in this chapter into the following categories:
1. non-parametric models (i.e., models where the size of the model grows with the size of the dataset);
2. linear models (i.e., models based on a linear prediction function, including possible basis expansion);
3. tree-based models;
4. ensemble methods (i.e., models that make predictions by combining the predictions of simpler models);
and
5. deep learning models (i.e., those that use deep neural networks to derive representations of the data).
Non-parametric models grow as the size of the dataset grows. Perhaps the most well-known and widely used
non-parametric model is k-nearest neighbors (k-NN). We describe this model below as well as examples
where k-NN has been used in networking. We also provide examples for you to try.
With k-nearest neighbors (or k-NN), the training data is treated as a set of vectors in an n-dimensional space,
where n is the number of features in the data elements. When trying to predict a label for a new example,
the k-nearest members of the training data set are considered. We predict the correct label by examining the
labels of those nearest neighbors. For example, if the majority of close neighbors of a new example are labeled “malware,” then we would predict that the new element should also be labeled “malware.”
There are multiple ways to define the distance function (e.g., Euclidean distance can be used) to determine the k-nearest neighbors, and k is a hyperparameter. Once we have those neighbors, we use their labels to predict the correct label for the new example, and again there are multiple ways to do this (such as taking the mean or the mode).
Training a k-nearest neighbors model is simple and efficient: there is nothing to do other than store the training data in a data structure, such as a k-dimensional tree (or k-d tree), that makes it easy to compare distances between a data point and observations in the training data. All of the work and computational effort thus occurs during prediction, where these distances are computed.
The choice of k, a hyperparameter, can be made using a standard tuning approach, such as a simple grid search over candidate values. This choice can be validated using a validation set.
Because k-NN is sometimes impractical (e.g., storing the training set may be prohibitive, inference time can be large), the model is often used as a baseline against which to compare the performance of other, more practical models. k-NN is relatively easy to train and optimize, but its computational performance at prediction time is poor if the training set has many examples, and it gets worse as the training set grows.
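The following is a minimal k-NN sketch (an illustration under assumed inputs, not code from the accompanying notebooks); it presumes a feature matrix X and label vector y, and it standardizes the features before computing distances, for the reasons described below.
# Hedged sketch: k-NN classification with standardized features.
# Assumes a feature matrix X and label vector y already exist.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features, then classify by majority vote among the 5 nearest neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("Test accuracy: {}".format(knn.score(X_test, y_test)))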
Because k-NN operates by distance computations in a multi-dimensional space, the model works best when
features are standardized to have zero mean and unit variance. Otherwise, the closest examples may be
overwhelmed by one feature that just happens to have values in smaller units.
Unfortunately, k-NN also scales poorly with the number of features in each example (i.e., the dimensionality
of the feature set). Many functions that can be used to compute distance between vectors don’t work very well
in high dimensions, because in higher dimensional spaces, everything starts to be far away from everything
else. Even if two examples happen to be close to each other on one dimension, the fact that there are so many
dimensions in the data set means that they are likely to be far away on others. As you’re adding the distances
along each dimension or taking the square of the distances along each dimension, those distances can start
to blow up. This phenomenon is called the curse of dimensionality.
You can sometimes address this by using distance functions tailored to particular types of data, but in general the curse of dimensionality is an inherent problem for geometric models that rely on vector similarity or vector distance.
Despite their simplicity, k-NN models have been successfully used in various contexts to perform basic
classification tasks using network data. Examples include: (1) positioning and geolocation; (2) website or
device fingerprinting; and (3) attack classification and detection (e.g., DDoS detection). Such classifiers are
common in research papers, particularly as a baseline approach or model, and in attack papers where the
attack, such as website fingerprinting, need not be efficient (e.g., if it can be performed offline, or simply
to demonstrate feasibility of an attack). In practice, other models are more common, particularly due to the
computational requirements that can make k-NN inefficient in practice.
Linear models assume a linear relationship between the input features and the target prediction. In other
words, the model assumes that the outcome or variable being predicted can be expressed as a linear rela-
tionship among the input features. Many prediction problems are well-suited to linear models, or simple
extensions to linear models, which we will discuss in more detail below.
When predicting a continuous target variable, linear regression can often be a simple, efficient, and effective
model. When predicting a categorical target variable, logistic regression is often a good choice. We will
discuss both linear regression and logistic regression in this chapter, and provide example applications in the
networking domain where each technique can be effective.
The Appendix provides an activity to train and evaluate a linear regression model to predict various
characteristics of network traffic from input features (e.g., predicting flow size from flow duration).
Training a linear regression model involves finding the weight and bias parameters to minimize the error
between the predicted labels and the actual labels for the training set. There are many error functions that
can be used for this training minimization. A common choice, the mean squared error, is also convenient for training:
$$\mathrm{Error} = \frac{1}{m}\sum_{i=1}^{m}\left(\mathbf{w}^{T}\mathbf{x}_i - y_i\right)^{2}$$
We can use either a closed-form solution or gradient descent to find values of w that minimize this error across the data, x, in the training set. The example below fits a linear regression model based on the linear relationship between an input feature, number of packets, and a target variable, number of bytes. While this example is simple, it illustrates how a linear model can be used to predict a target variable based on a single input feature.
###########
# Step 1: Create the linear regression model
# (imports and model creation reconstructed here; the original listing elided them)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assumes x (packet counts) and y (byte counts) are NumPy arrays loaded earlier.
linear_regression = LinearRegression()

# Step 2: Fit the model to the packet counts (x) and byte counts (y)
linear_regression.fit(x.reshape(-1, 1), y)

# Step 3: Inputs to the predict function are x values that the model has never seen.
y_hat = linear_regression.predict(np.array([-2, 4000]).reshape(-1, 1))

###########
# Plot the original points along with the fitted line
plt.plot(x, y, '.', color='blue', markersize=12)
plt.plot([-2, 4000], y_hat, color='green')
plt.grid(linestyle='--', alpha=0.6)
plt.xlabel('Packets')
plt.ylabel('Bytes')
Linear regression is an appropriate modeling tool when there is some linear relationship between the input
features and the target variable. This is sometimes the case in various network applications and scenarios,
as we see above. Sometimes, even if the relationships are more complex than a strict linear relationship,
linear models can still be used as the basis for a model. Below, we will explore how linear regression can be
used with an approach called basis expansion to allow a linear regression model to continue to capture more
complex relationships between features and target predictions.
Sometimes, the relationship between the input features and the target variable is not linear, but can be better
approximated by a polynomial. In this case, we can use a linear model to predict the target variable, but use
a non-linear transformation of the input features. This is called polynomial regression, or basis expansion.
Polynomial regression involves preprocessing the features in the original dataset to include polynomial com-
binations of existing features. For example, polynomial basis expansion could add to the initial set of features
the square of each feature, some higher-order polynomial expansion of each feature, the pairwise products
of the features, and so forth. Training a linear regression that takes a set of features that is a polynomial
expansion of features, effectively performs training on a quadratic (or higher degree) function because the
model is attempting to fit linear combinations of second-degree combinations of parameters to predict the
target outcome.
Polynomial expansion generalizes to higher-order degree polynomials. It is possible to perform this operation
for features that include all of the third-degree combinations of the original features. Doing so would quickly
increase the number of features as you increase the degree of the polynomials, but makes it possible to apply
the same (simple) linear regression training task, while at the same time modeling higher degree patterns
between the features and target variables. This process makes it possible to learn relationships that aren’t
just strictly linear relationships, while preserving the ability to apply standard linear models to prediction
problems.
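As a sketch of how this looks in practice (an illustration, not code from the accompanying notebooks), scikit-learn's PolynomialFeatures can perform the expansion before an ordinary linear regression is fit; here x and y are assumed to be the packet and byte counts from the earlier example.
# Hedged sketch: degree-2 basis expansion followed by ordinary linear regression.
# Assumes x (input feature) and y (target) are one-dimensional NumPy arrays.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x.reshape(-1, 1), y)
y_hat = poly_model.predict(np.array([10, 100, 1000]).reshape(-1, 1))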
Another common form of regression using linear models is logistic regression. Logistic regression follows almost exactly the same process as linear regression, with one minor change: rather than using the linear combination of parameters and features directly, the output of that linear combination is passed through a sigmoid function, which constrains the output prediction to the [0,1] range. The output of the sigmoid is thus either very close to 0 or very close to 1, with a region in the center where the transition happens fairly quickly. Such a model can be very useful for binary classification problems.
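Concretely, the sigmoid (logistic) function applied to the linear combination $z = \mathbf{w}^{T}\mathbf{x} + b$ is
$$\sigma(z) = \frac{1}{1 + e^{-z}},$$
which approaches 0 for large negative values of z and 1 for large positive values of z.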
# Load a packet capture and extract DNS packet lengths and request/response labels.
# (Imports and model fitting reconstructed; the original listing elided them.)
import numpy as np
import matplotlib.pyplot as plt
from netml.pparser.parser import PCAP
from sklearn.linear_model import LogisticRegression

hpcap = PCAP('../examples/notebooks/data/http.pcap')  # path assumed from later examples
hpcap.pcap2pandas()
pcap = hpcap.df
dns_packets = pcap.loc[pcap['is_dns'] == True]
dns_ft = dns_packets.loc[:, ['length', 'dns_resp']]
dns_ft['response'] = dns_ft['dns_resp'].astype(bool)
dns_ft['response'] = dns_ft['response'].astype(int)
x = dns_ft['length'].values
y = dns_ft['response'].values

# Fit a logistic regression model: packet length -> request (0) or response (1)
regr = LogisticRegression()
regr.fit(x.reshape(-1, 1), y)

# Plot data
# z is a simple number line from 1 to 600
z = np.arange(1, 600, 0.5).reshape(-1, 1)
# prediction: plot the number line against the predicted probabilities for those values
plt.plot(z, regr.predict_proba(z)[:, 1], color='green')
plt.plot(x, y, 'x', color='red')
plt.ylabel("Request (0) or Response (1)")
plt.xlabel("DNS Packet Length")
plt.grid(linestyle='--', alpha=0.6)
plt.savefig('supervised_logistic-regression.png')
plt.show()
The Appendix provides an activity to train and evaluate a simple logistic regression model to predict
whether a packet is a DNS request or response based on its size.
When performing classification, you want to know whether a data point is in a specific class or not, so by wrapping the model output in a sigmoid, you say: if the output is greater than 0.5, assume class 1, and if the output is less than 0.5, assume class 0. In most cases, the output will already be very close to 1 or 0. Everything else in the training process works the same way; you just compute the gradient of the error function using this as your predictor. When you compute the gradients, you can use the chain rule to compute the partial derivatives.
The sigmoid function is continuous and differentiable, so this poses no problem: gradient descent works in exactly the same way, with the update equations slightly different as a result of the sigmoid. Logistic regression also generalizes to the multi-class case; the generalization is called “softmax regression”, and we will return to the softmax function later, particularly in the discussion of deep learning.
5.2.4 Regularization
Ridge regression is linear regression with the L2 norm of the coefficients/parameters added to the error function. The weight of this term is controlled with a hyperparameter that makes it possible to tune the relative emphasis given to the simplicity of the model (L2 penalty) versus the fit of the model. Setting this hyperparameter to a large value creates an incentive to keep the parameter values smaller in magnitude, and thus results in a simpler model. A smaller value for this hyperparameter will incentivize the optimization to closely fit the training data, even if it results in a more complex model with larger parameter values.
Lasso Regression is linear regression with the L1 norm of the parameters added to the error function. Instead
of using the Euclidean magnitude of the parameter vector as the penalty, Lasso regression uses the sum of
the absolute values of the parameters. One benefit of using the L1 norm is that it can cause the values of
the parameters that aren’t particularly important to be set to zero. In these cases, these features could be
removed altogether, further simplifying the model. Unfortunately, Lasso regression gradients can start to act
erratically if there are many correlations between features, which can result in a situation where as you get
closer to the minimum using gradient descent, your updates start to oscillate, rather than settling into the
final value.
To get the benefits of both Lasso and ridge regression, it is possible to combine them into ElasticNet. The
cost function for ElasticNet includes the original error function for linear regression, the term for L1 penalty
(from Lasso), the term for L2 penalty (from ridge), and another hyperparameter, r, that determines how to
mix the two penalties. The larger the value of r, the more it behaves like Lasso. The smaller the value of r,
the more that the model looks like ridge regression. For the most part, if your dataset is simple enough for
a model to perform well using any one of these approaches, it will also likely perform well using any of the
others. Data is generally either amenable to one of these linear models, or these models just don’t provide
enough expressivity and it won’t matter which regularization option you choose [CITATION NEEDED].
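A minimal sketch of the three regularized variants in scikit-learn (an illustration; the alpha and l1_ratio values are arbitrary, and X and y are assumed to exist):
# Hedged sketch: ridge, lasso, and elastic net regression.
# alpha controls the overall penalty weight; l1_ratio plays the role of r above.
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty; may zero out coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 penalties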
If you are trying to perform a binary (or multi-class) classification task with linearly separable data, the
optimal model will consist of a line (or plane, or hyperplane) that divides the feature space such that all of
the examples on one side of the line are in one class and all of the examples on the other side of the line (or
plane, or hyperplane) are in the other class. When asked to predict the class of a new example, you make the
prediction based on which side of the line the example occurs.
Support vector machines (SVMs) are very common, and they perform very well for small datasets. A dataset that is small but contains features that separate well in linear space may be very amenable to SVMs. SVMs can produce predictions that are robust to overfitting and are often good at generalizing. It is also possible to perform regression with SVMs; to do so, the aim is to fit all of the data within the margin instead of outside the margin, and the decision line (or plane) thus becomes a regression line.
The question remains, how do we choose which line or plane to use for this model? There might be an
infinite number of planes that separate the data, how do you choose the one with the best chance of optimizing
prediction accuracy? Remember that prediction accuracy comes down to how well the model generalizes to
new data outside the training set. The core intuition behind an SVM is that the best separating line or plane
is the one with the most space between training examples of different classes—the maximum margin, or the
“max margin”.
Training an SVM involves finding the line that maximizes the margin, i.e., the space separating the line from the training data. The examples that end up closest to this optimal line are called the support vectors. These are the examples that determine the position of the line. If you were to collect a lot more data, but all of that data were to fall further from the separating line than the existing support vectors, it wouldn't change the position of the line. This makes support vector machines fairly robust to overfitting, because the only data that affect the ultimate position of the model are the examples closest to these margin boundaries.
The SVM training process involves finding the separating line (or plane) with the maximum margin. Of
course, real datasets are rarely linearly separable, so we add another variable to the model that allows for
some slack, i.e. for some training examples to fall on the wrong side of the line or plane. The primary
goals of training are to find parameters W and b that minimize the error between the predictions and the
actual values that also maximize the margin. This goal can be achieved either with a quadratic programming
solver or with gradient descent, which are typically programmed into the SVM models in machine learning
libraries. For a linear SVM, the predicted label is a piecewise function of a linear combination of the features with weights w and bias b: if that linear combination is less than zero, we predict class 0; if it is greater than or equal to zero, we predict class 1:
$$\hat{y} = \begin{cases} 0 & \text{if } \mathbf{w}^{T}\mathbf{x} + b < 0 \\ 1 & \text{if } \mathbf{w}^{T}\mathbf{x} + b \ge 0 \end{cases}$$
This combines the algebra and the geometry: the sign of the dot product indicates whether an example falls “above” or “below” the separating line.
Regularization is also possible with SVM models: the higher the value of the hyperparameter C, the more importance the model places on getting the classifications right (i.e., all examples on the correct sides of the margins). The lower you make C, the less importance the model places on a few incorrect classifications as it attempts to find the largest margin possible.
Finally, there are multiple ways to use SVMs to perform multiclass classification. A simple approach is one-versus-rest: if you have N classes, you train N binary SVM classifiers. The first classifier predicts whether an example is in class 1 or some other class; the second predicts whether an example is in class 2 or some other class; and so on. Each of those classifiers gives you a different decision line, so you run a prediction with all of them and decide which prediction is best for each example. Other approaches include one-versus-one.
Of course, many datasets are not linearly separable. If you have a dataset that relies on nonlinear interactions between features, a linear SVM is going to perform poorly. One option, as we saw with polynomial regression, is to take the existing features and compute polynomial combinations of them. However, such an expansion can generate too many features for higher-degree combinations, degrading both computational performance and model accuracy.
One solution to this problem is to apply the kernel trick, which relies on the fact that data that isn't linearly separable in low dimensions may be linearly separable in higher dimensions. Additionally, the SVM training algorithm can be reformulated (into the dual form) to involve only similarity metrics between example features, never the exact values of the features themselves. This allows you to use a kernel function in your model, which computes the distance between examples in a higher-dimensional space without ever actually having to project those examples into the higher-dimensional space.
With a kernel K, the prediction for a new example x takes the form
$$\hat{y} = \operatorname{sign}\Big(\sum_{i} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\Big),$$
where the sum runs over the training examples and the learned coefficients $\alpha_i$ are nonzero only for the support vectors.
There are many well-studied kernel functions that existing machine learning libraries provide, each with pros
and cons. Nonetheless, they all allow us to adapt linear SVMs to nonlinear data.
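For example, a minimal sketch using scikit-learn's SVC with an RBF kernel (an assumption; X and y are hypothetical, and the C and gamma settings are arbitrary):
# Hedged sketch: kernel SVM classification with an RBF kernel.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features, then fit a soft-margin SVM with the RBF kernel.
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm.fit(X_train, y_train)
print("Test accuracy: {}".format(svm.score(X_test, y_test)))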
Let’s now explore a classifier called the naive Bayes classifier. The naive Bayes classifier is a type of classi-
fication method in a family called kernel density classification. Naive Bayes works well on small data sets,
it’s computationally efficient for both training and prediction, and it works well in a variety of settings.
It typically requires knowing or estimating the probability density of the feature space from which we're trying to perform the estimation. There's a notion of an optimal Bayes classifier: if you knew the underlying probability distribution of the data, you could formulate an optimal classifier as follows: given a set of observations X, pick the class Y that maximizes the posterior probability of Y given those observations. Doing so requires knowing the probability distributions of the features on which we're performing this estimation. That distribution typically isn't known, but we can make some assumptions. What's commonly done is to select a family of parametric distributions, such as a Gaussian, Bernoulli, or multinomial distribution, and then estimate the parameters of that distribution. Once we've estimated the density of those features, we can perform this computation.
The Appendix provides an activity to train and evaluate a Naive Bayes classifier to perform spam classi-
fication.
If we’re trying to estimate the probability of Y being a particular value or class given a set of observations,
X, we can use Bayes rule to express that as the posterior probability of X given Y times the prior probability
of Y, divided by the prior probability of observing those X values. The naive Bayes classifier is called a
probabilistic classifier, because not only does it make a class prediction, but it can also predict the likelihood
of an observation being of a particular class, given a set of observations. The naive Bayes classifier compares
posterior probabilities for each possible class. Because those probabilities for each observation does not
depend on prior probability of observing the feature values, X, we can drop the denominator and just compare
those numerator values and the class value for y that has the largest posterior numerator is the predicted class.
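In symbols, Bayes' rule gives
$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},$$
and because the denominator $P(X)$ is the same for every candidate class, it can be ignored when comparing classes.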
The naive Bayes classifier makes two important assumptions. One is that the features X are conditionally independent given the class; that is, there's no relationship between the likelihoods of pairs or groups of features. That assumption is commonly violated in practice, but making it allows Bayes' rule to avoid the curse of dimensionality, where the size of the required training set often grows exponentially with the number of features in the model. The second assumption is that each feature follows some statistical distribution; in other words, we have to estimate the kernel density for those features.
There are three common distributions that are used in naive Bayes classification. When we have numerical features, it's common to assume a Gaussian distribution. If there are binary features, we may use a Bernoulli distribution. If there are discrete or count features, it's common to use a multinomial distribution.
Why do we have to make these statistical distribution assumptions? If we look again at the quantity we're trying to compute, it involves two terms. The first is the prior probability of Y; that's easy: we just figure out how often each class occurs in our data set. The second quantity we need is the probability of observing a set of X values given Y; unless we have an extremely large number of data points, we're not going to be able to compute this directly, and so we have to make some assumptions about this distribution.
This is where we’re going to make assumptions about the distribution of X, and specifically the distribution
of X conditioned on Y. The second assumption that we need to make is that those probabilities x given y are
independent. In other words, we can compute the joint probability distribution of those X’s conditioned on
Y by computing the probabilities of each individual feature X given Y and multiplying them together. This
independence assumption greatly simplifies our computation, so now all we need to do when maximizing
this posterior probability is to compute the following: we choose the value for y that maximizes the prior
probability of observing that value of Y times the probability of each X_i observation, given that particular
class value of Y.
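Written out, the naive Bayes prediction is
$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i} P(x_i \mid y).$$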
The Naive Bayes classifier has various advantages and disadvantages. It’s efficient and scalable: it’s a very
simple algorithm based essentially on counting frequencies of occurrence. It works on very small data sets.
It’s interpretable: each distribution is estimated as a one-dimensional distribution because the probabilities
of each feature occurring are assumed to be independent. Unfortunately, because the Naive Bayes classifier
assumes independence of features, it can’t learn relationships between those features, which may sometimes
be something you want to do in practice.
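A minimal scikit-learn sketch (an illustration, assuming a numeric feature matrix X and labels y, both hypothetical) that fits a Gaussian naive Bayes classifier:
# Hedged sketch: Gaussian naive Bayes classification.
# BernoulliNB and MultinomialNB follow the same pattern for binary and count features.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy: {}".format(nb.score(X_test, y_test)))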
Another way of performing classification or prediction is through a sequence of decisions based on features. Each decision, or step, subdivides the data into regions; the goal is to end up with regions that contain only data points of a single class. Training a decision tree involves choosing a sequence of decisions that subdivides the data into these regions. The model is called a tree because sequences of decisions are easy to interpret when depicted as trees. Once the tree is trained, classification is simple: just start at the root node and follow the links corresponding to the example you wish to classify until you reach a leaf node. Decision trees can be used for both classification and regression.
5.5.1 Training
The goal of training a decision tree is to train a balanced tree that has the minimal training error, in other
words, the minimal difference between the predicted classes in the training set and the true classes. Balancing
the tree reduces the computational complexity of the prediction process, because it reduces the maximum
number of questions, or splits, that are required to go from the root of the tree to a leaf. Unfortunately, the
problem of finding the optimally balanced tree for an arbitrary dataset is NP-complete. In practice, decision
trees rely on iterative algorithms that attempt to optimize for balance at each step, but do not guarantee that
the final tree is as balanced as possible.
One early decision tree algorithm, the classification and regression tree (CART) algorithm, iteratively selects a feature and finds the boolean comparison or numeric threshold that splits the examples in the training set as evenly as possible by number and as uniformly as possible by class. In other words, the algorithm attempts to choose a feature and a question/threshold that divides the examples into a left child node and a right child node. CART is a greedy algorithm: it starts at the root of the tree with all of the training examples and repeats the same process with all of the child nodes. This continues until each leaf node contains examples from a single class only.
The uniformity of a candidate split is typically measured with an impurity metric such as the Gini index or entropy; in practice, the two criteria usually produce very similar trees.
Unfortunately, decision trees can be prone to overfitting. As with k-nearest neighbors, decision trees are nonparametric, which means that they can be trained to fit the training data perfectly. In the limit, training will result in a decision tree where every leaf node has examples from only one class. One way to limit overfitting with decision trees is to set a maximum depth (or a minimum split) hyperparameter, which caps the depth of the decision tree. For any remaining nodes with training examples from more than one class, the mode of the classes in each node serves as the prediction label. Another approach to limiting overfitting is "pruning," which trains a complete decision tree and then removes splits that cause relatively small decreases in the cost function.
Decision trees require very little data pre-processing. It doesn't matter whether your features are numeric, binary, or nominal; you can still have conditions in the nodes that work for those feature types. For example, one node could split based on a numeric feature, such as whether packets_per_flow > 1. Another node could split on a nominal feature, such as "is there an ACK packet in the flow?" You don't need to do a one-hot encoding, and you don't need to do an ordinal encoding; you can just feed the features right into the tree training algorithm, and it will work no matter what format your features are in.
You also don’t need to do any standardization or normalization, as there’s no notion of this decision tree
being geometric, so we don’t need to ensure that our features are mean zero and variance one. Decision trees
are easily interpretable by humans. It is possible to look at a decision tree and understand how it arrived at
a particlar prediction.
Decision trees make it easy to compute and compare the relative importance of features. We often want to know which features of our dataset are particularly important for a particular classification. For example, we might want to know whether the number of packets in a flow is crucially important or peripheral to our problem. In addition to providing a better understanding of the model, this can also provide a better understanding of the underlying phenomenon. The sketch below shows one way to extract these importance scores from a trained tree.
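This sketch is an assumption: feature_names, X, and y are hypothetical placeholders for a dataset's feature names, features, and labels.
# Hedged sketch: depth-limited decision tree and per-feature importance scores.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
for name, importance in zip(feature_names, tree.feature_importances_):
    print("{}: {:.3f}".format(name, importance))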
If you can train one classifier, why not train more and improve your accuracy by combining their predictions? The core idea behind ensemble learning is that if you have a complex phenomenon that you're trying to understand, you can do a better job by training a collection of simpler models with different perspectives instead of a single complex model. This is analogous to the "wisdom of the crowd".
5.6.1 Voting
A “voting classifier” uses several different classes of models (e.g., decision tree, SVM, k-NN) and predicts the class that receives the majority vote among those models. If the phenomenon is complicated enough, there may not always be one algorithm that does best on new examples. So by having several algorithms try it, as long as the majority of them do the right thing, the ensemble can still give you the right answer in the end.
It is also possible to use the confidence of these models to weight the votes (a so-called “soft voting classi-
fier”).
The next approach, bagging and pasting, trains different instances of the same algorithm on different subsets of your training set. Both help to reduce classification variance by creating N new training sets that are all slightly different: bagging samples from the training set with replacement, while pasting samples without replacement. You train a different model on each set and use the majority vote prediction of all of these models.
Random forests are a particularly important version of bagging, in which you train many small decision trees limited to a maximum depth. (Decision trees limited to a single set of child nodes are called decision stumps.) Random forests have the distinction of being a very practical, high-performance algorithm. Random forests can compete with deep learning algorithms, especially on datasets that have obvious features. Deep learning really shines when you're given raw, unstructured data such as images or natural language, but if you're given a dataset with clear existing features, in many cases a random forest will do as well as a deep learning algorithm on that data. Random forests also have many fewer hyperparameters than a neural network, and they are robust to overfitting.
The Appendix provides an activity to train and evaluate an ensemble method (i.e., random forest) to
predict certain activities from Internet of things traffic.
Bagging and random forests are very amenable to parallelization: each sampled training set can be assigned to a different core (or machine in a data center), and all of the models can be trained in parallel, as in the sketch below.
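This is a minimal sketch (assuming a hypothetical feature matrix X and labels y) of a random forest trained using all available cores:
# Hedged sketch: random forest with parallel training across all cores (n_jobs=-1).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)
forest.fit(X, y)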
5.6.4 Boosting
Another ensemble method, called boosting, has a different motivation than bagging or random forests. Bagging, pasting, and random forests seek to reduce prediction variance. Boosting, however, attempts to reduce bias errors, which happen when you choose a model that is unable to represent the complexity of the data. In boosting, you train one algorithm to make a prediction and then train another algorithm to try to correct the prediction made by the first one. You can repeat this as many times as you like, so that by chaining simple models together you can end up with something quite complicated that is able to represent the data very well, even if the data itself is complex. The name comes from the fact that each successive classifier in the sequence is trying to boost the performance of the previous ones.
In gradient boosting, you start with your training data and train a model, usually a decision tree. This tree gets some of the training set predictions right and some of them wrong, allowing you to compute a residual between the predicted value and the correct value for each example. Then you train another decision tree to predict the residuals of the first tree. If that prediction is accurate, you can take the prediction of the first tree, correct for the error predicted by the second tree, and get the right prediction overall. You can also train a third tree to predict the error of the second tree, and so on.
Because boosting is inherently sequential, it is not that amenable to parallelized training. Typically, you make
each individual classifier very simple and fast to train, so the entire boosted classifier is also efficient.
Another type of boosting, AdaBoost, also uses sequential classifiers that try to improve each other’s perfor-
mance. The weight given to each example in the training set is increased if the previous classifier got that
example incorrect and decreased otherwise. This means that successive classifiers put more effort into cor-
rectly predicting examples that were missed by earlier classifiers. Each successive classifier is itself weighted
by how well it performs on the entire training set.
There is a proof that AdaBoost, combined with any weak learning algorithm (i.e., any classifier that does better than random guessing), will eventually produce a model that perfectly fits your training data. Empirically, it often improves test error as well.
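A minimal gradient boosting sketch in scikit-learn (an illustration; X and y are hypothetical, and the hyperparameter values are arbitrary):
# Hedged sketch: gradient-boosted trees, where each tree fits the residuals of its predecessors.
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbt.fit(X, y)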
In this section, we will begin our exploration of deep learning by discussing a particular type of neural
network architecture called feed forward neural networks, which are also sometimes referred to as multi-layer
perceptrons. But before we dive into the technical details of feed forward neural networks, it is essential to
understand the context of deep learning.
Deep learning is a subset of machine learning that is concerned with representation learning. Unlike tra-
ditional machine learning methods, where the features used as input to the model are manually defined by
the designer of the model, representation learning relies on the algorithm to learn the best representation of
the inputs. An example of a specific type of algorithm that does this is an autoencoder. The idea behind
representation learning is that the model should learn the best representation for the input, rather than the
designer of the algorithm having to figure out how to represent the inputs to the model.
The Appendix provides an activity to train and evaluate a deep learning model to predict attacks from
Internet of things devices, and compare the performance of that model to more conventional models.
Deep learning takes this concept of representation learning one step further by introducing many transformations, or layers, in the model; hence the name "deep" learning. The basic unit of deep learning is called a neuron, which takes a multidimensional input and applies weights to each input feature. The output of this computation is then passed through an activation function, which maps the weighted sum to the neuron's output, for example squashing it into the range [0, 1] or [-1, 1], or zeroing out negative values. Different shapes of activation functions can be used for this purpose.
Deep learning is concerned with representation learning and involves many transformations or layers in the
model. A feed forward neural network is a particular type of neural net architecture that is used in deep
learning. The training process of a neural net is iterative, involving both forward and backward propagation
through the network. The weights of the neural net are adjusted to reduce loss, and a typical training process
might include hundreds or even thousands of epochs.
A multi-layer perceptron takes multidimensional input features, applies weights to those inputs, passes each
input feature through a neuron, and ultimately aggregates the output of the hidden layer to a single neuron that
performs the prediction based on the output of an activation function. The deep learning training process
attempts to find good values for each of these weights in the network, and each neuron has an activation
function. The three most common shapes of activation functions are the sigmoid function, the hyperbolic
tangent function, and the rectified linear unit function.
Training the weights of a neural net is an iterative process that involves both forward and backward propagation through the network. Each epoch involves both forward propagation and backpropagation. In forward propagation, the network computes its output, which is evaluated against the true value of y to compute a loss or error function. Backpropagation then adjusts each of the weights in the neural net to correct for the resulting error. There is no closed-form optimal solution for the weights; generally, training is an iterative search process, and a typical training process might include hundreds or even thousands of epochs.
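To make the structure concrete, here is a minimal feed-forward network sketch in Keras (an assumption, not code from the accompanying notebooks); it presumes a feature matrix X with input_dim columns and binary labels y, all hypothetical.
# Hedged sketch: a small multi-layer perceptron for binary classification.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(32, activation='relu'),    # hidden layer
    layers.Dense(16, activation='relu'),    # hidden layer
    layers.Dense(1, activation='sigmoid')   # output neuron for binary prediction
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=100, batch_size=32)  # each epoch runs forward and backward passes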
CHAPTER SIX: UNSUPERVISED LEARNING
In this chapter, we introduce unsupervised learning, the process by which a machine learning model can
learn from unlabeled examples. The goal of unsupervised learning is to identify patterns in data that are
useful for understanding the data or processing the data further.
Most data in the world are unlabeled, including most network data. For example, packet captures do not
inherently include labels of perceived quality of service or presence of malware. In fact, network manage-
ment infrastructure makes it so easy to collect unlabeled data, including packet captures, flow records, BGP
tables, etc., that data interpretation becomes the primary challenge. Large and complex datasets can seem
overwhelming – anyone who has ever opened a spreadsheet file with thousands to millions of rows of num-
bers understands the futility of manual analysis of raw unlabeled data. Fortunately, unsupervised learning
provides many approaches for finding and highlighting important patterns in the data that can support a holis-
tic understanding of the data set and suggest deeper truths about the network or phenomenon that generated
the data. The prevalence of unlabeled data makes unsupervised learning a powerful tool for data analytics.
Throughout this chapter, we will describe a variety of unsupervised learning models, using networking ex-
amples as a guide. This book does not necessarily assume you’ve seen these models before, so readers who
are aiming to get basic intuition behind different models will find this chapter helpful. Readers who are al-
ready familiar with these models may still find these examples helpful, as they present cases where particular
models or types of models are suited to different problems, as well as cases in the networking domain where
these models have been successfully applied in the past.
We organize our discussion of unsupervised learning into the following categories: (1) dimensionality re-
duction (i.e., models that reduce the number of features in a data set to those most useful for a task); (2)
clustering (i.e., models that group data based on similarity); and (3) semi-supervised learning (i.e., models
that use unsupervised techniques to prepare data for supervised learning).
Networking data sets are often high dimensional, meaning that each example in the data set has many features. This is typically true regardless of the measurement approach used to collect the data set. Individual packets have many headers. IPFIX records have many fields for each recorded flow. Internet scan results contain a variety of information about each endpoint or IP address scanned. While high-dimensional data provides lots of information that is useful for machine learning, it also poses two distinct challenges.
First, high-dimensional data can’t be easily plotted on a 2D or 3D graph. While there are many ways to
squeeze a few more dimensions into a 2D or 3D graph (using color, non-traditional axes, simulated surfaces,
etc.), datasets with tens to hundreds of dimensions (or more!) cannot be graphed in their entirety in a concise,
interpretable format. This makes it difficult to explore the data visually and gain an intuition about impor-
tant patterns that may be essential to understanding the underlying meaning of the data. Producing useful
visualizations of high-dimensional data therefore requires reducing the number of features such that 2D or
3D visualizations are possible.
Second, the training time of most supervised and unsupervised machine learning models increases with the number of features in a data set. For very high dimensional data, it may be desirable to reduce the number of features as a preprocessing step to make training computationally feasible. While this may result in a reduction in model performance (more on this below), it is preferable to a model that can't be feasibly trained at all.
Dimensionality reduction algorithms can be used to address either of these challenges. These algorithms
work by removing or combining features to produce a new version of the dataset in a lower target dimen-
sionality while attempting to preserve important patterns within the data.
An essential fact about dimensionality reduction algorithms is that they are inherently lossy. In nearly all cases, reducing the dimensionality of a dataset will cause some of the information in the dataset to be lost. The greater the dimensionality reduction, the more information loss is likely to occur, as the algorithm is unable to pack the same amount of information into fewer dimensions. Some information-sparse datasets may not have this problem, but it is generally safe to assume that dimensionality reduction is a tradeoff between the information content of the dataset and the practicality of having lower dimensional data. This has two important implications:
1) Dimensionality reduction to only 2 or 3 dimensions for visual presentation often results in a large
amount of information loss. Therefore, the resulting visualizations may not provide useful insight
about the data or may erroneously imply that patterns within the data do not exist. A general rule of
thumb is that the patterns visible after dimensionality reduction to 2D or 3D are a subset of the patterns
that actually exist in the data.
2) Dimensionality reduction as a preprocessing step to improve the computational feasibility of other machine learning algorithms may result in worse performance of these algorithms (as measured by accuracy, precision/recall, etc.) if important information contained in the dataset is lost. Optimizing this tradeoff typically requires experimentation and will depend on available computing resources, the machine learning algorithm, and the dataset in question.
The design of dimensionality reduction algorithms has therefore focused on methods for identifying and
retaining the information that is most relevant to the task at hand while discarding only irrelevant details.
There are many dimensionality reduction algorithms, far more than we can discuss in this book, so we focus
on three commonly used algorithms that can be readily applied to network data.
The Appendix provides an activity to perform dimensionality reduction on a previous classification prob-
lem to reduce input complexity.
The goal of principal component analysis (PCA) is to transform the data to have a new, smaller set of features
derived from the original features. The PCA algorithm attempts to minimize the amount that individual data
points change as a result of the transformation, thereby reducing the amount of information lost as a result
of the dimensionality reduction. However, some modification of data points is typically necessary, so the
PCA algorithm makes these modifications such that it maximizes the variance of the data points in the new,
lower dimensionality. This keeps the data points “spread out” in the new, lower dimensional space, hopefully
retaining any important patterns present in their distribution in the original higher dimensionality.
The new set of features in the target dimensionality are called principal components. Each principal component for each data point is a combination of the data point's original features. Regular PCA limits the principal components to linear combinations of the original features, but alternatives, such as kernel PCA, can create principal components from non-linear combinations of the original features, better preserving non-linear relationships in the data.
PCA is both non-parametric and deterministic. This means that PCA does not have any parameters that re-
quire training and that independent applications of PCA to the same dataset with the same target dimension-
ality will produce the same results. Applying PCA requires choosing the target dimensionality and whether
you would like to use the linear or kernel version of the algorithm. Kernel PCA also requires the choice of a
kernel function, for which a polynomial or radial basis function (RBF) kernel is often a good place to start.
The following example shows the use of PCA to reduce a packet capture dataset from 5 dimensions (IP source
address, IP destination address, TCP source port, TCP destination port, and packet length) to 2 dimensions
for visualization on a 2D scatter plot.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
from netml.pparser.parser import PCAP
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()
pdf = pcap.df[['ip_src_int', 'port_src', 'ip_dst_int', 'port_dst', 'length']]
is_dns = pcap.df['is_dns']
dns_colors = pd.factorize(is_dns)[0]
flow_colors = pdf.apply(lambda row: hash(tuple(row)), axis=1)

# Standardize features so that no single feature dominates the projection
pdf = StandardScaler().fit_transform(pdf)

pca_linear = PCA(n_components=2)
pca_kernel = KernelPCA(n_components=2, kernel="rbf")
linear_result = pca_linear.fit_transform(pdf)
# The remainder of the listing was elided in the original; the kernel projection
# and side-by-side scatter plots below are a reconstruction.
kernel_result = pca_kernel.fit_transform(pdf)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(linear_result[:, 0], linear_result[:, 1], c=dns_colors, s=10)
ax2.scatter(kernel_result[:, 0], kernel_result[:, 1], c=dns_colors, s=10)
ax1.set_title("Linear PCA")
ax2.set_title("Kernel PCA")
plt.savefig("unsupervised_pca.png")
plt.show()
T-distributed stochastic neighbor embedding (T-SNE) is a dimensionality reduction algorithm that typically
produces much cleaner visualizations in two or three dimensions than PCA. T-SNE is particularly useful
when you want to visualize your data to gain intuition about underlying patterns that might prove informative
for supervised models or clustering.
T-SNE uses probability distributions to spread out dissimilar points while keeping similar points near each other in the target dimensionality. The algorithm involves three main steps:
1. Fitting a normal (Gaussian) distribution to the distances between pairs of points in the original data
2. Mapping the normal distribution in the original high-dimensional space to a T-distribution in the target dimensional space that minimizes the divergence between the distributions
3. Selecting new locations for the points in the target dimensional space by drawing from this T-distribution
Because T-distributions have more probability density in the tails than a normal distribution, this spreads out dissimilar points in the target dimensionality while keeping similar points in closer proximity. Visualizations produced using T-SNE show distinct clustering if such structure exists in the original high dimensional data.
T-SNE is a non-parametric and stochastic dimensionality reduction algorithm. This means that T-SNE does
not have any parameters that require training. However, unlike PCA, the use of a randomized draw to place
data points in the target dimensionality space means that independent applications of T-SNE to the same data
set may produce different results. In practice, you can force T-SNE to produce the same results consistently
by fixing the seeds used for random number generation.
The following example shows the use of T-SNE to reduce a packet capture dataset from four dimensions (IP source address, IP destination address, TCP source port, and TCP destination port) to two dimensions for visualization on a 2D scatter plot. The points in the plot are colored by packet flow (left) and by whether each data point corresponds to a DNS packet (right). As seen, T-SNE is able to produce 2D data in which the similarity between DNS packets is retained, as is the similarity between data points corresponding to the several flows with the most packets.
import pandas
from netml.pparser.parser import PCAP
from sklearn.manifold import TSNE

pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()
pdf = pcap.df[['ip_dst_int', 'port_dst', 'ip_src_int', 'port_src']]
is_dns = pcap.df['is_dns']

tsne = TSNE(n_components=2)
result = tsne.fit_transform(pdf)

dns_colors = pandas.factorize(is_dns)[0]
flow_colors = pdf.apply(lambda row: hash(tuple(row)), axis=1)
6.1.3 Autoencoders
Autoencoders are unsupervised neural network models that perform dimensionality reduction.
An autoencoder network has input and output layers that are the same size as the number of features in the
data. The intermediate layers of the network have an “hourglass” shape. The “encoder” half of the network
has decreasing numbers of nodes from the input layer to a central “encoding” layer. The “decoder” half of
the network has increasing numbers of nodes from the encoding layer to the output layer. This reduction in layer size forces information loss as each data point passes through the autoencoder. In effect, each data point is squeezed down to fit through the smaller intermediate layers, reaching its most compressed form at the encoding layer.
Autoencoders are trained to reproduce input examples as closely as possible in their output. In other words,
the same data is used as both the training examples and the training labels. The training process therefore in-
centivizes the network to find parameters such that the encoding layer retains the most important information
about the input features. This allows the “decoder” half of the network to use this information to reconstruct
the data points with as much fidelity as possible at the output layer. The size of the encoding layer is selected
beforehand to match the target dimensionality.
Once the network is trained, the “encoder” half of the network can be used for dimensionality reduction. Data
points are fed to the input layer and the result at the encoding layer is the low dimensional representation.
The “decoder” half of the network is not needed for this process – it is only used during training.
Unlike regular PCA, autoencoders can discover highly nonlinear relationships between features in the original
dataset and use these relationships to find good lower-dimensionality representations. Unlike PCA and T-
SNE, autoencoders are parametric, meaning that they require training. In practice, autoencoders are more
frequently used to reduce the number of features to make training computationally feasible rather than to
produce a 2D or 3D version of the data for visualization.
The success of autoencoders has resulted in the creation of many autoencoder variants, including variational
autoencoders, sparse autoencoders, denoising autoencoders, and others, each designed for specific tasks
and/or data types. These are rarely used for network data, but may be applicable in some situations.
The following example shows the use of an autoencoder to reduce a packet capture dataset from four dimensions (IP source address, IP destination address, TCP source port, and TCP destination port) to 2 dimensions for visualization on a 2D scatter plot.
# Imports for this example (Keras via TensorFlow; PCAP is the packet-parsing
# helper used throughout this book's notebooks).
import pandas
from tensorflow import keras
from tensorflow.keras import layers, regularizers

pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()
pdf = pcap.df[['ip_dst_int', 'port_dst', 'ip_src_int', 'port_src']]
is_dns = pcap.df['is_dns']

input_dim = 4
encoding_dim = 2

# "Encoder" half: 4 -> 3 -> 2 nodes, with a small L1 activity regularizer on
# the encoding layer.
encoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(3, activation='relu'),
    layers.Dense(encoding_dim, activation='relu',
                 activity_regularizer=regularizers.l1(10e-6))
])

# "Decoder" half (not shown in the original excerpt): mirror the encoder back
# up to the original number of features.
decoder = keras.Sequential([
    keras.Input(shape=(encoding_dim,)),
    layers.Dense(3, activation='relu'),
    layers.Dense(input_dim, activation='linear')
])

autoencoder = keras.Sequential([
    encoder,
    decoder
])

# Train the autoencoder to reproduce its input (the same data serves as both
# examples and labels); the compile/fit steps are not shown in the original
# excerpt. Only the encoder is then used for dimensionality reduction.
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(pdf, pdf, epochs=10)
data_encoded = encoder.predict(pdf)

dns_colors = pandas.factorize(is_dns)[0]
flow_colors = pdf.apply(lambda row: hash(tuple(row)), axis=1)
(Training output, abridged: Keras reports the reconstruction loss after each of the 10 training epochs, and the loss decreases steadily as the autoencoder learns to reproduce its input.)
6.2 Clustering
Clustering algorithms group data points by similarity, identifying latent structure in the dataset.
Clustering algorithms are extremely useful for data exploration, as understanding a data set often requires
understanding similarities among groups of data points. For example, it might be valuable to know that a
dataset of network flows can be naturally grouped into two clusters: “elephant” flows consuming lots of band-
width and “mouse” flows consuming relatively little bandwidth. Similarly, it might be valuable to learn that
a dataset of packets naturally clusters into 3 groups: network configuration packets, user application packets,
and malicious packets. If the clusters found by a clustering algorithm do not match your understanding of
the data, it may be that there is something more interesting going on in the data set that motivates further
exploration.
Clustering algorithms are also useful for anomaly detection, a machine learning task that involves identifying
anomalous data points that are dissimilar to most other points in the data set. Anomalous points are outside
or far from the center of the clusters that define the rest of the data. This is practically relevant for security
tasks, as anomalous packets or flows might be due to novel network attacks.
When using clustering algorithms in practice, it is essential to remember that clustering algorithms are un-
supervised, meaning that the clusters they identify do not necessarily correspond to meaningful patterns in
the data. Rather, they are a data exploration tool, and it is the responsibility of the user to investigate any
identified clusters to see if they provide useful insights about the data.
There are many clustering algorithms, far more than we can discuss in this book, so we focus on three
commonly used algorithms that can be readily applied to network data.
Activity: Clustering
The Appendix provides an activity to apply different clustering algorithms on a network traffic trace that
contains both benign and attack traffic.
6.2.1 K-Means
K-means is a fairly simple algorithm that groups a dataset into K clusters by identifying K “centroid” locations that define the center of each cluster. Each cluster consists of the data points that are closer to the centroid of that cluster than to the centroid of any other cluster. The algorithm works as follows (a minimal sketch of these steps appears after the list):
1. Choose a target number of clusters K
2. Choose K random data points as starting locations for the centroids
3. Assign all other points in the data set to the cluster with the closest centroid
4. Update the centroids to the average locations of each of the points in their cluster
5. Repeat steps 3 and 4 until the centroid locations stop changing
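The following is a minimal NumPy implementation of these five steps, intended purely to illustrate the algorithm; in practice you would use a library implementation such as scikit-learn's KMeans, shown later in this section.

import numpy

def kmeans(data, k, max_iters=100, seed=0):
    rng = numpy.random.default_rng(seed)
    # Step 2: choose K random data points as the starting centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to the cluster with the closest centroid.
        distances = numpy.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (a production implementation would also handle empty clusters).
        new_centroids = numpy.array([data[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop when the centroid locations no longer change.
        if numpy.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids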
This algorithm is fast and always converges. However, it has drawbacks that can limit its applicability. Most
importantly, you have to choose the number of clusters. This can be straightforward if you have existing
knowledge about the structure of the dataset. For example, if you have a dataset of IP packets that you want
to cluster into TCP and UDP traffic, you would choose K=2 and then check whether the discovered clusters
actually match these protocols.
If you don’t know the number of clusters, you can run K-means with increasing cluster numbers to see which
produces the cleanest clustering, but you might be better off choosing a different algorithm that does not
require an a priori choice of the number of clusters.
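One common heuristic for that search is to plot the within-cluster sum of squares for a range of K values and look for an “elbow” where adding clusters stops helping much. A sketch using scikit-learn's KMeans and its inertia_ attribute (the silhouette score in sklearn.metrics is another option) looks like this:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# `data` is assumed to be a standardized feature matrix, as in the example
# that follows this discussion.
inertias = [KMeans(n_clusters=k, n_init=10).fit(data).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()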
K-means also performs poorly for non-spherical clusters or clusters of varying density (where some clusters
have points that are much more similar to each other than the points in other clusters). If your data falls into
either of these categories, you might also be better off choosing a different algorithm.
When using K-means, it is important to standardize your data prior to applying the clustering algorithm.
Standardization centers and rescales each feature in the data to have a mean of 0 and a variance of 1 across
all data points. This ensures that the K-means algorithm gives each feature equal importance in the clustering
process.
The following example shows the use of K-means to cluster a packet capture dataset into 2 clusters. This particular dataset was collected during a simulated (CHECK) TCP flood attack, and the two clusters identified by the algorithm correspond nearly perfectly to attack traffic and benign traffic. The figure uses kernel PCA to display the originally 5-dimensional data in 2 dimensions.

import matplotlib
import matplotlib.pyplot as plt
import pandas
import numpy
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.decomposition import KernelPCA

data = pandas.read_csv("../examples/notebooks/data/tcp_flood.csv")
data = data[["Source", "Destination", "Src_port", "Dst_port", "Length"]]
data = OrdinalEncoder(encoded_missing_value=-1).fit_transform(data)
data = data[0:20000,:]

# Standardize so that each feature contributes equally to the distance
# computations used by K-means.
data = StandardScaler().fit_transform(data)
kmeans = KMeans(n_clusters=2).fit(data)

# Project the 5-dimensional data down to 2 dimensions with kernel PCA purely
# for plotting; this projection step is implied by the figure but not shown in
# the original excerpt.
pca = KernelPCA(n_components=2).fit_transform(data)

fig = plt.figure(figsize=(4,4))
plt.scatter(pca[:,0], pca[:,1], s=8, c=kmeans.labels_,
            cmap=matplotlib.colors.ListedColormap(['blue', 'red']))
plt.tick_params(axis='both', labelsize=0)
plt.title("KMeans Result: Packets Colored by Cluster")
plt.savefig("unsupervised_kmeans.png")
plt.show()
6.2.2 Gaussian Mixture Models

Gaussian mixture models are an alternative to K-means that define clusters not just by their center point (centroid) but also by the variance of the distribution of points in each cluster. This assumes that the locations of points in each cluster follow a normal (Gaussian) distribution. While this may not be strictly true for some data sets, it is often a good approximation for large data sets due to the central limit theorem.
The process of applying Gaussian mixture models (GMMs) is fairly similar to K-means. You must choose a number of clusters (or repeat the algorithm iteratively with different numbers of clusters), and the algorithm
will find a normal distribution corresponding to each cluster with a mean and variance that best fits the data.
Gaussian mixture models can also be used to generate new data by drawing new data points from the normal
distributions corresponding to each cluster. This allows you to create new data with similar characteristics
as your training data, which can be useful for augmenting a data set to provide enough data for training a
supervised algorithm.
The example below shows the use of GMMs to generate new data that is similarly distributed to an existing packet capture data set. The figure uses kernel PCA to display the original features in 2 dimensions.

from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

pcap = PCAP('../examples/notebooks/data/netflix.pcap')
pcap.pcap2pandas()
# Feature selection is implied but not shown in the original excerpt.
pdf = pcap.df[['ip_dst_int', 'port_dst', 'ip_src_int', 'port_src']]
pdf_standardized = StandardScaler().fit_transform(pdf)
gmm = GaussianMixture(n_components=10).fit(pdf_standardized)
# Draw 1,000 synthetic samples from the fitted mixture.
samples = gmm.sample(1000)
6.2.3 DBSCAN

DBSCAN (density-based spatial clustering of applications with noise) uses data point density to identify clusters, similarly to how humans visually identify clusters of points on a plot. High-density groups of points (groups with relatively many points a relatively small distance from each other) become clusters. These clusters are defined by core examples and a neighborhood distance.
DBSCAN has a lot of advantages. It does not force you to choose the number of clusters beforehand; it
will find as many groups of nearby dense points as it can. It also works well for clusters that are not spherical.
DBSCAN is frequently used for anomaly detection, because it can automatically identify points that don’t
fit into any existing clusters. This is very useful in networking problems, such as malicious traffic detection,
where identifying unusual examples is valuable.
DBSCAN has some disadvantages due to its dependency on data density. If you have some clusters that are
tightly packed and other clusters that are more spread out, DBSCAN may be unable to achieve the desired
clustering. DBSCAN can also struggle with high-dimensional data because the “curse of dimensionality” means that all data points appear far apart in high-dimensional space.
The example below shows the use of DBSCAN to cluster data from an existing packet capture data set. DBSCAN identifies the clusters itself, without an a priori choice of the number of clusters. The figure uses kernel PCA to display the original features in 2 dimensions.

from sklearn.cluster import DBSCAN
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

pcap = PCAP('../examples/notebooks/data/netflix.pcap')
pcap.pcap2pandas()
# The feature selection and the kernel PCA projection used for plotting are
# implied but not shown in the original excerpt.
pdf = pcap.df[['ip_dst_int', 'port_dst', 'ip_src_int', 'port_src']]
pdf_standardized = StandardScaler().fit_transform(pdf)
dbscan = DBSCAN().fit(pdf_standardized)
pca = KernelPCA(n_components=2).fit_transform(pdf_standardized)

fig = plt.figure(figsize=(4,4))
plt.scatter(pca[:,0], pca[:,1], c=dbscan.labels_, s=8, cmap="tab20")
plt.title("DBSCAN Result: Packets Colored by Cluster")
plt.tick_params(axis='both', labelsize=0)
plt.savefig("unsupervised_dbscan.png")
plt.show()
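Because DBSCAN assigns the special label -1 to points that do not fall in any dense region, extracting candidate anomalies from a fitted model is straightforward. The sketch below uses scikit-learn's default eps and min_samples values, which are illustrative rather than tuned for this dataset:

from sklearn.cluster import DBSCAN

# Points labeled -1 are "noise": they are not within the neighborhood
# distance (eps) of enough other points (min_samples) to join any cluster,
# which makes them natural anomaly candidates.
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(pdf_standardized)
anomalies = pdf_standardized[dbscan.labels_ == -1]
print(f"{len(anomalies)} of {len(pdf_standardized)} packets flagged as anomalous")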
6.2.4 Hierarchical Clustering

Hierarchical clustering algorithms construct a “dendrogram,” or tree diagram, that illustrates how examples can be progressively grouped based on similarity. This provides a nice visualization of your dataset, indicating which points are more closely related than others. You can choose which similarity metric is used to construct the dendrogram (Euclidean distance is a common choice) based on how you want to interpret data point similarity. For example, you might want to hierarchically cluster a packet capture dataset based on the proximity of packets in the IP address space. In this case, you could choose the Hamming distance metric to measure the number of bit positions in which IP addresses differ.
If you want to create a specific set of clusters from a hierarchical dendrogram, you can divide the tree at
a specific similarity threshold. All data points at least that similar to each other are then part of the same
cluster.
The following example shows the use of hierarchical clustering with a Euclidean distance metric to visualize
the similarity of data points in a packet capture dataset collected during a simulated (CHECK) TCP flood
attack. The dendrogram clearly shows that there are two distinct groups of packets that are more similar to
other packets in the same group than to packets in the other group. These two groups correspond to attack
versus benign traffic (CHECK). The dendrogram also shows that there is more similarity structure within
each of the two main clusters, which could be a target of further analysis.
data = pandas.read_csv("../examples/notebooks/data/tcp_flood.csv")
data = data[["Source", "Destination", "Src_port", "Dst_port", "Length"]]
data = OrdinalEncoder(encoded_missing_value=-1).fit_transform(data)
data = data[0:20000,:]

# Here, model is a fitted hierarchical (agglomerative) clustering model and
# counts holds the number of samples merged under each node of the tree; both
# are computed in the full notebook but not shown in this excerpt. See the
# sketch below for one way to produce them.
linkage_matrix = numpy.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)
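For reference, one way to produce the model and counts used above and to render the dendrogram is the standard scikit-learn/SciPy recipe sketched below; this is not necessarily the exact code used to produce the book's figure.

import numpy
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

# Fit the full tree (no fixed number of clusters) so that merge distances are
# available for the dendrogram.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(data)

# Count how many original samples sit under each merge in the tree.
counts = numpy.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
for i, merge in enumerate(model.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)

linkage_matrix = numpy.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# Plot only the top few levels of the tree to keep the figure readable.
dendrogram(linkage_matrix, truncate_mode="level", p=4)
plt.show()

If you instead want a flat set of clusters cut at a chosen distance threshold, you can pass that threshold as distance_threshold (with n_clusters=None) to AgglomerativeClustering, which then labels each point with the cluster it falls into below that cut.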
6.3 Semi-Supervised Learning

Semi-supervised learning leverages unsupervised learning to speed up the process of identifying labels for a supervised model. In nearly all fields of ML, manual labeling is tedious. This is especially true for network data: imagine going through a packet capture dataset to manually apply a label to every individual packet. Semi-supervised learning allows you to combine a relatively small number of manual labels with a clustering algorithm to produce a fully labeled dataset.
Semi-supervised learning starts by applying a clustering algorithm to group the unlabeled training data. You then
manually label a few randomly selected points from each cluster and propagate the most frequent manual
label in each cluster to the other points in the cluster. This gives you a fully labeled data set even though you
only had to manually label a few data points.
Ideally, the clustering algorithm produces clusters in which all points are from the same class. In practice,
some clusters may have examples from multiple classes. You can perform semi-supervised learning recur-
sively to address this issue by running the clustering algorithm on individual clusters to identify sub-clusters,
manually labeling some randomly selected points in each sub-cluster, and propagating the most common la-
bels to all points in the sub-cluster. This can be repeated until all manually labeled points in the same cluster
are in the same class or until a desired preponderance of points are in the same class.
TODO: EXAMPLE OF SEMI-SUPERVISED LEARNING
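As a minimal sketch of this workflow (assuming a feature matrix X, K-means as the clustering step, and a hypothetical manual_label() function standing in for a human annotator), the label propagation step might look like the following:

import numpy
from sklearn.cluster import KMeans

def propagate_labels(X, n_clusters, n_manual_per_cluster, manual_label):
    """Cluster X, manually label a few points per cluster, and propagate the
    most common manual label in each cluster to all of its points."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    labels = numpy.empty(len(X), dtype=object)
    rng = numpy.random.default_rng(0)
    for cluster in range(n_clusters):
        members = numpy.where(kmeans.labels_ == cluster)[0]
        # Manually label a small random sample from this cluster.
        sample = rng.choice(members, size=min(n_manual_per_cluster, len(members)),
                            replace=False)
        manual = [manual_label(X[i]) for i in sample]
        # Propagate the most frequent manual label to the whole cluster.
        most_common = max(set(manual), key=manual.count)
        labels[members] = most_common
    return labels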
SEVEN
7.1 Background
Large language models have gained increasing prevalence in many aspects of society today, with interactive
sessions such as OpenAI’s ChatGPT and Google’s Bard allowing users to query and receive answers for a
diverse set of general downstream tasks. Although they are highly developed and tailored, the underlying
architecture of the models for systems such as GPT and LaMDA is based on a transformer model. In such a
model, each word in a given input is represented by one or more tokens, and each input is represented with
an embedding, a high-dimensional representation of the input that captures its semantics. Self-attention is
then used to assign a weight for each token in an input based on its importance to its surrounding tokens.
This mechanism allows transformer models to capture fine-grained relationships in input semantics that are
not captured by other machine learning models, including conventional neural networks.
The use of large language models in computer networks is rapidly expanding. One area where these models show promise is network security and performance troubleshooting. In this chapter, we will explore some of these early examples in more detail, as well as discuss some of the practical hurdles to deploying LLMs in production networks.
Large language models typically operate on a vocabulary of words. Since this book is about applications
of machine learning to networking, ultimately the models we work with will operate on network data (e.g.,
packets, elements from network traffic), not words or text in a language. Nonetheless, before we talk about
applications of LLMs to networking, it helps to understand the basic design of LLMs and how they operate
on text data. We will provide this overview by providing background on two key concepts in LLMs: vectors
and transformers.
7.1.1 Vectors
Language models represent each word as a long array of numbers called a word vector. Each word has a
corresponding word vector, and each word thus represents a point in a high-dimensional space. This repre-
sentation allows models to reason about spatial relationships between words. For example, the word vector
for “cat” might be close to the word vector for “dog”, since these words are semantically similar. In contrast,
the word vector for “cat” might be far from the word vector for “computer”, since these words are semanti-
cally different. In the mid-2010s, Google’s word2vec project led to significant advances in the quality of word
vectors; specifically, these vectors allowed various semantic relationships, such as analogies, to be captured
in the spatial relationships.
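As a toy illustration (with made-up three-dimensional vectors standing in for real word vectors, which typically have hundreds of dimensions), closeness in this space can be measured with cosine similarity:

import numpy

def cosine_similarity(a, b):
    return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

# Hypothetical, hand-picked vectors purely for illustration.
cat = numpy.array([0.9, 0.8, 0.1])
dog = numpy.array([0.85, 0.75, 0.2])
computer = numpy.array([0.1, 0.2, 0.95])

print(cosine_similarity(cat, dog))       # high: semantically similar words
print(cosine_similarity(cat, computer))  # low: semantically dissimilar words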
While word vectors, and simple arithmetic operations on these vectors, have turned out to be useful
for capturing these relationships, they missed another important characteristic, which is that words can
change meaning depending on context (e.g., the word “sound” might mean very different things depend-
ing on whether we were talking about a proof or a musical performance). Fortunately, word vectors have
also been useful as input to more complex large language models that are capable of reasoning about the
meaning of words from context. These models are capable of capturing the meaning of sentences and para-
graphs, and are the basis for many modern machine learning applications. LLMs comprise many layers of
transformers, a concept we will discuss next.
7.1.2 Transformers
The fundamental building block of a large language model is the transformer. In large language models,
each token is represented as a high-dimensional vector. In GPT-3, for example, each token is represented
by a vector of nearly 13,000 dimensions. The model first applies what is referred to as an attention layer to
assign weights to each token in the input based on its relationships to the tokens in the rest of the input. In
the attention layer, so-called attention heads retrieve information from earlier words in the prompt.
Second, the feed-forward portion of the model then uses the results from the attention layer to predict the next
token in a sequence given the previous tokens. This process is accomplished using the weights calculated
by the self-attention mechanism to calculate a weighted average of the token vectors in the input. This
weighted average is then used to predict the next token in the sequence. The feed-forward layers in some sense represent a database of information that the model has learned from the training data; feed-forward layers effectively encode relationships between tokens as seen elsewhere in the training data.
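A stripped-down sketch of this attention computation (a single attention head, in NumPy, with random matrices standing in for the learned projection weights) looks like this:

import numpy

def attention(X, d_k):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    rng = numpy.random.default_rng(0)
    d_model = X.shape[1]
    # Learned projection matrices (random here, purely for illustration).
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Each token's weight over every other token, normalized with a softmax.
    scores = Q @ K.T / numpy.sqrt(d_k)
    weights = numpy.exp(scores) / numpy.exp(scores).sum(axis=1, keepdims=True)
    # Each output vector is a weighted average of the value vectors.
    return weights @ V

# Example: a "sequence" of 5 tokens, each a 16-dimensional vector.
X = numpy.random.default_rng(1).standard_normal((5, 16))
print(attention(X, d_k=8).shape)  # (5, 8)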
Large language models tend to have many sets of attention and feed-forward layers, resulting in the ability to
make fairly complex predictions on text. Of course, network traffic does not have the same form or structure
as text, but if packets are treated as tokens, and the sequence of packets is treated as a sequence of tokens,
then the same mechanism can be used to predict the next packet in a sequence given the previous packets.
This is the basic idea behind the use of large language models in network traffic analysis.
A key distinction of large-language models from other types of machine learning approaches that we’ve read
about in previous chapters is that training them doesn’t rely on having explicitly labeled data. Instead, the
model is trained on a large corpus of text, and the model learns to predict the next word in a sequence given
the previous words. This is, in some sense, another form of unsupervised learning.
Transformers tend to work well on problems that (1) can be represented with sequences of structured input;
and (2) have large input spaces that any one feature set cannot sufficiently represent. In computer networking,
several areas, including protocol analysis and traffic analysis, bear some of these characteristics. In both
of these cases, manual analysis of network traffic can be cumbersome. Yet, some of the other machine
learning models and approaches we have covered in previous chapters can also be difficult for certain types
of problems. For example, mappings of byte offsets or header fields and their data types for all protocols, as well as all of the values a field may take, may yield prohibitively large feature spaces. Detecting and mitigating protocol misconfiguration, for instance, can be well-suited to transformer models, where small nuances, interactions, or misinterpretations of protocol settings can lead to complicated corner cases and unexpected behavior that may be challenging to encode in either static rule sets or formal methods approaches.
BERT is a popular transformer-based model that has been successfully extended to a number of domains, with modifications to the underlying vocabulary used during training. At a high level, BERT operates in two phases: pre-training and fine-tuning. In the pre-training phase, BERT is trained over unlabeled input using two self-supervised tasks (masked token prediction and next-sentence prediction) that verify its understanding of the input. After pre-training, BERT models
may then be fine-tuned with labeled data to perform tasks such as classification (or, in other domains, text
generation) that have the same input format.
In recent years, transformer-based models have been applied to large text corpora to perform a variety of
tasks, including question answering, text generation, and translation. On the other hand, their utility outside
of the context of text—and especially in the context of data that does not constitute English words—remains
an active area of exploration.
7.2 Large Language Models in Networking

The utility of large language models for practical network management applications is an active area of research. In this section, we will explore several early-stage examples of the use of large language models for a number of downstream applications.
We will explore a recent example from Chu et al., who explored the use of large language models to detect
vulnerable or misconfigured versions of the TLS protocol. In this work, BERT was trained using a dataset
of TLS handshakes.
A significant challenge in applying large language models to network data is to build a vocabulary and cor-
responding training set that would allow the model to understand TLS handshakes. This step is necessary,
and important, because existing LLMs are typically trained on text data, and the vocabulary used in these
models is typically based on the vocabulary of the English language. The first step was therefore to build a vocabulary suited to TLS handshakes. In this case, the input to the model is a concatenation of values in the headers of the server_hello and
server_hello_done messages, as well as any optional server steps in the TLS handshake. The resulting input
was normalized (i.e., to lowercase ASCII characters) and tokenized.
The resulting trained model was evaluated against a set of labeled TLS handshakes, with examples of known
misconfigurations coming from the Qualys SSL Server Test website. The model was able to correctly identify
TLS misconfigurations with near-perfect accuracy.
There have been a number of examples that apply various LLM architectures to generating synthetic network traffic data. A significant, domain-specific challenge in applying LLMs to this task is the resource complexity of these models. Recall that the deep models used for these problems rely on maintaining and updating a compressed context or representation of the input data (attention embeddings for transformers, state for state-space models). This is largely acceptable in the natural language domain, where localized information (e.g., a paragraph of text) can be captured with a context that remains computationally tractable (e.g., ~16,000 tokens). Unfortunately, this is not the case in the networking domain, where even seconds of communication between hosts can span hundreds or thousands of packets, depending on the network. Thus, there are a number of design choices that must be considered when approaching this problem.
We will discuss three notable examples for this problem: Traffic GPT, NetMamba, and NetDiffusion, which
are based on the Transformer, Mamba, and Denoising Diffusion architectures/approaches, respectively. All
three approaches output fine-grained, raw networking data that is parsable by common packet-level analysis
tools (e.g., Wireshark). The resulting data from these models can then be used to inform a number of down-
stream applications (e.g., augmenting training data for models, verifying traffic classification performance,
etc.)
Traffic GPT
NetMamba
NetDiffusion
Large language model architectures have also been applied to the problem of predicting flow-level statistics for varying workloads and network topologies. Specifically, we will highlight in this section the recent m3 system, which uses a transformer to create a “context” (i.e., an embedding) that captures the network dynamics influencing the flow completion times of incoming flows.
The resulting m3 system outperforms the previous, non-transformer-based approach in both speed and accuracy.
m3
Large language models for traffic classification have emerged as one of the more popular research directions in recent work. A number of transformer-based approaches have achieved state-of-the-art classification performance on both unencrypted and encrypted traffic, across varying protocols, settings, and workloads. We will examine a number of efforts that both pretrain large language models from scratch and fine-tune existing “off the shelf” models for networking tasks.
EIGHT
In this chapter, we introduce reinforcement learning, the process by which a machine learning model acts in
an environment and learns to improve its performance based on feedback. The goal of reinforcement learning
is to identify a strategy for choosing actions that will maximize the model’s expected rewards over time.
Unlike supervised learning, which requires pairs of examples and correct labels, reinforcement learning
only requires feedback when the model does something relevant to the task. Unlike both supervised and
unsupervised learning, many reinforcement learning models are able to explore their environment, making
tradeoffs between searching for new approaches and optimizing existing knowledge.
Throughout this chapter, we will describe the fundamental concepts of reinforcement learning as well as
several reinforcement learning models, using networking examples as a guide. We will focus on a class of
reinforcement learning models called Q-learning, as these models are relatively straightforward and perform
well on many diverse tasks.
We do not assume you’ve seen these models before, so readers who are aiming to get the basic intuition
behind different models will find this chapter helpful. Readers who are already familiar with these models
may still find the examples in this chapter helpful, as they present cases where particular models or types of
models are suited to different problems, as well as cases in the networking domain where these models have
been successfully applied in the past.
Reinforcement learning models are set in an environment that defines the possible states where the model can
exist, what actions the model can take, and the results of those actions.
The notion of a state is flexible and readily adapted to different problems. Generally, a state is a unique
configuration of the information available to and stored by the model that is relevant to the task. States can
include input values, internal data structures, measurement data, etc. The programmer of a reinforcement
learning model will typically have to define the meaning of a state for the particular task. In the networking
context, states might be routing configurations and user quality-of-experience ratings, sets of firewall rules
and traffic volumes, etc. Importantly, the current state should contain enough information for the model to
decide what to do next without needing to refer back to previous states. This constraint is known as the
Markov assumption and is essential for the reinforcement learning algorithms we discuss below. If historical
information is needed for the task, the data structure for the current state should include this information.
The concept of actions is similarly flexible, including all outputs, actuation, and behaviors that the model can
take. Actions are typically programmer-defined and determine the possible ways that the model can affect
itself and its environment. Different problems usually require different sets of actions. In the networking
context, actions might include advertising a new BGP route announcement, modifying a forwarding rule on
a software-defined switch, removing an IP address from a firewall block list, etc.
Given these definitions, we can formulate an environment as a graph that connects states (vertices) with
actions (edges). This formulation is called a Markov decision process, and is a common way to represent
a problem for reinforcement learning models. For every combination of two states and an action, there is
a transition probability. If the model is in the first state and takes the action, this is the probability that it
will end up in the second state. If the environment is deterministic, the transition probabilities are all either
0 or 1, indicating that there is no uncertainty about the results of actions. Real-world environments are
typically nondeterministic or are deterministic but rely on values inaccessible to the model. In these cases,
the transition probabilities are real-valued and reflect the fact that actions have uncertain outcomes from the
model’s perspective.
Many tasks can be expressed in this framework. For example, we could imagine a reinforcement learning
model for optimizing forwarding rules on a SDN switch. The states would be the current table of forwarding
rules and the current measured traffic volumes. The possible actions could be adding or removing a rule
from the table. All transition probabilities would be 0 or 1, because the model is always successful in adding
or removing a rule unless the table is empty or full.
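As a concrete illustration, a tiny, hypothetical version of this switch environment can be written down as a table of transition probabilities keyed by (state, action) pairs:

# A toy, deterministic Markov decision process for a switch with room for at
# most one forwarding rule. States name the table contents; actions add or
# remove the rule. All transition probabilities here are 1.0.
transitions = {
    ("empty_table", "add_rule"):       {"rule_installed": 1.0},
    ("rule_installed", "remove_rule"): {"empty_table": 1.0},
    # Actions that cannot change anything leave the state unchanged.
    ("empty_table", "remove_rule"):    {"empty_table": 1.0},
    ("rule_installed", "add_rule"):    {"rule_installed": 1.0},
}

def transition_probability(state, action, next_state):
    return transitions.get((state, action), {}).get(next_state, 0.0)

print(transition_probability("empty_table", "add_rule", "rule_installed"))  # 1.0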
Unfortunately, many real-world problems exist in environments with combinatorial or infinite numbers of states or actions. This makes reinforcement learning more challenging, as models in these circumstances are not able to maintain a data structure with information about all possible scenarios. Whether or not a particular reinforcement learning algorithm can handle environments with large state or action spaces must be considered during the model selection process, as we will see below.
In addition to the environment (consisting of states, actions, and transition probabilities), two additional
values are essential to guiding the learning process of reinforcement learning models:
A reward function provides positive feedback to the model when it does something that is useful for the task
and negative feedback when it does something that is counterproductive to the task. For every combination
of two states and an action, the reward function defines a (positive or negative) “reward” that the model
should receive if it starts in the first state, takes the action, and ends up in the second state. The creation of
an effective reward function is essential to the learning process, as positive and negative rewards guide the
model towards learning a strategy to perform the desired task.
Continuing our example from above, the reward function could reflect increases or decreases in user quality-
of-experience as a result of forwarding rule changes. The model would then be incentivized to find a sequence
of actions (forwarding rule changes) that maximizes the reward, thereby maximizing user QoE. The design of
the reward function is often critical to the success or failure of the reinforcement learning algorithm overall. The function must be directly related to the problem that the programmer wants the model to solve; otherwise, the model may learn to optimize the reward function without actually solving the problem. The function must
also have enough nuance such that the reward is not all or nothing. If the model only receives positive
feedback when it completely solves the problem, it will not have any direction as to what strategies are
promising along the way. This can be challenging, as many problems can seem binary (“solved” or “not
solved”) at first glance. Experienced reinforcement learning practitioners will identify ways to break down
problems into smaller, rewardable portions. As an example, imagine that you wish to use reinforcement
learning to stop a distributed denial of service attack that is coming from a rapidly changing set of remote
hosts. A naive approach to the reward function might give a reward of 1 if the attack is stopped after the
model takes an action and a reward of 0 if the attack is ongoing. This is unlikely to produce a good model
unless the training process accidentally stumbles across a specific strategy that stops the attack. A much better
approach would be to use a function that assigns a continuous reward value from 0 to 1 that is proportional
to the reduction in attack traffic caused by an action. This will guide the learning process, as actions that
cause a greater reduction in attack traffic are more likely to be included in the final strategy, even if they do
not completely block the attack on their own.
The discount factor is a scalar hyperparameter, typically written 𝛾 and between 0 and 1, that quantifies how much the model should value future rewards relative to immediate rewards. A low discount factor incentivizes the model to learn a strategy that maximizes rewards now (or in the near future), potentially lowering the total reward received over time (if it were allowed to run long enough to collect rewards in the distant future). A high discount factor prioritizes maximizing the total reward received over time, but might result in lower rewards received in the near future. For example, imagine someone asking if you want a small sum of money now versus a large sum of money later. The more willing you are to take the small sum immediately, the lower your personal discount factor. The discount factor is set by the programmer and is typically task-dependent.
8.3 Q-Values
To make things easier (and because it is essential for the models discussed in the next sections), we will
introduce one additional composite value: the q-value. Q-values are defined with respect to state/action
pairs. The q-value of a state/action pair is the sum of the discounted future rewards that the model can expect
if it chooses to take that action when it is in that state.
This is a bit complicated, so let’s break it down. First, imagine that the model is in a particular state s. It
can potentially choose among several actions, but let’s imagine that it chooses action a. As a result of this
choice, the model will end up in some other state (based on the transition probabilities) and might receive
some immediate reward (based on the reward function). The model will then go on its merry way, continuing
to choose actions and receive rewards on into the future. In this scenario, just how good was the choice of
action a from state s? To quantify this, we have to account for any immediate reward received and any
future rewards that the model might have set itself up for by making this choice. Since we don’t want this
q-value to depend on the future (it should only depend on s and a), let’s assume that the model only makes
perfect choices in the future, receiving the maximum amount of reward possible. Of course, these future
rewards might be less enticing to the model than immediate rewards based on the discount factor. The q-
value therefore combines both the immediate reward and the discounted optimal future rewards into a single
number indicating the quality of the decision of taking action a in state s.
The q-value is important because a model that knows (or learns) accurate q-values can perform optimally.
When it finds itself in any state, it simply chooses the action for which the corresponding state/action q-value
is the highest. Because the q-values have already accounted for the effect of this action on the model’s future
options, the model doesn’t need to worry about the future when making any individual decision – it just
chooses an action based on the q-values associated with its current state.
This formulation of q-values reframes reinforcement learning into a task of determining the most accurate
q-values for all state-action combinations that the model might encounter. Fortunately, this has guided model
design and led to many effective reinforcement learning algorithms, including those we discuss below. Unfortunately, this formulation does not resolve the fundamental limitation raised in the previous section:
if there are an infinite number of states and/or actions in an environment, there will also be an infinite number
of q-values. The following approaches address this issue for environments of increasing complexity.
8.4 Q-Value Iteration

Some tasks have narrowly defined environments in which it is possible to determine and store all relevant details, including the states, actions, transition probabilities, reward function, and discount factor, before any learning takes place.
In these cases, the agent can run an offline iterative algorithm to determine all q-values without needing to explore or otherwise interact with the environment at all. Starting with all q-values set to 0, the model computes the following equation iteratively for each state/action pair until the q-values converge. In the equation, 𝑠 is a state, 𝑎 is an action, 𝑠′ is a different state, 𝑎′ is a different action, 𝑇 is the transition probability, 𝑅 is the reward function, and 𝛾 is the discount factor. This process uses the q-value computed at iteration 𝑘 to compute the q-value at iteration 𝑘 + 1.
$$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$
Intuitively, the right side of the equation goes through every possible state 𝑠′ the model might immediately
reach if it takes action 𝑎 from state 𝑠, adds the immediate reward and the estimate (q-value) of the optimal dis-
counted future rewards from 𝑠′ going forward, and weights that value by the likelihood (transition probability)
of reaching 𝑠′ .
Once all q-values have converged during this offline training process, the agent can then use the q-values to
act optimally in the environment. Regardless of which state 𝑠 the agent finds itself, it chooses the action 𝑎
for which the q-value of the pair (𝑠, 𝑎) is the highest.
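A compact sketch of this offline iteration, assuming hypothetical dictionaries T and R that encode the transition probabilities and reward function (in the spirit of the toy switch environment sketched earlier), is:

def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Offline q-value iteration.

    T[(s, a)] maps next states s2 to transition probabilities;
    R[(s, a, s2)] is the reward for that transition.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = {}
        for s in states:
            for a in actions:
                # Expected immediate reward plus discounted best future q-value.
                Q_new[(s, a)] = sum(
                    p * (R.get((s, a, s2), 0.0)
                         + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T.get((s, a), {}).items()
                )
        if max(abs(Q_new[k] - Q[k]) for k in Q) < tol:
            return Q_new
        Q = Q_new

With the toy switch environment above, a reward table as simple as R = {("empty_table", "add_rule", "rule_installed"): 1.0} is enough to run the iteration.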
8.5 Q-Learning
For most reinforcement learning tasks, the agent does not know much (or anything) about the environment
at the outset. Instead, the agent must explore the environment to discover the states and rewards. This is an
online process, meaning that the q-values, and therefore the best strategy, cannot be calculated a priori. Only
by engaging with the environment can the agent learn the q-values. Starting with estimates for all q-values
at 0, the agent performs the following iterative update after each exploration step, where 𝛼 is a learning rate parameter that controls how fast new exploration steps update the existing q-value estimates, 𝑟 is the reward received for the step, and 𝑠′ is the state the agent reaches:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$$
This equation also describes a weighted sum. If 𝛼 is low, the first term will dominate, and the update will prioritize the previous q-value estimate. If 𝛼 is high, the update will prioritize the newly discovered reward 𝑟 (and the discounted estimate of the optimal future reward). The choice of 𝛼 depends on how responsive you want the model to be to changes in the environment. A low 𝛼 will cause the q-value estimates to take longer to approach the correct values, but they will be less susceptible to noise. A high 𝛼 will cause the q-value estimates to approach the correct values more quickly, but they will fluctuate more severely with changes in the environment.
This process tells the agent how to update q-value estimates, but it doesn’t say anything about which actions
to take while exploring. This is an example of an “exploration/exploitation” tradeoff. On one hand, it makes
sense to venture into new areas of the environment to see if they yield new approaches to the solution. On
the other hand, it makes sense to fine tune strategies already known to work well. In practice, Q-learning
agents will define a (potentially dynamic) parameter 0 ≤ 𝜖 ≤ 1 such that random actions are chosen with
probability 𝜖 (new strategies), and the action with the highest estimated q-value at the current state is chosen
with probability 1 − 𝜖. Once the agent has finished exploring and is satisfied with its q-value estimates, it
can enter a deployment phase where it always chooses the action that maximizes the estimated q-value.
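Putting the update rule and the 𝜖-greedy exploration strategy together, a minimal tabular Q-learning loop might look like the sketch below, assuming a hypothetical env object whose reset() returns a starting state and whose step(action) returns a (next_state, reward, done) tuple:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], implicitly 0 until visited
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning update: blend the old estimate with the new target.
            target = reward + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
            state = next_state
    return Q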
In addition to requiring environment exploration, many real world tasks have large environments and many
degrees of freedom. In analog real-world environments, there are a (practically) infinite number of states.
Even digitally represented environments often have more possible states than can be feasibly represented in
a data structure. This makes it impossible for a reinforcement learning agent to enumerate all (state, action)
pairs or store estimates of all q-values in memory.
Rather than attempting to store all q-values, approximate q-learning uses a machine learning model to predict
the q-value for the current state and all possible actions. This approach significantly reduces the amount of
storage needed, as the agent need only maintain the model, the current state, and the set of possible actions.
When the agent enters a new state, it asks the model for a new set of q-value predictions and then chooses
the action corresponding to the largest predicted q-value.
A version of approximate Q-learning called deep q-learning uses a supervised neural network model to
predict q-values. The training data for the model comes from environment exploration. Just as in Q-Learning,
the agent explores the environment using a parameterized exploration/exploitation tradeoff in order to learn
what combinations of states and actions produce rewards. If the agent takes action 𝑎 from state 𝑠 to state 𝑠′ and receives reward 𝑟, it updates the label (target) q-value for pair (𝑠, 𝑎) to the following:

$$Q_{\text{target}}(s, a) = r + \gamma \max_{a'} Q(s', a')$$

where the 𝑄 on the right-hand side is the network's current prediction.
After taking several exploratory steps and updating several q-value targets, the agent re-trains the deep learn-
ing model such that its predicted q-values are closer to the targets. Once the agent has finished exploring
and is satisfied with the final trained deep learning model, it can enter a deployment phase where it always
chooses the action with the highest predicted q-value at the current state.
The application of Deep Q-Learning (and deep reinforcement learning more generally) to networking prob-
lems is an active area of research. Models have been trained to dynamically allocate spectrum, perform access
and rate control, improve caching for better QoS, optimize data and computation offloading, configure traffic
routing, share resources across nodes, and counter a variety of security threats.