Practical Guide To Modern
Networking Telemetry
How Telemetry Can Be Used to See Into Your
Network’s Performance and Usage Patterns
With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
Practical Guide To Modern Networking Telemetry
by Avi Freedman and Leon Adato
Copyright © 2025 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].
Brief Table of Contents (Not Yet
Final)
Introduction (available)
Chapter 1: Network and Telemetry Introduction (available)
Chapter 2: Wrangling Telemetry (unavailable)
Chapter 3: Intro to Using Telemetry (unavailable)
Chapter 4: Using Telemetry, Individually (unavailable)
Chapter 5: Using Telemetry, Together (Network Layer) (unavailable)
Chapter 6: Using Network Telemetry, Combined with Other Layers
(unavailable)
Introduction
Cost
At current infrastructure scale, cost can be enormous for many
companies, and optimizing the network infrastructure is often a full-time
job, or more. Combining network telemetry with business data about
cost can drive huge savings that often fund the entire network
observability stack.
Security
Network telemetry remains a fast, effective way to identify most
cybersecurity issues, such as DDoS attempts, compromise and lateral
movement, and the impact of botnets.
About You (“Is this book for me?”)
This book is for you if any of the following things are (or might be) true:
You’d describe yourself as a “learn and do” kind of person.
You are comfortable with application monitoring and observability,
but not networking, and you’d like to find out how network
monitoring and observability are different (and beneficial!).
You are comfortable with networking, but not monitoring and
observability, and you’d like to find out how network monitoring
and observability are different (and beneficial!).
You build, maintain, support, or are simply curious about “the
network” and the ways in which network performance impacts
everything that rides on top of it, from the data to the application to
the overall user experience.
You know how to look at a dashboard and interpret data presented
in charts and graphs, but you want to understand how monitoring
and observability data are represented in those forms.
On the flip side, what does this book presume you already know? To be
honest, there aren’t a lot of requirements. Throughout this guide, we’ll not
only provide detailed information on terms and technologies, but also point
you to external content when we think some readers might appreciate a
deeper dive than we have pages to cover.
That said, you will be most comfortable with the information we’re sharing
if the following things are generally true about you:
If you aren’t rock-solid on those topics, DO NOT PANIC (also, don’t
put this book back on the shelf. We’re not done paying off our kids’
orthodontist yet.). Throughout this guide we’ll offer information,
instruction, and examples. And if you need more, we’ll also provide links to
background and deeper dives on these and other topics as we cover them.
But this guide isn’t just geared to increasing your awareness. We’d also like
to believe that we’ll provide you with skills you can actively apply.
Therefore, after reading this book we also hope the reader will be able to:
Chapter 1. Network and
Telemetry Introduction
Anatomy of a Network
Let’s be honest - networks are composed of simple base components, but at
scale they are anything but simple. Sure, most of the network diagrams you
see in class are three routers (inevitably named “Spring”, “Summer”, and
“Fall”), connected to a switch, which is connected to “the cloud”.
And yet, in the real world, networks are composed of many (MANY!)
different device types in multiple configurations and use cases.
Cloud VPCs
Virtual private clouds in your public cloud infrastructure. Includes
subnets and the container environments where you deploy your
microservices-enabled applications.
Transport devices
Many modern transport devices (layers 1 and 2 in the OSI Model¹) in
fiber, broadband, and mobile networks now support active and passive
telemetry.
TAP/SPAN/NPB devices
Physical and virtual test access points (TAP), switch port analyzers
(SPAN), and network packet brokers (NPB) that provide port mirroring,
testing, and monitoring.
But a few of those deserve a more detailed description. Once again, the
purpose of this guide is not to teach you every aspect of network design,
architecture, implementation or management. Instead, we want to describe
the devices below in terms of the telemetry they emit and the insights that
telemetry provides.
It’s also important to note at the outset that each of these devices has its
own hardware metrics that shed light on the network - everything from
To a greater or lesser extent, combining that insight with the other details
can tell you where problems are occurring or, conversely, when a problem
is actually downstream of a device which seems to be complaining but is in
actuality simply unable to communicate with the next hop in the chain.
The final caveat before diving into the specifics of each device type is that
even something as seemingly innocuous as inventory (especially when
visualized as a map) can have a profound impact on your ability to
understand how a network is performing and where the root cause of a
problem may lie.
Tunnels / VPNs
Originally used for more “exotic” configurations, tunnels are now
commonplace. They are protocol-based wormholes that connect different
parts of a network together. Common tunneling protocols include GRE,
IP-in-IP, and WireGuard. When these are exposed to users, they’re often
just called VPNs, but people raised in networking tend to think of them
as tunnels.
Device Types
Routers
Routers are hardware devices that generally run a Unix-based OS, which
interacts with users and other networking elements and instructs
specialized hardware (if present) how to forward, filter, and report on
the packets going through.
Routers have interfaces - physical or logical - and the physical
interfaces usually have optics or wired ports that can be monitored.
Many modern routers can do switch-like Layer 2 forwarding
themselves, but generally (unlike a switch) a router segregates Layer 2
forwarding unless told to do otherwise via configuration.
Layer 2 Switches
Layer 2 switches move (typically) Ethernet frames around, but they also
have OSes, CLIs, protocols, tables, and telemetry of their own.
Layer 3 switches
Most switches today are very close to being routers, and do IP
forwarding at line rate as well.
Web Logs
Web servers can emit a log line per transaction that describes various
actions and transactions, both successes and failures. These logs can shed
a great deal of light on the network since they show source and
destination IP addresses, application context, and performance
information at both the application and TCP layers.
Load balancers
With regard to the devices we’re exploring in this section, load
balancers may be the most fundamentally different as well as the most
narrowly focused. They’re really Layer 4+ routers in some sense, and the
metrics and telemetry you should look for from them bear at least a
passing resemblance to those of routers and switches.
Unlike routers, since they are part of the Layer 4+ transactions, they
actually see and can report on response time, latency, throughput, error
information, and even the traffic patterns of the data being handled.
They can be great observation points, but teams often leave them out of
their network telemetry sources.
Service Meshes
Broadly speaking, service meshes are load balancers designed to talk to
other backend software elements, not browsers/users. They can do
health checks, load balancing, content rewriting, policy enforcement,
and telemetry just like load balancers, and are almost always delivered
as a software layer or service, unlike load balancers which are
sometimes still physical appliances.
Firewalls
When discussing routers, we mentioned that there are security controls
that might be involved. Firewalls take that behavior and make it their
entire raison d’être. At the same time, there are still routing elements
involved.
Events
Strings of text and numbers stamped with a time and a source to show
that such-and-such occurred to so-and-so system at such-and-such time.
Logs
Sources of messages and other output which might be aggregated across
many systems and comprise multiple layers of the architecture, from
low-level hardware to high level application.
Traces
A coherent collection of information that shows how a “transaction”
within an application traverses multiple systems, and how the
transaction performed at every step along that path - usually
augmented with a transaction or trace ID to allow correlation of all of
the steps in that transaction.
These categorizations are, by and large, fine. However they are (and always
have been) biased toward application telemetry. And that’s fine, but it
doesn’t work completely when discussing network telemetry.
Therefore, Avi and I are presenting a different framework for understanding
the different types of data that you’ll commonly encounter when monitoring
a network infrastructure.
Device metrics
These tell you the state or health of your physical and logical network
equipment. Sample formats include SNMP, syslog, and streaming
telemetry.
Events
These indicate occurrences such as an attempted login, a threshold being
met, or a configuration being changed. Sample formats include SNMP
traps and syslog.
Tables
Snapshots/state of the various tables in a router, mostly for
forwarding/routing.
Synthetic
These “synthetic” measurements reveal performance metrics such as
latency, packet loss, and jitter, and can be triggered or collected via
device telemetry interfaces. They span client and server endpoints,
network equipment, and internet-wide locations at both the network and
application layers.
Configuration
This (typically static) data represents the operating intent for all
configurable network elements such as topology information, IP
addresses, access control lists, location data, and even device details
such as hardware and software versions. Sample formats include XML,
YAML, and JSON files.
Business or operational
Often called “layer 8,” this data provides business, application, and
operational context about what the network is being used for, and can be
added to telemetry to help network pros measure impact, understand the
value of certain traffic, and prioritize their work.
DNS
DNS telemetry helps put other network data into context by indicating
where traffic is coming from or going to. Most DNS information
comes in text-based files.
Drill-Down: Telemetry Types
The previous list of types of telemetry is concise and focused rather than
comprehensive. But now that we understand the definition and the value of
network telemetry, as well as the devices that make up a typical network,
we need to take a moment to list out all of the various data types available
for monitoring and observability.
This section will go into both the protocols themselves and, in some cases,
touch on the
Device Health and Status: Syslog
Syslog is a protocol which allows one machine to send a message (“log”) to
a server listening on TCP or UDP port 514. It is used most often, and at
the highest volume, when monitoring network and *nix (Unix, Linux) devices,
but security devices such as firewalls and IDS/IPS systems also send system
and component logs. They can be configured to send even more detailed logs,
though care is needed not to overwhelm the CPU (control plane).
Syslog messages are similar to SNMP traps, but differ in that syslog
messages are relatively freeform and don’t depend on the MIB-OID
structure required by SNMP.
In addition to being more freeform, syslog tends to be “chattier” than
SNMP traps. However, it’s more flexible because many applications can be
set up to send syslog messages, whereas SNMP traps are generally used
much more sparingly. Most companies also have much broader and more
robust log collection ability and scale than they do for SNMP traps.
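To make the “freeform text with a small header” idea concrete, here is a
minimal sketch in Python. It relies on one well-known detail of the syslog
format: each message starts with a priority value in angle brackets,
calculated as facility * 8 + severity. The function names, the device name,
and the sample message are our own illustrations, not part of any
particular tool.

```python
import socket

# Severity names used by syslog, indexed by number (0 is the most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def build_message(facility: int, severity: int, text: str) -> bytes:
    """Prefix free-form text with the <PRI> header (facility * 8 + severity)."""
    return f"<{facility * 8 + severity}>{text}".encode("utf-8")


def parse_priority(raw: bytes) -> tuple[int, str, str]:
    """Split a received message into (facility, severity name, free-form text)."""
    message = raw.decode("utf-8", errors="replace")
    if not message.startswith("<") or ">" not in message:
        raise ValueError("no <PRI> header found")
    pri, text = message[1:].split(">", 1)
    facility, severity = divmod(int(pri), 8)
    return facility, SEVERITIES[severity], text


def send_syslog(raw: bytes, host: str, port: int = 514) -> None:
    """Fire-and-forget delivery over UDP; the sender gets no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(raw, (host, port))


# Hypothetical message from a hypothetical router; facility 23 is local7.
sample = build_message(23, 3, "edge-rtr-01 Interface Gi0/1 changed state to down")
facility, severity, text = parse_priority(sample)
```

Everything after the header is whatever text the sender chose to write,
which is exactly why syslog is both flexible and harder to parse
consistently than MIB-structured SNMP data.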
Virtual Private Clouds
VPC flow logs are equivalent to the flow records (e.g., NetFlow, sFlow,
etc.) in a traditional (on premises) network. Various cloud network
components such as a VPC, a subnet, a network interface, internet
gateway, or a transit gateway, can generate flow logs which are used in
the same way as NetFlow data. These started with the core cloud network
functionality (VPC, NSG, VNET) but have versions now available for many
other cloud services like storage, firewalls, and other layers of the
stack.
Load Balancers
On the other hand, there are some very specific data types, like
connections (active, passive, etc), request count, information on
healthy/unhealthy hosts, and even the health and performance of the load
balancer infrastructure itself.
If the load balancer is in a cloud environment, you might also be
interested in request tracing, changes to the load balancer itself (due
to elastic compute), and connectivity to other cloud-based elements like
storage, content engines, and more.
It’s also important to note that “load balancer” is a general term for a
device which might be specifically designed for application, network, or
even gateways (virtual devices like firewalls, or intrusion detection
systems) traffic.
SNMP Traps
This is the push-based option. Traps are triggered on the device being
monitored, and sent to another device that is listening for those messages
(a trap receiver or trap destination). Nothing is needed on the part of the
monitoring solution except to receive and store those messages, and then
correlate them with telemetry from other sources.
All the configuration for this is done on the device itself, meaning that
every device on your network that needs to send traps has to be configured
with the trap destination, along with the security elements (community
string or username/password). We’ll explain more about those options
below.
SNMP Get
As hinted at earlier, SNMP Get requests are pull-based. A remote system
sends a request to the machine being monitored, requesting one or more
pieces of data. The most basic form of this command is:
Rather than get into the weeds of the various methods of SNMP Get (which
will largely be handled under the hood by whatever monitoring and
observability tool you use), the point is that SNMP works to bring data
about a remote system (whether numeric or text) into a local repository for
storage, tracking, and visualization.
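Once polled values land in that local repository, the tool usually has to
derive something useful from them. As a minimal sketch (not any specific
tool’s implementation), assume the polled value is a 32-bit octet counter
such as an interface’s inbound byte count. Counters only ever go up until
they wrap back to zero, so the rate comes from the difference between two
polls. The sample values and function names below are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps to 0 after 2**32 - 1


def counter_delta(previous: int, current: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two polls, allowing for a single counter wrap."""
    return (current - previous) % modulus


def bits_per_second(previous: int, current: int, interval_seconds: float) -> float:
    """Turn two octet-counter samples into an average rate over the poll interval."""
    if interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    return counter_delta(previous, current) * 8 / interval_seconds


# Hypothetical (timestamp in seconds, counter value) samples, polled 60 seconds apart.
samples = [(0, 1_000_000), (60, 8_500_000), (120, 16_000_000)]
rates = [
    bits_per_second(prev_value, value, t - prev_t)
    for (prev_t, prev_value), (t, value) in zip(samples, samples[1:])
]
```

Note that the result is an average over the whole polling interval. A
burst that lasts five seconds inside a 60-second poll is smoothed away,
which is one reason polling granularity matters.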
Streaming Telemetry
As described in the “Types of Telemetry” section, streaming telemetry (ST)
is notable for offering many of the same data points as SNMP and more, a
far more consistent data structure, and significantly higher granularity
with significantly lower impact on the device being monitored.
Originally developed as an alternative to SNMP, the goal was to move away
from poll-based observation (and each monitoring system separately polling
devices), and towards pushing defined data from network devices, to then
be consumed by all of those systems.
So what’s not to love? Well, getting it set up can be a bit of a challenge.
This mostly arises from the newness of the technology and the lack of
experience on the part of both the engineers developing observability
solutions and the network professionals implementing ST in their
environments.
Oh, and the fact that it’s not supported on a wide range of devices yet.
That’s another critical factor.
Telemetry Deep Dive: NetFlow, sFlow, and Other Traffic
Sources
NetFlow (and its variations like sFlow, JFlow, IPFIX, and others, which
we’ll refer to from here on out collectively as “NetFlow” unless we’re
discussing something particular about the other variations) was practically
purpose-built with the goals of network telemetry and network observability
in mind. While NetFlow isn’t the only protocol a network engineer may
need, it is almost certainly the primary one they will refer to for the richest
level of insight.
NetFlow is push-based, meaning the device observing and generating the
telemetry (kind of; more on that in a minute) sends data to a listening
device.
As you can see, the single machine is able to report on conversations
between devices on the internal network, the internet, and many points
along the way such as peer routers.
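To make the idea of a single exporter reporting on many conversations
concrete, here is a minimal sketch that sums a handful of flow records
into per-conversation byte totals. The FlowRecord fields are a simplified
stand-in of our own (real flow exports carry many more fields), and the
addresses come from the reserved documentation ranges.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """A simplified, illustrative flow record."""
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str
    byte_count: int


def top_conversations(flows: list[FlowRecord], limit: int = 3) -> list[tuple[tuple[str, str], int]]:
    """Sum bytes per (source, destination) pair and return the largest ones."""
    totals: Counter = Counter()
    for flow in flows:
        totals[(flow.src_ip, flow.dst_ip)] += flow.byte_count
    return totals.most_common(limit)


# Hypothetical records, as one exporting router might report them.
flows = [
    FlowRecord("192.0.2.10", "198.51.100.7", 443, "tcp", 120_000),
    FlowRecord("192.0.2.10", "198.51.100.7", 443, "tcp", 80_000),
    FlowRecord("192.0.2.22", "203.0.113.5", 53, "udp", 900),
    FlowRecord("192.0.2.31", "198.51.100.7", 443, "tcp", 45_000),
]

for (src, dst), total_bytes in top_conversations(flows):
    print(f"{src} -> {dst}: {total_bytes} bytes")
```

Grouping by a different key (destination port, protocol, or later on an
ASN or application name) answers a different question from the same
records, which is a large part of why flow data is so rich.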
1. Go to a website
2. “Click” on the login page
3. Log in with pre-set credentials to a dummy account
4. “Click” on the account balance page
5. Verify that the balance is $2.75
Not only will you be able to determine whether the app/system/site is up or
down, you will also gain insight as to the timing of the overall transaction
and each of the steps along the way.
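The steps above can be sketched as a small runner that times each step
and stops at the first failure. This is only an outline of the technique:
the step actions here are placeholders that always succeed, where a real
synthetic test would drive a browser or an HTTP client. All names are
hypothetical.

```python
import time
from typing import Callable


def run_transaction(steps: list[tuple[str, Callable[[], bool]]]) -> list[dict]:
    """Run steps in order, timing each one; stop at the first failure."""
    results = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            ok = bool(action())
        except Exception:
            ok = False  # an exception counts as a failed step
        elapsed_ms = (time.perf_counter() - started) * 1000
        results.append({"step": name, "ok": ok, "ms": elapsed_ms})
        if not ok:
            break
    return results


def fake_balance() -> str:
    """Placeholder for reading the balance off the account page."""
    return "$2.75"


steps = [
    ("open site", lambda: True),
    ("open login page", lambda: True),
    ("log in to dummy account", lambda: True),
    ("open balance page", lambda: True),
    ("verify balance", lambda: fake_balance() == "$2.75"),
]

results = run_transaction(steps)
transaction_ok = len(results) == len(steps) and all(r["ok"] for r in results)
total_ms = sum(r["ms"] for r in results)
```

Keeping a timing per step, rather than one number for the whole
transaction, is what lets you say “login got slow” instead of just “the
site got slow”.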
The range of applications, systems, and conditions that can be checked via
synthetic transactions includes network status and performance tests, from
simple checks (ping and traceroute) to more complex ones for things like
BGP, ASNs, and CDNs; internet protocol tests for DNS and HTTP
responsiveness; web-centric tests for things like page load time or API
responsiveness; and multi-step tests like the one described above.
Configuration management
Configurations sit at the border between monitoring and management. On
the one hand, a configuration identifies how a thing (a system, application,
or operation) works on a fundamental level. On the other hand, knowing
that a configuration has changed - and especially when it changed - can
often make the difference between having no idea why a system is suddenly
having an issue and knowing when the problem REALLY started and
where to look.
At a high level, configuration management as it relates to monitoring and
observability is the process of first identifying which files or elements count
as “configuration”, doing an initial scan of that object, and then repeating
the process and noting changes.
Generally, for the “big vendor” routers and switches, this is not done by
watching files on a filesystem, though it may be on your Linux or
BSD-based services.
So in that case, the monitoring system would have to either use an API
(hopefully!) or log into the system using a terminal protocol like SSH
(because friends don’t let friends use telnet), run some variation of the
“show config” command, capture (“scrape”) the results, and save that
information to the monitoring system (whether as a file or a database entry).
That process would be repeated, and the two results scanned for differences.
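The “scan two results for differences” step can be sketched with Python’s
standard difflib module. This assumes the two captures are already stored
as text; how they were collected (API or scraping) doesn’t matter here.
The function name and the sample configuration lines are illustrative.

```python
import difflib


def config_changes(previous: str, current: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines between two captures."""
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical captures of the same device, taken at two different times.
previous_capture = """\
hostname edge-rtr-01
interface Gi0/1
 description uplink
 no shutdown
"""
current_capture = """\
hostname edge-rtr-01
interface Gi0/1
 description uplink
 shutdown
"""

changes = config_changes(previous_capture, current_capture)
for line in changes:
    print(line)
```

Storing each capture with its timestamp is what turns this from “something
changed” into “something changed between these two scans”, which is the
piece that helps with root cause.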
But that’s not all. Those same configurations can be scanned for everything
from syntax errors to security errors.
If issues are found, they could trigger an alert, or they might show up on
a periodic report with the new, changed, deleted, or problematic elements
highlighted.
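As a rough sketch of that kind of scan, the example below checks a
captured configuration against a couple of pattern-based rules and
returns findings that could feed an alert or a report. The two rules are
illustrative only (telnet left enabled, a default-looking SNMP community
string), and a real policy would be specific to your environment and
vendor syntax.

```python
import re

# Illustrative rules only: (pattern, human-readable finding).
RULES = [
    (re.compile(r"^\s*transport input .*\btelnet\b"), "telnet access is enabled"),
    (re.compile(r"^\s*snmp-server community (public|private)\b"), "default SNMP community string"),
]


def audit_config(config: str) -> list[tuple[int, str, str]]:
    """Return (line number, finding, offending line) for every rule match."""
    findings = []
    for number, line in enumerate(config.splitlines(), start=1):
        for pattern, finding in RULES:
            if pattern.search(line):
                findings.append((number, finding, line.strip()))
    return findings


# Hypothetical captured configuration.
captured = """\
hostname edge-rtr-01
snmp-server community public RO
line vty 0 4
 transport input telnet ssh
"""

findings = audit_config(captured)
for number, finding, line in findings:
    print(f"line {number}: {finding}: {line}")
```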
To summarize: Configurations may seem at first glance to be outside the
scope of what a monitoring and observability tool might care about. It
certainly doesn’t seem to fit the description of what “network telemetry”
includes.
But in truth, configurations are so tightly bound up in system, application,
and sub-component stability and performance that NOT monitoring this
critical aspect of the environment seems foolhardy at the very least, and
possibly negligent when you consider the worst-case scenarios (which are
sadly becoming more common in these days of companies joining the
security-breach-of-the-week club).
1 For those who need a reminder, the layers are: Physical, Data Link, Network,
Transport, Session, Presentation, Application. A popular mnemonic for this is: “Please Do Not
Throw Sausage Pizza Away”. For those who need more than a reminder, there’s always
Wikipedia (https://en.wikipedia.org/wiki/OSI_model).
About the Authors
Avi Freedman and Leon Adato have, collectively, over 70 years of
experience in the tech industry, with particular focus on networking,
monitoring, and observability. Both recognize that, after the hard work of
building a solution is done - whether that be a network, a datacenter, or an
application - the hard work of keeping things running starts. And that’s
usually where the problems really start. Their decision to collaborate on this
book arose first and foremost to share all the samples, examples, stories,
and lessons they usually share in the booth at conferences, or in talks, or
when helping customers, but also to provide a resource to the readers
themselves, who might need to articulate those same lessons to colleagues,
managers, or the odd (very odd) person at a dinner party.