Datadog eBook
Monitoring Modern Infrastructure
TABLE OF CONTENTS
Chapter 1: Constant Change
Chapter 2: Collecting the Right Data
Chapter 3: Alerting on What Matters
Chapter 4: Investigating Performance Issues
Chapter 5: Visualizing Metrics with Timeseries Graphs
Chapter 6: Visualizing Metrics with Summary Graphs
Chapter 7: Putting It All Together – Monitoring Kubernetes
Chapter 8: Putting It All Together – Monitoring AWS Lambda
Chapter 9: Datadog is Dynamic, Cloud-Scale Monitoring
Chapter 1:
Constant Change
The cloud has effectively removed the logistical and economic barriers to accessing
production-ready infrastructure. Now, any organization or individual can harness
the same technologies that power some of the biggest companies in the world.
The shift toward the cloud has brought about a fundamental change on the
operations side as well. We are now in an era of dynamic, constantly changing
infrastructure—and this requires new monitoring tools and methods.
The elastic nature of modern infrastructure means that individual components are
often ephemeral and/or single-purpose. Cloud computing instances can run for just
hours or days before being destroyed. The shift toward containerization has accelerated
this trend, as containers often have short lifetimes measured in minutes or hours.
Serverless has also experienced a considerable amount of growth. With its ability to
abstract away the complexity of provisioning and managing underlying resources,
serverless enables developers to deploy more efficient, independent services.
But these technologies have also required organizations to overhaul the way they
build applications in the cloud and monitor their performance.
In most cases, your servers, containers, and other cloud infrastructure components
can be thought of as "cattle" (interchangeable and easily replaced) rather than "pets"
that are cared for individually. When it comes to monitoring, then, you should
focus on the aggregate health and performance of services rather than on isolated
datapoints from your hosts. Rarely should you page an engineer in the middle of the
night for a host-level issue such as elevated CPU usage. On the other hand, if
latency for your web application starts to surge, you’ll want to take action immediately.
These trends have also prompted a change in how DevOps personnel manage
continuous integration and delivery (CI/CD) pipelines and observability.
CONTINUOUS INTEGRATION AND DELIVERY
CI/CD is a cornerstone of many DevOps approaches. Rather than orchestrating
large, infrequent releases, teams that use CI/CD push small, incremental code
changes quickly and frequently. This simplifies the process of building, testing, and
merging new commits and allows development teams to release bug fixes and new
features much faster. It also enables engineers to quickly roll back any changes that
cause unforeseen issues in production.
OBSERVABILITY
In control theory, observability is the property of being able to describe or reconstruct
the internal state of a system using its external outputs. In practice, for an organization’s
infrastructure, this means instrumenting all compute resources, apps, and
services with "sensors" that dependably report metrics from those components.
It also means making those metrics available on a central, easily accessible
platform where observers can aggregate them to reconstruct a full picture of the
system’s status and operation. Observability dovetails with DevOps practices, as it
represents a cultural shift away from siloed, piecemeal views into critical systems
toward a detailed, comprehensive view of an organization's environment.
These trends place new demands on monitoring tools, which should provide the
capabilities described below.
BUILT-IN AGGREGATION
Powerful tagging and labeling schemes allow engineers to arbitrarily segment and
aggregate metrics, so they can direct their focus at the service level rather than the
host level. Remember: cattle, not pets.
SCALABILITY
Modern, dynamic monitoring systems accommodate the fact that individual hosts
come and go, and scale gracefully with expanding or contracting infrastructure.
When a new host is launched, the system should detect it and start monitoring
it automatically. Strategies like "monitoring as code" accomplish these goals by
creating a repeatable process for deploying observability solutions alongside
infrastructure. This process ensures that any change to a system is monitored
immediately.
SOPHISTICATED ALERTING
Virtually every monitoring tool can fire off an alert when a metric crosses a set
threshold. But in rapidly scaling environments, such fixed alerts require constant
updating and tuning. More advanced monitoring systems offer flexible alerts that
adapt to changing baselines, including relative change alerts as well as automated
outlier and anomaly detection.
COLLABORATION
When issues arise, a monitoring system should help engineers discover and correct
the problem as quickly as possible. That means delivering alerts through a team’s
preferred communication channels and making it easy for incident responders to
share graphs, dashboards, events, and comments.
Chapter 2:
Collecting the
Right Data
Monitoring data comes in a variety of forms. Some systems pour out data continuously
and others only produce data when specific events occur. Some data is most useful
for identifying problems; some is primarily valuable for investigating problems.
This chapter covers which data to collect, and how to classify that data so that you
can receive meaningful, automated alerts about potential problems and quickly
investigate and get to the bottom of performance issues.
Whatever form your monitoring data takes, the unifying theme is this:
Collecting data is cheap, but not having it when you need it can be expensive,
so you should instrument everything, and collect all the useful data you
reasonably can.
Most monitoring data falls into one of two categories: metrics and events. Below
we'll explain each category, with examples, and describe their uses.
Metrics
Metrics capture a value pertaining to your systems at a specific point in time—for
example, the number of users currently logged in to a web application. Therefore,
metrics are usually collected at regular intervals (every 15 seconds, every minute,
etc.) to monitor a system over time.
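To make regular collection concrete, here is a minimal sketch of reporting a gauge metric on a fixed interval. It assumes the `datadog` Python library and a locally running DogStatsD agent; the metric name, tags, and `count_logged_in_users` helper are illustrative, not part of any required setup.

```python
import time
from datadog import initialize, statsd

# Point the DogStatsD client at a local agent (host and port are assumptions).
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def count_logged_in_users():
    # Hypothetical helper; replace with a real query against your application.
    return 42

while True:
    # Report the current value of the gauge, tagged by service and environment.
    statsd.gauge("web.users.logged_in", count_logged_in_users(),
                 tags=["service:web-app", "env:prod"])
    time.sleep(15)  # collect at a regular 15-second interval
```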
There are two important categories of metrics in our framework: work metrics
and resource metrics. For each system in your infrastructure, consider which work
metrics and resource metrics are reasonably available, and collect them all.
WORK METRICS
Work metrics indicate the top-level health of your system by measuring its useful
output. These metrics are invaluable for surfacing real, often user-facing issues,
as we'll discuss in the following chapter. When considering your work metrics, it’s
often helpful to break them down into four subtypes:
— throughput is the amount of work the system is doing per unit time.
Throughput is usually recorded as an absolute number.
— success metrics represent the percentage of work that was executed successfully.
— error metrics capture the number of erroneous results, usually expressed as an
error rate or as a percentage of throughput.
— performance metrics quantify how efficiently a component is doing its work.
The most common performance metric is latency, the time required to complete
a unit of work.
Below are example work metrics of all four subtypes for two common kinds of
systems: a web server and a data store.
EXAMPLE WORK METRICS: WEB SERVER (AT TIME 2016-05-24 08:13:01 UTC)

SUBTYPE   DESCRIPTION                                                    VALUE
Success   Percentage of responses that are 2xx since last measurement    99.1
Error     Percentage of responses that are 5xx since last measurement    0.1

EXAMPLE WORK METRICS: DATA STORE (AT TIME 2016-05-24 08:13:01 UTC)

SUBTYPE   DESCRIPTION                                                          VALUE
Error     Percentage of queries returning stale data since last measurement    4.2
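To show how these subtypes relate, here is a purely illustrative sketch that derives throughput, success, error, and performance figures from a window of web server responses; the sample records and the percentile calculation are made up for the example.

```python
# Hypothetical responses observed during a 60-second measurement window.
responses = [
    {"status": 200, "latency_ms": 85},
    {"status": 200, "latency_ms": 120},
    {"status": 201, "latency_ms": 95},
    {"status": 503, "latency_ms": 940},
]
window_seconds = 60

throughput = len(responses) / window_seconds                                  # work per unit time
success = 100.0 * sum(r["status"] < 400 for r in responses) / len(responses)  # % successful
error = 100.0 * sum(r["status"] >= 500 for r in responses) / len(responses)   # % 5xx errors
latencies = sorted(r["latency_ms"] for r in responses)
performance = latencies[int(0.9 * (len(latencies) - 1))]                      # approximate p90 latency

print(f"throughput={throughput:.2f}/s success={success:.1f}% "
      f"error={error:.1f}% p90={performance} ms")
```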
RESOURCE METRICS
Most components of your software infrastructure serve as a resource to other
systems. Some resources are low-level—for instance, a server’s resources include
such physical components as CPU, memory, disks, and network interfaces. But
a higher-level component, such as a database or a geolocation microservice,
can also be considered a resource if another system requires that component to
produce work.
Resource metrics are especially valuable for the investigation and diagnosis of
problems, which is the subject of chapter 4 of this book. For each resource in your
system, try to collect metrics that cover four key areas:
— utilization is the percentage of time that the resource is busy, or the percentage
of the resource's capacity that is in use.
— saturation is a measure of the amount of requested work that the resource
cannot yet service, and is often queued.
— errors represent internal errors that may not be observable in the work the
resource produces.
— availability represents the percentage of time that the resource responded
to requests.
EXAMPLE RESOURCE METRICS

RESOURCE   UTILIZATION                    SATURATION           ERRORS            AVAILABILITY
Disk IO    % time that device was busy    Wait queue length    # device errors   % time writable
OTHER METRICS
There are a few other types of metrics that are neither work nor resource metrics,
but that nonetheless may come in handy in diagnosing causes of problems.
Common examples include counts of cache hits or database locks. When in doubt,
capture the data.
Events
In addition to metrics, which are collected more or less continuously, some
monitoring systems can also capture events: discrete, infrequent occurrences that
provide crucial context for understanding changes in your system’s behavior.
Some examples include code changes and releases, internally generated alerts, and
scaling events such as adding or removing hosts or containers.
An event usually carries enough information that it can be interpreted on its own,
unlike a single metric data point, which is generally only meaningful in context.
Events capture what happened, at a point in time, with optional additional
information.
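As a simple sketch, an event can be represented as a record of what happened, when it happened, and any additional context. The field names and values below are illustrative, not a required schema.

```python
event = {
    "what": "Hotfix f464bfe released to production",   # hypothetical example event
    "when": "2016-05-15 04:13:25 UTC",
    "additional_info": {
        "initiated_by": "jglenn",                      # hypothetical
        "tags": ["#deploy", "#web-app", "env:prod"],
    },
}
```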
Tagging
As discussed in chapter 1, modern infrastructure is constantly in flux. Auto-scaling
servers die as quickly as they’re spawned, and containers come and go with even
greater frequency. With all of these transient changes, the signal-to-noise ratio in
monitoring data can be quite low.
In most cases, you can boost the signal by shifting your monitoring away from the
base level of hosts, VMs, or containers. After all, you don’t care if a specific EC2
instance goes down, but you do care if latency for a given service, category of
customers, or geographical region goes up.
Tagging your metrics enables you to reorient your monitoring along any lines you
choose. By adding tags to your metrics you can observe and alert on metrics from
different availability zones, instance types, software versions, services, roles—or
any other level you may require.
Each datapoint is made up of a metric name (what?), a metric value (how much?),
a timestamp (when?), and tags (where?). For example:

    system.net.bytes_rcvd  3  2016-03-02 15:00:00  ['availability-zone:us-east-1a', 'file-server', 'hostname:foo', 'instance-type:m3.xlarge']

Tags allow you to filter and group your datapoints to generate exactly the view of
your data that matters most. They also allow you to aggregate your metrics on the
fly, without changing how the metrics are reported and collected.

FILTERING WITH TAGS
The following example shows a datapoint with the simple tag of file-server:
    system.net.bytes_rcvd  4  2016-03-02 15:00:00  ['file-server']
Simple tags can only be used to filter datapoints: either show the datapoint with a
given tag, or do not.
[Figure: datapoints filtered and grouped by availability zone (us-east-1a, eu-west-1a, sa-east-1a) and by role]
Key:value tags are more powerful than simple tags: the key defines a dimension
(such as instance-type), and the value identifies each member of that dimension.
If you add other metrics with the same key, but different values, those metrics will
automatically have new attributes in that dimension (e.g. m3.medium). Once your
key:value tags are added, you can then slice and dice in any dimension.
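A minimal, dependency-free sketch of what tag-based filtering and aggregation look like in practice, using datapoints shaped like the examples above (the second datapoint is invented for illustration):

```python
from collections import defaultdict

datapoints = [
    {"metric": "system.net.bytes_rcvd", "value": 3,
     "tags": ["availability-zone:us-east-1a", "file-server",
              "hostname:foo", "instance-type:m3.xlarge"]},
    {"metric": "system.net.bytes_rcvd", "value": 4,
     "tags": ["availability-zone:us-east-1b", "file-server",
              "hostname:bar", "instance-type:m3.medium"]},
]

# Filter on a simple tag: keep only datapoints tagged file-server.
file_servers = [d for d in datapoints if "file-server" in d["tags"]]

# Group on a key:value tag: sum values per availability zone.
by_zone = defaultdict(int)
for d in file_servers:
    zone = next(t.split(":", 1)[1] for t in d["tags"]
                if t.startswith("availability-zone:"))
    by_zone[zone] += d["value"]

print(dict(by_zone))  # {'us-east-1a': 3, 'us-east-1b': 4}
```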
Monitoring data is most valuable when it is:

— Long-lived. If you discard data too soon, or if after a period of time your
monitoring system aggregates your metrics to reduce storage costs, then
you lose important information about what happened in the past. Retaining
your raw data for a year or more makes it much easier to know what
“normal” is, especially if your metrics have monthly, seasonal, or annual
variations.
In summary: to maximize the value of your data, tag metrics and events with the
appropriate scopes, and retain them at full granularity for at least a year.
Chapter 3:
Alerting on What
Matters
Automated alerts are essential to monitoring. They allow you to spot problems
anywhere in your infrastructure, so that you can rapidly identify their causes and
minimize service degradation and disruption.
But alerts aren’t always as effective as they could be. In particular, real problems
are often lost in a sea of noisy alarms. This chapter describes a simple approach to
effective alerting, regardless of the scale and elasticity of the systems involved.
In short: page on urgent, user-facing symptoms; send lower-priority notifications
for issues that need attention soon but not immediately; and record everything else
as context for future investigation.
The table below maps examples of the different data types described in the previous
chapter to different levels of alerting urgency. Note that depending on severity,
a notification may be more appropriate than a page, or vice versa:
DATA                             ALERT     TRIGGER
Work metric: throughput          Page      Value is much higher or lower than usual, or there is an anomalous rate of change
Work metric: success             Page      The percentage of work that is successfully processed drops below a threshold
Resource metric: errors          Record    Number of internal errors during a fixed period exceeds a threshold
Resource metric: availability    Record    The resource is unavailable for a percentage of time that exceeds a threshold
Event: work-related              Page      Critical work that should have been completed is reported as incomplete or failed
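As a rough sketch of how a mapping like the table above might translate into code, the function below classifies a set of observations into an alert urgency. The thresholds and parameter names are illustrative, not prescriptive.

```python
def classify_alert(success_rate, availability, internal_errors):
    """Map observations to an urgency level: page, record, or no alert."""
    if success_rate < 99.0:
        return "page"      # work metric (success) dropped below a threshold
    if availability < 99.9:
        return "record"    # resource metric (availability) below a threshold
    if internal_errors > 100:
        return "record"    # resource metric (errors) above a threshold
    return None            # nothing to alert on

print(classify_alert(success_rate=98.2, availability=100.0, internal_errors=3))  # page
```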
If the issue is indeed real, it should generate an alert. Even if the alert is
not linked to a notification, it should be recorded within your monitoring
system for later analysis and correlation.
PAGE ON SYMPTOMS
Pages deserve special mention: they are extremely effective for delivering
information, but they can be quite disruptive if overused, or if they are linked to
alerts that are prone to flapping. In general, a page is the most appropriate kind
of alert when the system you are responsible for stops doing useful work with
acceptable throughput, latency, or error rates. Those are the sort of problems that
you want to know about immediately.
The fact that your system stopped doing useful work is a symptom. It is a
manifestation of an issue that may have any number of different causes. For
example: if your website has been responding very slowly for the last three
minutes, that is a symptom. Possible causes include high database latency, failed
application servers, Memcached being down, high load, and so on. Whenever
possible, build your pages around symptoms rather than causes. The distinction
between work metrics and resource metrics introduced in chapter 2 is often useful
for separating symptoms and causes: work metrics are usually associated with
symptoms and resource metrics with causes.
Disk space is a classic exception to this rule. Unlike running out of free memory or CPU, when
you run out of disk space, the system will not likely recover, and you probably will
have only a few seconds before your system hard stops. Of course, if you can notify
someone with plenty of lead time, then there is no need to wake anyone in the
middle of the night. Better yet, you can anticipate some situations when disk space
will run low and build automated remediation based on the data you can afford to
erase, such as logs or data that exists somewhere else.
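One hedged sketch of building in that lead time: estimate how long until a disk fills by assuming roughly linear growth between two measurements, and notify (rather than page) well before the limit is reached. The numbers are illustrative.

```python
def hours_until_full(used_gb_then, used_gb_now, capacity_gb, hours_between_samples):
    """Naively extrapolate disk growth to estimate time remaining before the disk fills."""
    growth_per_hour = (used_gb_now - used_gb_then) / hours_between_samples
    if growth_per_hour <= 0:
        return float("inf")  # usage is flat or shrinking; no action needed
    return (capacity_gb - used_gb_now) / growth_per_hour

remaining = hours_until_full(used_gb_then=400, used_gb_now=430,
                             capacity_gb=500, hours_between_samples=24)
if remaining < 72:
    print(f"notify: disk projected to fill in about {remaining:.0f} hours")
```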
In summary: send a page only when symptoms of urgent problems in your system's
work are detected, or if a critical and finite resource limit is about to be reached.
Chapter 4:
Investigating
Performance Issues
As you'll recall from chapter 2, there are three main types of monitoring data that
can help you investigate the root causes of problems in your infrastructure: work
metrics, resource metrics, and events.
By and large, work metrics will surface the most serious symptoms and should
therefore generate the most serious alerts, as discussed in the previous
chapter. But the other metric types are invaluable for investigating the causes
of those symptoms.
Thinking about which systems produce useful work, and which resources support
that work, can help you to efficiently get to the root of any issues that surface.
When an alert notifies you of a possible problem, the following process will help
you to approach your investigation systematically.
Begin by examining the work metrics for the highest-level system that is
exhibiting problems. These metrics will often point to the source of the
problem, or at least set the direction for your investigation. For example,
if the percentage of work that is successfully processed drops below a
set threshold, diving into error metrics, and especially the types of errors
being returned, will often help narrow the focus of your investigation.
Alternatively, if latency is high, and the throughput of work being
requested by outside systems is also very high, perhaps the system is
simply overburdened.
In an outage, every minute is crucial. To speed your investigation and keep your
focus on the task at hand, set up dashboards in advance. You may want to set up
one dashboard for your high-level application metrics, and one dashboard for
each subsystem. Each system’s dashboard should render the work metrics of that
system, along with resource metrics of the system itself and key metrics of the
subsystems it depends on. If event data is available, overlay relevant events on the
graphs for correlation analysis.
We've now stepped through a high-level framework for data collection and
tagging (chapter 2), automated alerting (chapter 3), and incident response and
investigation (chapter 4). In the next chapter we'll go further into detail on
how to monitor your metrics using a variety of graphs and other visualizations.
Chapter 5:
Visualizing Metrics with
Timeseries Graphs
In order to turn your metrics into actionable insights, it's important to choose
the right visualization for your data. There is no one-size-fits-all solution: you can
see different things in the same metric with different graph types.
To help you effectively visualize your metrics, this chapter explores four different
types of timeseries graphs: line graphs, stacked area graphs, bar graphs, and heat
maps. These graphs all have time on the x-axis and metric values on the y-axis.
For each graph type, we'll explain how it works, when to use it, and when to use
something else.
But first we'll quickly touch on aggregation in timeseries graphs, which is critical for
visualizing metrics from dynamic, cloud-scale infrastructure.
Aggregation across space allows you to slice and dice your infrastructure to isolate
exactly the metrics that matter most to you. It also allows you to make otherwise
noisy graphs much more readable. For instance, it is hard to make sense of a
host-level graph of web requests, but the same data is easily interpreted when the
metrics are aggregated by availability zone.
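For example, a handful of noisy per-host request counts can be rolled up to the availability-zone level with a simple group-by. This sketch assumes the pandas library and uses invented data.

```python
import pandas as pd

df = pd.DataFrame([
    {"timestamp": "2016-03-02 15:00", "host": "web-1", "zone": "us-east-1a", "requests": 310},
    {"timestamp": "2016-03-02 15:00", "host": "web-2", "zone": "us-east-1a", "requests": 295},
    {"timestamp": "2016-03-02 15:00", "host": "web-3", "zone": "us-east-1b", "requests": 280},
    {"timestamp": "2016-03-02 15:01", "host": "web-1", "zone": "us-east-1a", "requests": 330},
])

# Aggregate across space: one series per availability zone instead of per host.
per_zone = df.groupby(["timestamp", "zone"])["requests"].sum()
print(per_zone)
```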
Line Graphs
Line graphs are the simplest way to translate metric data into visuals, but often
they’re used by default when a different graph would be more appropriate.
For instance, a graph of wildly fluctuating metrics from hundreds of hosts quickly
becomes harder to disentangle than steel wool.
WHEN TO USE LINE GRAPHS

— What: the same metric reported by different scopes. Why: to spot outliers at a glance. Example: CPU idle for each host in a cluster.
— What: tracking single metrics from one source, or as an aggregate. Why: to clearly communicate a key metric's evolution over time. Example: median latency across all web servers.
— What: metrics for which unaggregated values from a particular slice of your infrastructure are especially valuable. Why: to spot individual deviations into unacceptable ranges. Example: disk space utilization per database node.
— What: related metrics sharing the same units. Why: to spot correlations at a glance. Example: latency for disk reads and disk writes on the same machine.
— What: metrics that have a clear acceptable domain. Why: to easily spot service degradations. Example: latency for processing web requests.

WHEN TO USE SOMETHING ELSE INSTEAD OF LINE GRAPHS

— What: highly variable metrics reported by a large number of sources. Example: CPU from all hosts. Instead use: heat maps to make noisy data more interpretable.
— What: metrics that are more actionable as aggregates than as separate data points. Example: web requests per second over dozens of web servers. Instead use: area graphs to aggregate across tagged groups.
— What: sparse metrics. Example: count of relatively rare S3 access errors. Instead use: bar graphs to avoid jumpy interpolations.
Stacked Area Graphs
Area graphs are similar to line graphs, except the metric values are represented by
two-dimensional bands rather than lines. Multiple timeseries can be summed together
simply by stacking the bands.
WHEN TO USE STACKED AREA GRAPHS

— What: the same metric from different scopes, stacked. Why: to check both the sum and the contribution of each of its parts at a glance. Example: load balancer requests per availability zone.
— What: summing complementary metrics that share the same unit. Why: to see how a finite resource is being utilized. Example: CPU utilization metrics (user, system, idle, etc.).
WHEN TO USE SOMETHING ELSE INSTEAD OF STACKED AREA GRAPHS

— What: unaggregated metrics from large numbers of hosts, making the slices too thin to be meaningful. Example: throughput metrics across hundreds of app servers. Instead use: a line graph or solid-color area graph to track the total, aggregate value.
— What: metrics that can't be added sensibly. Example: system load across multiple servers. Instead use: line graphs, or heat maps for large numbers of hosts.
Bar Graphs
In a bar graph, each bar represents a metric rollup over a time interval. This feature
makes bar graphs ideal for representing counts. Unlike gauge metrics, which
represent an instantaneous value, count metrics only make sense when paired with
a time interval (e.g., 13 query errors in the past five minutes).
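The rollup behind a bar graph can be sketched in a few lines: bucket a sparse count metric into fixed intervals so that each bar pairs a count with its time interval. This assumes the pandas library; the timestamps are invented.

```python
import pandas as pd

# One entry per observed query error (a sparse count metric).
errors = pd.Series(
    1,
    index=pd.to_datetime([
        "2016-03-02 15:01:10",
        "2016-03-02 15:02:45",
        "2016-03-02 15:13:05",
    ]),
)

# Each bar covers a five-minute interval; intervals with no errors are simply zero.
per_interval = errors.resample("5min").sum()
print(per_interval)
```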
Bar graphs require no interpolation to connect one interval to the next, making
them especially useful for representing sparse metrics. Like area graphs, they
naturally accommodate stacking and summing of metrics.
WHEN TO USE BAR GRAPHS

— What: sparse metrics. Why: to convey metric values without jumpy or misleading interpolations. Example: blocked tasks in Cassandra's internal queues.
— What: metrics that represent a count (rather than a gauge). Why: to convey both the total count and the corresponding time interval. Example: failed jobs, by data center (4-hour intervals).
WHEN TO USE SOMETHING ELSE INSTEAD OF BAR GRAPHS

— What: metrics that can't be added sensibly. Example: average latency per load balancer. Instead use: line graphs to isolate timeseries from each host.
— What: unaggregated metrics from large numbers of sources, making the slices too thin to be meaningful. Example: completed tasks across dozens of Cassandra nodes. Instead use: solid-color bars to track the total, aggregate metric value.
Heat Maps
Heat maps show the distribution of values for a metric evolving over time.
Specifically, each column represents a distribution of values during a particular
time slice. Each cell's shading corresponds to the number of entities reporting
that particular value during that particular time.
Heat maps are designed to visualize metrics from large numbers of entities,
so they are often used to graph unaggregated metrics at the individual host or
container level. Heat maps are closely related to distribution graphs, except that
heat maps show change over time, and distribution graphs are a snapshot of
a particular window of time. Distributions are covered in the following chapter.
WHEN TO USE HEAT MAPS

— What: a single metric reported by a large number of groups. Why: to convey general trends at a glance. Example: web latency per host.
WHEN TO USE SOMETHING ELSE INSTEAD OF HEAT MAPS

— What: metrics coming from only a few individual sources. Example: CPU utilization across a small number of RDS instances. Instead use: line graphs to isolate the timeseries from each host.
— What: metrics where aggregates matter more than individual values. Example: disk utilization per Cassandra column family. Instead use: area graphs to sum values across a set of tags.
In the following chapter, we'll explore summary graphs, which are visualizations
that compress time out of view to display a summary view of your metrics.
Chapter 6:
Visualizing Metrics with
Summary Graphs
As introduced at the end of the previous chapter, summary graphs compress time
out of view to display a summary view of your metrics. For each graph type, we'll
explain how it works and when to use it. But first, we'll quickly discuss two concepts
that are necessary to understand infrastructure summary graphs: aggregation
across time (which you can think of as "time flattening" or "snapshotting"), and
aggregation across space.
Aggregation across time can mean simply displaying the latest value returned by a
metric query, or a more complex aggregation to return a computed value over a
moving time window.
For example, instead of displaying only the latest reported value for a metric query,
you may want to display the maximum value reported by each host over the past
60 minutes to surface problematic spikes:

[Toplist of maximum Redis latency per host over the past hour: sobotka 2.61, lamar 0.72, delancie-query-alert 0.01, templeton 0.01]

Instead of listing peak Redis latencies at the host level as in the example above,
it may be more useful to see peak latencies for each internal service that is built on
Redis. Or you could surface only the maximum value reported by any one host in
your infrastructure:
[Single-value summary: Max Redis latency (1h): 2.61 s]
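The aggregation behind that single number is simple to sketch: reduce each host's timeseries to its maximum over the window, then take the maximum across hosts. The latency values below are illustrative.

```python
# Redis latency samples per host over the past hour (illustrative values).
latencies = {
    "sobotka":   [0.31, 0.47, 2.61, 0.52],
    "lamar":     [0.72, 0.33, 0.41, 0.28],
    "templeton": [0.01, 0.01, 0.01, 0.01],
}

# Aggregate across time: one value per host.
max_per_host = {host: max(values) for host, values in latencies.items()}

# Aggregate across space as well: the single maximum reported by any one host.
overall_max = max(max_per_host.values())

print(max_per_host)   # {'sobotka': 2.61, 'lamar': 0.72, 'templeton': 0.01}
print(overall_max)    # 2.61
```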
Single-Value Summaries
Single-value summaries display the current value of a given metric query, with
conditional formatting (such as a green/yellow/red background) to convey whether
or not the value is in the expected range. The value displayed by a single-value
summary need not represent an instantaneous measurement. The widget can
display the latest value reported, or an aggregate computed from all query values
across the time window.
[Single-value summary: Current number of OK hosts, prod (1h): 431 hosts]
WHEN TO USE SINGLE-VALUE SUMMARIES

— What: work metrics from a given system. Why: to make key metrics immediately visible. Example: web server requests per second.
— What: critical resource metrics. Why: to provide an overview of resource status and health at a glance. Example: healthy hosts behind load balancer.
— What: computed metric changes as compared to previous values. Why: to communicate key trends clearly. Example: hosts in use versus one week ago.
Toplists
Toplists are ordered lists that allow you to rank hosts, clusters, or any other
segment of your infrastructure by their metric values. Because they are so easy
to interpret, toplists are especially useful in high-level status boards.
[Toplist example by availability zone: us-east-1a 1.74, us-east-1b 0.7, us-east-1e 0.09]
WHEN TO USE TOPLISTS

— What: work or resource metrics taken from different hosts or groups. Why: to spot outliers, underperformers, or resource overconsumers at a glance. Example: points processed per app server.
— What: custom metrics returned as a list of values. Why: to convey KPIs in an easy-to-read format (e.g. for status boards on wall-mounted displays). Example: versions of the Datadog Agent in use.
Change Graphs
Whereas toplists give you a summary of recent metric values, change graphs
compare a metric's current value against its value at a point in the past.
The key difference between change graphs and other visualizations is that change
graphs take two different timeframes as parameters: one for the size of the
evaluation window and one to set the lookback window.
[Change graph example: oauth2 29 (-47%), saml 9 (-53%)]
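The underlying computation is just a comparison of the current value against the value one lookback window ago, as in this sketch (the numbers loosely echo the example above):

```python
def percent_change(current, previous):
    """Change of the current value relative to the value one lookback window ago."""
    return 100.0 * (current - previous) / previous

# Roughly reproduces the first figure above: 29 now versus 55 one lookback window ago.
print(f"{percent_change(current=29, previous=55):+.0f}%")  # about -47%
```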
WHEN TO USE CHANGE GRAPHS

— What: cyclic metrics that rise and fall daily, weekly, or monthly. Why: to separate genuine trends from periodic baselines. Example: database write throughput, compared to same time last week.
— What: high-level infrastructure metrics. Why: to quickly identify large-scale trends. Example: total host count, compared to same time yesterday.
Host Maps
Host maps are a unique way to see your entire infrastructure, or any slice of it, at
a glance. However you slice and dice your infrastructure (by availability zone, by
service name, by instance type, etc.), you will see each host in the selected group
as a hexagon, color-coded and sized by any metrics reported by those hosts.
WHEN TO USE HOST MAPS

— What: resource utilization metrics. Why: to spot overloaded components at a glance. Example: load per app host, grouped by cluster.
— What: resource utilization metrics. Why: to identify resource misallocation (e.g. whether any instances are over- or undersized). Example: CPU usage per EC2 instance type.
— What: error or other work metrics. Why: to quickly identify degraded hosts. Example: HAProxy 5xx errors per server.
— What: related metrics. Why: to see correlations in a single graph. Example: app server throughput versus memory used.
Distributions
Distribution graphs show a histogram of a metric's value across a segment of
infrastructure. Each bar in the graph represents a range of binned values, and its
height corresponds to the number of entities reporting values in that range.
Distribution graphs are closely related to heat maps. The key difference between
the two is that heat maps show change over time, whereas distributions are a
summary of a time window. Like heat maps, distributions handily visualize large
numbers of entities reporting a particular metric, so they are often used to
graph metrics at the individual host or container level.
[Distribution graph: histogram of hosts binned by latency band]
WHEN TO USE DISTRIBUTIONS

— What: a single metric reported by a large number of entities. Why: to convey general health or status at a glance. Example: web latency per host.
In Summary
As you've seen here, each of these summary graphs has unique benefits and use
cases. Understanding all the visualizations in your toolkit, and when to use each
type, will help you convey actionable information clearly in your dashboards.
In the next chapters, we'll make these monitoring concepts more concrete by
applying them to two extremely popular infrastructure technologies: Kubernetes
and AWS Lambda.
Chapter 7:
Putting It All Together –
Monitoring Kubernetes
Container technologies have taken the infrastructure world by storm. Ideal for
microservice architectures and environments that scale rapidly or have frequent
releases, containers have seen a rapid increase in usage in recent years. But
adopting Docker, containerd, or other container runtimes introduces significant
complexity in terms of orchestration. That’s where Kubernetes comes into play.
Kubernetes can orchestrate your containers wherever they run, which facilitates
multi-cloud deployments and migrations between infrastructure platforms. Hosted
and self-managed flavors of Kubernetes abound, from enterprise-optimized
platforms such as OpenShift and Pivotal Container Service to cloud services such as
Google Kubernetes Engine, Amazon Elastic Kubernetes Service, Azure Kubernetes
Service, and Oracle’s Container Engine for Kubernetes.
This chapter walks through monitoring Kubernetes: first, how labels and tags help
you make sense of a constantly changing cluster; then, the key metrics and events
to collect; and finally, how a monitoring platform such as Datadog ties that data
together.
In the pre-container world, tags and labels were important for monitoring your
infrastructure. They allowed you to group hosts and aggregate their metrics at any
level of abstraction. In particular, tags have proved extremely useful for tracking
the performance of dynamic cloud infrastructure and investigating issues that
arise there.
A container environment brings even larger numbers of objects to track, with even
shorter lifespans, and the automation and scalability of Kubernetes only amplifies
this trend. With so many moving pieces in a typical Kubernetes cluster, labels
provide the only reliable way to identify your pods and the applications within.

To make your observability data as useful as possible, you should label your pods in
a way that allows you to look at any aspect of your applications and infrastructure,
such as the environment, the application and its version, and the team that owns it.
These user-generated labels are essential for monitoring because they are the
only way to slice and dice metrics and events across the different layers of your
Kubernetes architecture.
Kubernetes also exposes some labels from Docker. But note that you cannot apply
custom Docker labels to your images or containers when using Kubernetes. You can
only apply Kubernetes labels to your pods.
Thanks to these Kubernetes labels at the pod level and Docker labels at the
container level, you can easily slice and dice along any dimension to get a logical
view of your infrastructure and applications. You can examine every layer in your
stack (namespace, replica set, pod, or container) to aggregate your metrics and
drill down for investigation.
Because they are the only way to generate an accessible, up-to-date view of your
pods and applications, labels and tags should form the basis of your monitoring
and alerting strategies. The performance metrics you track won’t be attached to
hosts; instead, they are aggregated around labels that you will use to group or filter
the pods you are interested in. Make sure that you define a logical and easy-to-
follow schema for your namespaces and labels so that the meaning of a particular
label is easily comprehensible to anyone in your organization.
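As a sketch of slicing by label, the official Kubernetes Python client lets you select pods across the cluster with a label selector. The label keys and values below are examples of a schema you might define, not required names, and the snippet assumes a local kubeconfig with access to the cluster.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()
v1 = client.CoreV1Api()

# Select pods by the labels defined in your own schema (illustrative values).
pods = v1.list_pod_for_all_namespaces(label_selector="app=web-store,environment=prod")
for pod in pods.items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```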
A complete monitoring strategy for Kubernetes needs to cover several layers at once:

— Your hosts, even if you don't know which containers and applications they
are actually running
— Your containers, even if you don't know where they're running
— Your containerized applications
— The Kubernetes cluster itself
A Kubernetes-aware monitoring tool with service discovery lets you make full use
of the scaling and automation built into Kubernetes, without sacrificing visibility.
Service discovery enables your monitoring system to detect any change in your
inventory of running pods and automatically reconfigure your data collection so
you can continuously monitor your containerized workloads even as they expand,
contract, and shift across hosts.
Metrics Server collects resource usage statistics from the kubelet on each node
and provides aggregated metrics through the Metrics API. Metrics Server stores
only near-real-time metrics in memory, so it is primarily valuable for spot checks
of CPU or memory usage, or for periodic querying by a full-featured monitoring
service that retains data over longer timespans.
We'll briefly discuss how Kubernetes generates metrics from these APIs. Then we'll
dig into the data you can collect to monitor the Kubernetes platform itself, focusing
on the following areas:
— Cluster state metrics
— Resource metrics from Kubernetes nodes and pods
— Work metrics from the Kubernetes Control Plane
We'll also touch on the value of collecting Kubernetes events.
For Kubernetes objects that are deployed to your cluster, several similar but distinct
metrics are available, depending on the type of controller that manages those
objects. Two important types of controllers are:
— Deployments, which create a specified number of pods (often combined with a
Service that creates a persistent point of access to the pods in the Deployment)
— DaemonSets, which ensure that a particular pod is running on every node
(or on a specified set of nodes)
You can learn about these and other types of controllers in the Kubernetes
documentation.
Node status
This cluster state metric provides a high-level overview of a node's health and
whether the scheduler can place pods on that node. It runs checks on the following
node conditions:
— OutOfDisk
— Ready (node is ready to accept pods)
— MemoryPressure (node memory is too low)
— PIDPressure (too many running processes)
— DiskPressure (remaining disk capacity is too low)
— NetworkUnavailable
Each check returns true, false, or—if the worker node hasn't communicated with
the relevant Control Plane node for a grace period (which defaults to 40 seconds)—
unknown. In particular, the Ready and NetworkUnavailable checks can alert you
to nodes that are unavailable or otherwise not usable so that you can troubleshoot
further. If a node returns true for the MemoryPressure or DiskPressure check,
the kubelet attempts to reclaim resources. This includes garbage collection and
possibly deleting pods from the node.
Alternatively, a DaemonSet launches one pod on every node in your cluster (unless
you specify a subset of nodes). This is often useful for installing a monitoring agent
or other node-level utility across your cluster.
Kubernetes provides metrics that reflect the number of desired pods (e.g.,
kube_deployment_spec_replicas) and the number of currently running pods
(e.g., kube_deployment_status_replicas). Typically, these numbers should
match unless you are in the midst of a deployment or other transitional phase, so
comparing these metrics can alert you to issues with your cluster. In particular, a
large disparity between desired and running pods can point to bottlenecks, such
as your nodes lacking the resource capacity to schedule new pods. It could also
indicate a problem with your configuration that is causing pods to fail.
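A quick way to spot such a disparity is to compare each Deployment's desired replica count with the number currently available. This sketch uses the official Kubernetes Python client and assumes a kubeconfig with read access to the cluster.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for deploy in apps.list_deployment_for_all_namespaces().items:
    desired = deploy.spec.replicas or 0
    available = deploy.status.available_replicas or 0
    if available < desired:
        # A persistent gap here may point to capacity or configuration problems.
        print(f"{deploy.metadata.namespace}/{deploy.metadata.name}: "
              f"{available}/{desired} replicas available")
```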
Resource Metrics
Monitoring memory, CPU, and disk usage within nodes and pods can help you
detect and troubleshoot application-level problems. But monitoring the resource
usage of Kubernetes objects is not as straightforward as monitoring a traditional
application. In Kubernetes, you still need to track the actual resource usage of your
workloads, but those statistics become more actionable when you monitor them in
the context of resource requests and limits, which govern how Kubernetes manages
finite resources and schedules workloads across the cluster.
In a manifest, you can declare a request and a limit for CPU (measured in cores)
and memory (measured in bytes) for each container running on a pod. A request is
the minimum amount of CPU or memory that a node will allocate to the container; a
limit is the maximum amount that the container will be allowed to use. The requests
and limits for an entire pod are calculated from the sum of the requests and limits
of its constituent containers.
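For reference, here is a minimal sketch of declaring requests and limits for a single container using the Kubernetes Python client's models; the container name, image, and resource values are illustrative.

```python
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # minimum a node must reserve for the container
    limits={"cpu": "500m", "memory": "512Mi"},    # maximum the container is allowed to use
)

container = client.V1Container(
    name="web",              # illustrative
    image="web-store:1.4",   # illustrative
    resources=resources,
)
```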
Requests and limits do not define a pod’s actual resource utilization, but they
significantly affect how Kubernetes schedules pods on nodes. Specifically,
new pods will only be placed on a node that can meet their requests. Requests
and limits are also integral to how a kubelet manages available resources by
terminating pods (stopping the processes running on its containers) or evicting
pods (deleting them from a node), which we'll cover in more detail below.
51
CHAPTER 7: PUTTING IT ALL TOGETHER – MONITORING KUBERNETES
You can read more in the Kubernetes documentation about configuring requests
and limits and how Kubernetes responds when resources run low.
Comparing resource utilization with resource requests and limits will provide a
more complete picture of whether your cluster has the capacity to run its workloads
and accommodate new ones. It’s important to keep track of resource usage at
different layers of your cluster, particularly for your nodes and for the pods running
on them.
Comparing your pods' memory usage to their configured limits will alert you to
whether they are at risk of being OOM killed, as well as whether their limits make
sense. If a pod's limit is too close to its standard memory usage, the pod may get
terminated due to an unexpected spike. On the other hand, you may not want to set
a pod's limit to be significantly higher than its typical usage because that can lead
to poor scheduling decisions. For example, a pod with a memory request of 1 GiB
and a limit of 4 GiB can be scheduled on a node with 2 GiB of allocatable memory
(more than sufficient to meet its request). But if the pod suddenly needs 3 GiB of
memory, it will be killed even though it's well below its memory limit.
MEMORY UTILIZATION
Keeping an eye on memory usage at the pod and node levels can provide important
insight into your cluster's performance and its ability to successfully run workloads.
As we've seen, pods whose actual memory usage exceeds their limits will be
terminated. Additionally, if a node runs low on available memory, the kubelet flags
it as being under memory pressure and begins to reclaim resources.
In order to reclaim memory, the kubelet can evict pods, meaning it will delete these
pods from the node. The Control Plane will attempt to reschedule evicted pods on
another node with sufficient resources. If your pods' memory usage significantly
exceeds their defined requests, it can cause the kubelet to prioritize those pods
for eviction, so comparing requests with actual usage can help surface which pods
might be vulnerable to eviction.
Habitually exceeding requests could also indicate that your pods are not configured
appropriately. As mentioned above, scheduling is largely based on a pod's
request, so a pod with a bare-minimum memory request could be placed on a
node without enough resources to withstand any spikes or increases in memory
needs. Correlating and comparing each pod's actual utilization against its requests
can give insight into whether the requests and limits specified in your manifests
make sense, or if there might be some issue that is causing your pods to use more
resources than expected.
Monitoring overall memory utilization on your nodes can also help you determine
when you need to scale your cluster. If node-level usage is high, you may need to
add nodes to the cluster to share the workload.
Comparing memory requests to capacity metrics can also help you troubleshoot
problems with launching and running the desired number of pods across your
cluster. If you notice that your cluster's count of current pods is significantly less
than the number of desired pods, these metrics might show you that your nodes
don't have the resource capacity to host new pods, so the Control Plane is failing to
find a node to assign desired pods to. One straightforward remedy for this issue is
to provision more nodes for your cluster.
DISK UTILIZATION
Like memory, disk space is a non-compressible resource, so if a kubelet detects
low disk space on its root volume, it can cause problems with scheduling pods. If
a node's remaining disk capacity crosses a certain resource threshold, it will get
flagged as under disk pressure. The following are the default resource thresholds
for a node to come under disk pressure:
— imagefs.available (threshold: 15%): available disk space for the imagefs filesystem, which is used for images and container-writable layers.
Crossing one of these thresholds leads the kubelet to initiate garbage collection to
reclaim disk space by deleting unused images or dead containers. As a next step, if
it still needs to reclaim resources, it will start evicting pods.
In addition to node-level disk utilization, you should also track the usage levels
of the volumes used by your pods. This helps you stay ahead of problems at the
application or service level. Once these volumes have been provisioned and
attached to a node, the node's kubelet exposes several volume-level disk utilization
metrics, such as the volume's capacity, utilization, and available space. These
volume metrics are available from Kubernetes's Metrics API.
If a volume runs out of remaining space, any applications that depend on that
volume will likely experience errors as they try to write new data to the volume.
Setting an alert to trigger when a volume reaches 80 percent usage can give you
time to create new volumes or scale up the storage request to avoid problems.
CPU UTILIZATION
Tracking the amount of CPU your pods are using compared to their configured
requests and limits, as well as CPU utilization at the node level, will give you
important insight into cluster performance. Much like a pod exceeding its CPU
limits, a lack of available CPU at the node level can lead to the node throttling the
amount of CPU allocated to each pod.
Measuring actual utilization compared to requests and limits per pod can help
determine if these are configured appropriately and your pods are requesting
enough CPU to run properly. Alternatively, consistently higher than expected CPU
usage might point to problems with the pod that need to be identified and addressed.
Kubernetes exposes metrics for each Control Plane component (the API server, the
controller manager, the scheduler, and etcd), which you can collect and track to
ensure that your cluster's central nervous system is healthy. Note
that in managed Kubernetes environments (such as Google Kubernetes Engine or
Amazon Elastic Kubernetes Service clusters), the Control Plane is managed by the
cloud provider, and you may not have access to all the components and metrics
listed below. Also, the availability or names of certain metrics may be different
depending on which version of Kubernetes you are using.
ETCD_SERVER_HAS_LEADER
Except during leader election events, the etcd cluster should always have a leader,
which is necessary for the operation of the key-value store. If a particular member
of an etcd cluster reports a value of 0 for etcd_server_has_leader (perhaps due
to network issues), that member of the cluster does not recognize a leader and is
unable to serve queries. Therefore, if every cluster member reports a value of 0, the
entire etcd cluster is down. A failed etcd key-value store deprives Kubernetes of
necessary information about the state of cluster objects, and prevents Kubernetes
from making any changes to cluster state. Because of its critical role in cluster
operations, etcd provides snapshot and recovery operations to mitigate the impact
of failure scenarios.
ETCD_SERVER_LEADER_CHANGES_SEEN_TOTAL
This metric tracks the number of leader transitions within the cluster. Frequent
leader changes, though not necessarily damaging on their own, can alert you to
issues with connectivity or resource limitations in the etcd cluster.
APISERVER_REQUEST_LATENCIES_COUNT, APISERVER_REQUEST_LATENCIES_SUM
Kubernetes provides metrics on the number and duration of requests to the API
server for each combination of resource (e.g., pods, Deployments) and verb (e.g.,
GET, LIST, POST, DELETE). By dividing the summed latency for a specific type of
request by the number of requests of that type, you can compute a per-request
average latency. You can also track these metrics over time and divide their deltas
to compute a real-time average. By tracking the number and latency of specific
kinds of requests, you can see if the cluster is falling behind in executing any user-
initiated commands to create, delete, or query resources, likely due to the API
server being overwhelmed with requests.
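The arithmetic described above is straightforward to sketch: divide the change in summed latency by the change in request count between two scrapes of the counters. The values below are made up.

```python
def average_latency(sum_now, count_now, sum_prev=0.0, count_prev=0):
    """Average latency per request, from cumulative sum and count counters."""
    delta_count = count_now - count_prev
    if delta_count == 0:
        return 0.0
    return (sum_now - sum_prev) / delta_count

# Average over the server's lifetime:
print(average_latency(sum_now=1250.0, count_now=50_000))
# Real-time average over the most recent scrape interval:
print(average_latency(sum_now=1262.5, count_now=50_200,
                      sum_prev=1250.0, count_prev=50_000))
```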
WORKQUEUE_QUEUE_DURATION_SECONDS, WORKQUEUE_WORK_DURATION_SECONDS
These latency metrics provide insight into the performance of the controller
manager, which queues up each actionable item (such as the replication of a pod)
before it’s carried out. Each metric is tagged with the name of the relevant queue,
such as queue:daemonset or queue:node_lifecycle_controller. The metric
workqueue_queue_duration_seconds tracks how much time, in aggregate, items
in a specific queue have spent awaiting processing, whereas workqueue_work_
duration_seconds reports how much time it took to actually process those items.
If you see a latency increase in the automated actions of your controllers, you can
look at the logs for the controller manager to gather more details about the cause.
The scheduler's end-to-end latency metrics report both how long it takes to select a node for a
particular pod, as well as how long it takes to notify the API server of the scheduling
decision so it can be applied to the cluster. If you notice a discrepancy between the
number of desired and current pods, you can dig into these latency metrics to see
if a scheduling issue is behind the lag. Note that the end-to-end latency is reported
differently depending on the version of Kubernetes you are running (pre-v1.14 or
1.14+), with different time units as well.
Kubernetes Events
Collecting events from Kubernetes and from the container engine (such as Docker)
allows you to see how pod creation, destruction, starting, or stopping affects the
performance of your infrastructure—and vice versa.
While Docker events trace container lifecycles, Kubernetes events report on pod
lifecycles and deployments. Tracking Kubernetes Pending pods and pod failures,
for example, can point you to misconfigured launch manifests or issues of resource
saturation on your nodes. That’s why you should correlate events with Kubernetes
metrics for easier investigation.
You may recall from earlier in this chapter that certain cluster-level metrics—
specifically, the counts of Kubernetes objects such as the count of pods desired,
currently available, and currently unavailable—are provided by an optional cluster
add-on called kube-state-metrics. If you see that this data is missing from the
dashboard, it means that you have not deployed the kube-state-metrics service. To
add these statistics to the lower-level resource metrics already being collected by
the Agent, you simply need to deploy kube-state-metrics to your cluster.
Once kube-state-metrics is up and running, your cluster state metrics will start
pouring into Datadog automatically, without any further configuration. That’s
because the Datadog Agent’s Autodiscovery functionality detects when certain
services are running and automatically enables metric collection from those
services. Since kube-state-metrics is among the integrations automatically
enabled by Autodiscovery, there’s nothing more you need to do to start collecting
your cluster state metrics in Datadog. Check out our in-depth guide for more
information about the Agent's Autodiscovery feature.
When you deploy your instrumented application, it will automatically begin sending
traces to Datadog. From the APM tab of your Datadog account, you can see a
breakdown of key performance indicators for each of your instrumented services
with information about request throughput, latency, errors, and the performance of
any service dependencies.
You can then dive into individual traces to inspect a flame graph that breaks down
the trace into spans, each one representing an individual database query, function
call, or operation carried out as part of fulfilling the request. For each span, you can
view system metrics, application runtime metrics, error messages, and relevant
logs that pertain to that unit of work.
Chapter 8:
Putting It All Together –
Monitoring AWS Lambda
The billed duration measures the execution time rounded up to the nearest 100
ms. Billed duration is the basis for AWS's Lambda pricing, along with the function’s
memory size, which we will talk about next.
You can compare a function's duration with its billed duration to see if you can decrease
execution time and lower costs. For instance, consider a function whose logs report a
duration of 102 ms and a billed duration of 200 ms: you pay for the full 200 ms block.
If you notice the duration is consistent (e.g., around 102 ms),
you may be able to add more memory in order to decrease the duration and the
billed duration. For example, if you increase your function’s memory from 128 MB
to 192 MB and the duration drops to 98 ms, your billed duration would then be
100 ms. This means you would be charged less because you are in the 100 ms
block instead of the 200 ms block for billed duration. Though we used a simple
example, monitoring these two metrics is important for understanding the costs
of your functions, especially if you are managing large volumes of requests across
hundreds of functions.
You can compare a function's memory usage with its allocated memory in your
CloudWatch logs: each REPORT line includes both the function's memory size and
its max memory used. If a function consistently uses only a fraction of its allocated
memory, you may want to adjust the
function's memory size to reduce costs. On the other hand, if a function's memory
usage is consistently reaching its memory size then it doesn't have enough memory
to process incoming requests, increasing execution times.
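As a sketch, both figures can be pulled out of the REPORT line that Lambda writes for each invocation, and the 100 ms rounding described above can be recomputed directly. The log line here is illustrative.

```python
import math
import re

report = ("REPORT RequestId: 8f5e3c2a Duration: 102.25 ms Billed Duration: 200 ms "
          "Memory Size: 128 MB Max Memory Used: 51 MB")

duration_ms = float(re.search(r"Duration: ([\d.]+) ms", report).group(1))
memory_size_mb = int(re.search(r"Memory Size: (\d+) MB", report).group(1))
max_memory_mb = int(re.search(r"Max Memory Used: (\d+) MB", report).group(1))

billed_ms = math.ceil(duration_ms / 100) * 100  # rounded up to the nearest 100 ms block
print(f"billed for {billed_ms} ms; used {max_memory_mb} of {memory_size_mb} MB allocated")
```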
Synchronous services create the event, which Lambda passes directly to a function
and waits for the function to return a response before passing the result back to
the service. This is useful if you need the results of a function before moving on to
the next step in the application workflow. If an error occurs, the AWS service that
originally sent Lambda the event will retry the invocation.
You can also use event source mapping to link an event source, such as Amazon
Kinesis or DynamoDB streams, to a Lambda function. Mappings are resources that
configure data streams or queues to serve as a trigger for a Lambda function. For
example, when you map a Kinesis data stream to a function, the Lambda runtime
will read batches of events, or records, from the shards (i.e., the sequences of
records) in a stream and then send those batches to the function for processing.
By default, if a function returns an error and cannot process a batch, it will retry the
batch until it is successful or the records in the batch expire (for a data stream).
To ensure a function doesn’t stall when processing records, you can configure the
number of retry attempts, the maximum age of a record in a batch, and the size of a
batch when you create the event source mapping.
INVOCATIONS
Monitoring invocations can help you understand application activity and how
your functions are performing overall. Anomalous changes in invocation counts
could indicate either an issue with a function's code or a connected AWS service.
For example, an outage for a function's downstream service could force multiple
retries, increasing the function's invocation count.
Additionally, if your functions are located in multiple regions, you can use the
invocation count to determine if functions are running efficiently, with minimal
latency. For example, you can quickly see which functions are invoked most
frequently in which region and evaluate if you need to move them to another region
or availability zone or modify load balancing in order to improve latency. Services
like Lambda@Edge can improve latency by automatically running your code in
regions that are closer to your customers.
ITERATOR AGE
Lambda emits the iterator age metric for stream-based invocations. The iterator
age is the time between when the last record in a batch was written to a stream
(e.g., Kinesis, DynamoDB) and when Lambda received the batch, letting you know
if the amount of data that is being written to a stream is too much for a function to
accept for processing.
There are a few scenarios that could increase the iterator age. Most commonly, a
rising iterator age means the function is taking too long to process a batch of data
and your application is building a large backlog of unprocessed events.
To decrease the iterator age, you need to decrease the time it takes for a Lambda
function to process records in a batch. Long durations could result from not having
enough memory for the function to operate efficiently. You can allocate more
memory to the function or find ways to optimize your function code.
Adjusting a stream’s batch size, which determines the maximum number of records
that can be batched for processing, can also help decrease the iterator age in some
cases. If a batch consists of mostly calls to simply trigger another downstream
service, increasing the batch size allows functions to process more records in a
single invocation, increasing throughput. However, if a batch contains records that
require additional processing, then you may need to reduce the batch size to avoid
stalled shards.
Every function in a region draws from a shared, account-level pool of concurrent
executions, and you can reserve a portion of that pool for a specific function.
Reserving concurrency is useful if you know that a function regularly requires more
concurrency than others. You can also reserve concurrency to ensure that a function
doesn't process too many requests and overwhelm a downstream service. Note that if
a function uses all of its reserved concurrency, it cannot draw additional
concurrency from the unreserved pool. Only reserve concurrency for a function if
doing so does not affect the performance of your other functions, since reserved
concurrency reduces the size of the pool available to them.
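A minimal sketch of reserving concurrency with boto3 follows; the function name and limit are placeholder assumptions:

import boto3

lambda_client = boto3.client("lambda")

# Cap this function at 50 concurrent executions; those 50 are also
# removed from the account's unreserved concurrency pool.
lambda_client.put_function_concurrency(
    FunctionName="example-processor",
    ReservedConcurrentExecutions=50,
)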
CONCURRENT EXECUTIONS
To help you monitor concurrency, Lambda emits the concurrent executions metric.
This metric allows you to track when functions are using up all of the concurrency in
the pool.
In the example above, you can see a spike in executions for a specific function.
As mentioned previously, you can limit concurrent executions for a function by
reserving concurrency from the common execution pool. This can be useful if you
need to ensure that a function doesn’t process too many requests simultaneously.
However, keep in mind that Lambda will throttle the function if it uses all of its
reserved concurrency.
The graphs above show a spike in unreserved concurrency and one function using
most of the available concurrency. This could be due to an upstream service
sending too many requests to the function.
If a function has a long startup time (e.g., it has a large number of dependencies),
requests may experience higher latency—especially if Lambda needs to initialize
new instances to support a burst of requests. You can mitigate this by using
provisioned concurrency, which automatically keeps function instances pre-
initialized so that they’ll be ready to quickly process requests.
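As a rough sketch (assuming boto3 and a placeholder function alias), provisioned concurrency can be configured like this:

import boto3

lambda_client = boto3.client("lambda")

# Keep 10 pre-initialized instances warm for the "live" alias of the function.
# Provisioned concurrency applies to a published version or alias, not $LATEST.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="example-processor",
    Qualifier="live",                    # placeholder alias
    ProvisionedConcurrentExecutions=10,
)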
If you only need to collect Lambda logs, you can use Datadog's dedicated Lambda
Layer and Forwarder. Datadog’s Lambda Layer runs as a part of each function’s
runtime to automatically forward CloudWatch logs to the Datadog Forwarder,
which then pushes them to Datadog. Deploying the Forwarder via CloudFormation
is recommended as AWS will then automatically create the Lambda function with
the appropriate role, add Datadog’s Lambda Layer, and create relevant tags like
functionname, region, and account_id, which you can then use in Datadog to
search your logs.
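If you prefer to attach the layer to a function manually, a hedged sketch with boto3 might look like the following; the layer ARN here is a placeholder, so check Datadog's documentation for the current region- and runtime-specific ARN and any additional handler or environment configuration:

import boto3

lambda_client = boto3.client("lambda")

# Attach a Lambda Layer to an existing function. Note that Layers replaces
# the function's full layer list, so include any layers already in use.
lambda_client.update_function_configuration(
    FunctionName="example-processor",
    Layers=[
        "arn:aws:lambda:us-east-1:123456789012:layer:Example-Datadog-Layer:1",  # placeholder ARN
    ],
)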
For example, with the dashboard above, you can easily track cold starts, errors,
and memory usage for all of your Lambda functions. You can customize your
dashboards to include function logs and trace data, as well as metrics from any of
your other services for easy correlation.
You can also use Datadog's Service Map to visualize all your serverless components
in one place. This information helps you quickly understand the flow of traffic across
upstream and downstream dependencies in your environment.
For example, if you notice a spike in Lambda errors on your dashboard, you can use
Log Patterns to quickly search for the most common types of errors. In the example
below, you can see a cluster of function logs recording an AccessDeniedException
permissions error. The logs provide a stack trace so you can troubleshoot further.
Datadog also provides integrations for other services you may use with your
serverless applications, such as AWS Fargate, Amazon API Gateway, Amazon
SNS, and Amazon SQS. This ensures that you get visibility into every layer of your
serverless architecture. With these integrations enabled, you can drill down
to specific functions that are generating errors or cold starts to optimize their
performance.

AWS charges based on the time it takes for a function to execute, how much memory
is allocated to each function, and the number of requests made to your function.
This means that your costs could quickly increase if, for instance, a high-volume
function makes a call to an API Gateway service experiencing network outages and
has to wait for a response.
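To make the cost model concrete, here is a back-of-the-envelope sketch in Python; the per-GB-second and per-request rates below are placeholder assumptions, since actual pricing varies by region and changes over time:

# Rough Lambda cost estimate: compute time (GB-seconds) plus request count.
# The rates below are placeholder assumptions, not current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000167
PRICE_PER_MILLION_REQUESTS = 0.20

invocations = 10_000_000        # requests per month
avg_duration_seconds = 0.8      # average billed duration per invocation
memory_gb = 0.512               # memory allocated to the function

compute_cost = invocations * avg_duration_seconds * memory_gb * PRICE_PER_GB_SECOND
request_cost = (invocations / 1_000_000) * PRICE_PER_MILLION_REQUESTS

print(f"Estimated monthly cost: ${compute_cost + request_cost:.2f}")

Because billed duration scales the compute term directly, a function left waiting on a slow downstream dependency can roughly double its compute cost if its average duration doubles.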
With tracing, you can map upstream and downstream dependencies such as
API Gateway and trace requests across your entire stack to pinpoint any latency
bottlenecks. Datadog's Lambda extension also allows you to analyze serverless logs
to quickly identify the types of errors your functions generate.
To start analyzing trace data from your serverless functions, you can use Datadog’s
Serverless view. This view gives a comprehensive look at all of your functions and
includes key metrics such as invocation count and memory usage. You can search
for a specific function or view performance metrics across all your functions.
You can also sort your functions in the Serverless view by specific metrics to help
surface functions that use the most memory, or that are invoked the most, as seen
in the example below.
When you click on a function, you will see all of its associated traces and logs, as
well as a detailed breakdown of information for each invocation such as duration,
related error messages, and whether the function experienced a cold start during
the invocation.
API latency and cold starts are two common issues with serverless functions, both
of which can significantly increase a function's execution time. Cold starts typically
occur when functions scale behind the scenes to handle more requests. API latency
could be a result of network or other service outages. Datadog enables you to
proactively monitor latency and cold starts for all your functions.
For example, you can create an anomaly alert to notify you when a function
experiences unusual latency. From the triggered alert, you can pivot to traces and
logs to determine if the latency was caused by cold starts or degradation in an API
service dependency. Datadog also automatically detects when cold starts occur
and applies a cold_start tag to your traces, so you can easily surface functions
that are experiencing cold starts to troubleshoot further.
If a function's increased execution time is a result of too many cold starts, you can
configure Lambda to reduce initialization latency by using provisioned concurrency.
Latency from an API service, on the other hand, may be a result of cross-region
calls. In that case, you may need to colocate your application resources in the
same AWS region.
Datadog provides a list of built-in alerts you can enable from the Serverless view to
automatically notify you of critical performance issues with minimal configuration,
such as a sudden increase in cold starts or out-of-memory errors.
There are also several alert types you can use to create custom alerts for your
specific use case, so you're notified only about the issues you care about.
For example, you can create an alert that notifies you if a function has been
throttled frequently over a specific period of time. Throttles occur when there is
not enough capacity for a function, either because the available concurrency is
used up or because requests are coming in faster than the function can scale.
Configuring the alert to trigger a separate notification per affected function saves
you from creating duplicate alerts and gives you continuous, scalable coverage of
your environment, no matter how many functions you're running.
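As a sketch of what such a custom alert could look like when created programmatically through Datadog's monitor API (the threshold, metric query, tags, and notification handle here are assumptions for illustration):

import os
import requests

# Multi-alert monitor: grouping by functionname triggers a separate
# notification for each function that exceeds the throttle threshold.
monitor = {
    "name": "Lambda function is being throttled",
    "type": "query alert",
    "query": "sum(last_5m):sum:aws.lambda.throttles{*} by {functionname} > 10",
    "message": "{{functionname.name}} was throttled more than 10 times in 5 minutes. @slack-serverless-oncall",
    "tags": ["service:serverless"],
}

response = requests.post(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json=monitor,
)
response.raise_for_status()
print(response.json()["id"])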
Chapter 9:
Datadog Is Dynamic,
Cloud-Scale Monitoring
In the preceding chapters, we demonstrated how Datadog can help you track the
health and performance of Kubernetes, AWS Lambda, and all of their associated
infrastructure. Whatever technologies you use, Datadog enables you to view and
analyze metrics and events from across your infrastructure.
Datadog was built to meet the unique needs of modern, cloud-scale infrastructure:
— Collaboration baked in. Datadog helps teams stay on the same page with
easily sharable dashboards, graphs, and annotated snapshots. Seamless
integrations with industry-leading collaboration tools such as PagerDuty,
Slack, and HipChat make conversations around monitoring data as
frictionless as possible.
If you are ready to apply the monitoring and visualization principles you've learned
in this book, you can sign up for a full-featured Datadog trial at www.datadog.com.
Above all, we hope that you find the information in this book to be instructive as
you set out to implement monitoring for your infrastructure, or to improve on
existing practices. We have found the frameworks outlined in these chapters to
be extremely valuable in monitoring and scaling our own dynamic infrastructure,
and we hope that you find them equally useful. Please get in touch with us by
email ([email protected]) or on Twitter (@datadoghq) if you have questions or
comments about this book or about Datadog.
Happy monitoring!