AWS Splunk Infrastructure Monitoring 101: The Power to Predict and Prevent
The ability to see what's happening across an organization's infrastructure helps teams to predict and prevent outages.
[Figure: The Complexity of IT Infrastructure. Interconnected components include building systems, dev/apps servers, cloud, network, office, storage, backup/DR, security and desktop/BM.]
More complexity = more room for failure
As we see in the preceding graphic, modern IT infrastructure is an
extraordinarily complex system of interconnected technologies, each of
which has the potential to run into issues or fail outright. And with more
components being added to these stacks as technology evolves, new
opportunities for outages arise. In fact, between 2017 and 2018, instances
of outages or “server service degradation periods” increased from 25% to
31%, and if we look at on-premises data centers, that number rises to 48%.*
[Chart: Outages rose from 25% in 2017 to 31% in 2018 and 34% in 2019. Callout: 60% of data center outages could have been prevented with better management, processes or configuration.]
Seventy-eight percent of organizations say they had an IT service outage in the past three years — a higher percentage than in previous years — and 41% classified it as minimal or negligible. Outages in these categories signal bigger problems and are troubling more for their frequency than for their singular impact. When asked about significant, serious or severe outages — which can cause substantial financial and reputational damage — 31% have been affected.**

About 20% of organizations had a serious or severe outage in the past three years — that is, an outage that was costly, caused reputational damage and, in some cases, had other major implications. Nearly a third of all outages cause financial or reputational damage.

Outages are increasingly costly for businesses. In 2020, a greater percentage of outages cost more than $1 million (now nearly one in six rather than one in ten, as in 2019), and a greater percentage cost between $100,000 and $1 million (40% vs. 28%).

Another angle on that statistic: because of largely preventable errors, almost half of employees and users experienced issues with their apps and services. That kind of disruption can result in thousands of employee hours wasted, customer dissatisfaction and, ultimately, loss of business.

Recent high-profile outages illustrate the risk:

• On August 25, Slack suffered a service outage that affected users in the U.K. and western and southern Europe. Slack users faced trouble with files, messages and connecting to Slack.

• On September 23, Tesla suffered an hour-long global outage of its internal systems that left several Tesla owners unable to connect to their cars through the mobile app or the website. Tesla's energy products, Tesla Solar and Powerwall home battery systems, were inoperative too. The outage was due to an internal break of its application programming interface (API).

The best way to ensure that issues are resolved quickly — or prevented altogether — is to monitor and troubleshoot the underlying infrastructure as well as the mission-critical apps and services that run on it. While observing any one element of the infrastructure stack is a straightforward proposition, observing each piece individually introduces a host of additional problems.
We can think of ITOps as a stack of physical and logical layers, each with its own technologies, systems and services,
and each with a corresponding team or individual responsible for monitoring and maintaining it. This makes gaining
visibility into the infrastructure as a whole fundamentally problematic, despite being essential.
A per-layer monitoring practice leads to siloed teams and incompatible views of data. Each layer has different vital
metrics, different monitoring tools and dashboards and different personnel behind the keyboard. In practice, per-layer
monitoring means people looking at limited information using different languages, leading to difficulties detecting and
investigating outages and issues as well as restoring service.
Observability is key to a successful IT monitoring solution

One way to avoid the problems of per-layer monitoring is building with observability in mind. Observability is the natural evolution of what we used to call monitoring. Observability recognizes that today's infrastructure and applications are living, breathing organisms that evolve at a much faster rate than ever before. Observability encompasses all of the things we used to do in monitoring, like watching for known failure conditions, and extends it to support the challenges of today's applications, like being prepared for all the unknown failure conditions.

IT stack layers:

Servers

A high-quality user experience depends on effective monitoring of the systems that support the product. It allows administrators and ITOps personnel to see resource usage patterns and optimize the servers keeping websites and applications running smoothly.

Server operating systems routinely record data such as connections, file systems mounted and system memory usage. The level of detail is configurable by the system administrator; however, there are sufficient options to provide a complete picture of system activity throughout its lifetime. Having visibility into these pieces of server data and monitoring them proactively can help teams find resolutions more quickly or prevent outages altogether.

Imagine a gaming company whose users depend on reliable, high-speed access to a web app — not terribly hard to picture, is it? Having immediate visibility and insight into server performance would be critical to that company's success. The ability to quickly resolve server-based issues (or predict and avoid them altogether) would have a significant impact on the product's uptime and directly impact customer satisfaction and, ultimately, revenues.

Having a single tool from which to monitor the health of servers — one that correlates event data and log data into a seamless experience — enables ITOps teams to quickly isolate what is driving the failure (like memory usage on a single server) and resolve it. It also facilitates proactivity. The ability to create alerts and automations within the monitoring tool saves teams time and allows them to focus their efforts on other tasks.
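As a concrete illustration of the kind of server data described above, the following minimal sketch polls a few operating-system metrics (CPU, memory, disk and established network connections) and prints a simple alert when a threshold is crossed. It is only a sketch under stated assumptions: it relies on the open-source psutil library, the thresholds are arbitrary, and a real deployment would forward these readings to a monitoring platform rather than print them.

# Minimal sketch (not a Splunk API): poll basic server health metrics with the
# third-party psutil library and flag simple threshold breaches.
import time

import psutil

CPU_ALERT_PCT = 90      # illustrative threshold, not a recommendation
MEMORY_ALERT_PCT = 85   # illustrative threshold, not a recommendation


def collect_metrics():
    """Gather a small slice of the data a server OS exposes about itself."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "tcp_established": sum(
            1 for c in psutil.net_connections(kind="tcp")
            if c.status == psutil.CONN_ESTABLISHED
        ),
    }


def check_alerts(metrics):
    """Return human-readable alerts for any metric over its threshold."""
    alerts = []
    if metrics["cpu_percent"] > CPU_ALERT_PCT:
        alerts.append(f"CPU usage high: {metrics['cpu_percent']:.0f}%")
    if metrics["memory_percent"] > MEMORY_ALERT_PCT:
        alerts.append(f"Memory usage high: {metrics['memory_percent']:.0f}%")
    return alerts


if __name__ == "__main__":
    while True:
        m = collect_metrics()
        print(m)
        for alert in check_alerts(m):
            print("ALERT:", alert)  # in practice, send to the monitoring tool
        time.sleep(60)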
Network

While each organization's needs and data sources will vary, there are reasons for monitoring network data that are common across companies and institutions:

• Protecting corporate networks from attacks.

• Providing visibility into network traffic.

• Determining the role of the network in the overall availability and performance of critical services.

Monitoring a network means more than having visibility into the state of the hardware that supports that network, like routers, switches, etc. It includes monitoring network event logs, activities across the network infrastructure, traffic bottlenecks or suspicious behavior.

Cloud

Running workloads in a cloud environment is not "set it and forget it." ITOps teams still need to monitor the performance, usage, security and availability of the cloud infrastructure continuously. And with the right solutions, it's possible to manage IT systems and derive actionable insights from all of the data in one system, even if the services are running in hybrid environments.

When an organization migrates its services to a cloud platform (or between cloud platforms), for instance, having end-to-end visibility into every stage of the migration can help teams establish baseline performance, monitor services during the transition and ensure that all services are running optimally after the transition is over.
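To make the idea of pulling cloud infrastructure data into one place concrete, here is a minimal sketch that reads a basic utilization metric for a single EC2 instance from Amazon CloudWatch. It assumes the boto3 SDK, configured AWS credentials and a placeholder instance ID; a real monitoring setup would collect far more metrics and stream them into a consolidated solution rather than print them.

# Minimal sketch: fetch average CPU utilization for one EC2 instance from
# CloudWatch. Assumes boto3 is installed and AWS credentials are configured;
# the instance ID is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder value

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                # one datapoint per five minutes
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}% CPU")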
Services running on hybrid and cloud infrastructures can be opaque, leading to gaps in ITOps teams' understanding of the system as a whole. Organizations eager to get the benefits of cloud often overspend on cloud services — on deprecated or unused services, unknown redundancies or excessive resources. Ingesting all of the cloud infrastructure data into a single environment, replacing the multitude of individual monitoring tools with a consolidated solution, can provide an understanding of how resources are performing and being used, allowing for optimization of utilities and billing.

[Figure: Public cloud (AWS) alongside hybrid cloud (on-prem plus a private/public cloud mix).]

Infrastructure monitoring should provide out-of-the-box, end-to-end visibility into all stages of cloud migration — before, during and after — and full visibility into public cloud IaaS. The right monitoring solution will simplify the multitude of monitoring tools and allow you to monitor your entire stack in one place. Teams can collaborate more efficiently, with greater visibility into resources. Built-in dashboards and accurate alerts provide shorter mean time to detect (MTTD), helping resolve issues before they impact operations.

Kubernetes and Containers

Since the introduction of the concept in 2013, adoption of containers has skyrocketed across technology organizations. They share some conceptual features with virtual machines, but they differ in a few essential ways. The easiest way to understand a container is to think of it as exactly that — a container — a receptacle that holds something securely and can be used to transport its contents. A software container performs a similar function. It allows developers to package an application's code, configuration files, libraries, system tools and everything else needed to execute that app into a self-contained unit, so that they can move the package and run it anywhere with ease.

Containers enable a number of significant benefits to organizations, developers and users — faster deployment, smaller footprints and consistency across environments. But containers, like virtual machines, have their own system metrics that need to be monitored, and with many containers running side by side, the task of monitoring, optimizing and troubleshooting them becomes much more complicated.

Cloud-native infrastructure such as containers, Kubernetes and serverless is highly dynamic and ephemeral. When the cloud infrastructure only lives for minutes, the monitoring solution needs to detect and enable automatic remediation within seconds.

For all the benefits that containers bring to IT organizations, they bring new considerations that must be addressed, including:

• Significant blind spots: Containers are designed to be disposable. Because of this, they introduce several layers of abstraction between the application and the underlying hardware to ensure portability and scalability. This all contributes to a significant blind spot when it comes to conventional monitoring.

• Increased need to record: The easy portability of so many interdependent components creates an increased need to maintain telemetry data to ensure observability into the performance and reliability of the application, container and orchestration platform.

• The importance of visualizations: The scale introduced by containers and container orchestration requires the ability both to visualize the environment for immediate insight into infrastructure health and to zoom in and view the health and performance of individual containers, nodes and pods. The right monitoring solution should provide this workflow.

A good container monitoring solution enables ITOps to stay on top of a dynamic container-based environment by unifying container data with other infrastructure data to provide better contextualization and root cause analysis. Learn more about container monitoring in The Essential Guide to Container Monitoring.
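As a small illustration of keeping watch over a dynamic container environment, the sketch below uses the official Kubernetes Python client to list pods across namespaces and surface any that are not running or are restarting frequently. It assumes the kubernetes package and a valid kubeconfig, and the restart threshold is arbitrary; a full container monitoring solution would correlate this signal with node, container and application metrics.

# Minimal sketch: flag pods that are not Running or that restart frequently.
# Assumes the `kubernetes` Python client and a reachable cluster/kubeconfig;
# the restart threshold is illustrative only.
from kubernetes import client, config

RESTART_THRESHOLD = 5  # arbitrary example value

config.load_kube_config()  # use config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    phase = pod.status.phase
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    if phase != "Running" or restarts > RESTART_THRESHOLD:
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
              f"phase={phase}, restarts={restarts}")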
Having a solution that provides a holistic view of the infrastructure alongside detailed views of
individual components is vital if an organization wants to proactively tackle infrastructure issues
and reduce mean time to detection (MTTD), investigation and restoration. It’s also an essential
piece of future planning; knowing how the infrastructure has performed historically, and how it’s
performing in real time, provides invaluable insights that reduce complexity when integrating new
technologies and building new experiences for users and employees.
A single platform with a unified experience that provides ITOps with access to all the information across domains opens up opportunities for cross-functional investigation and holistic end-to-end infrastructure monitoring. It removes blind spots from the system and, as a result, reduces mean time to resolution (MTTR) because teams can more quickly identify the problem, fix it and move forward.

The biggest benefit of an AI/ML-powered monitoring system is the enormous savings in time and effort on the part of ITOps teams. When repetitive tasks and processes are automated, ITOps teams have the bandwidth to do the kinds of things AI and ML are ill-equipped to do — creative problem solving, upgrading existing technologies and planning for the future.
• Learn more about Acquia's cloud monitoring success.

• Learn more about Namely's microservices monitoring success.

• Learn more about Imprivata's container monitoring success.

• Learn more about CloudShare's virtualization monitoring success.
Learn More
Splunk, Splunk>, Data-to-Everything, D2E and Turn Data Into Doing are trademarks and registered
trademarks of Splunk Inc. in the United States and other countries. All other brand names, product
names or trademarks belong to their respective owners. © 2021 Splunk Inc. All rights reserved.