4 Ways to Improve Your DevOps Testing

The Role of Testing in Incident Management, Security, Code Releases, and Operations

Running a reliable enterprise that exceeds expectations and stays safe and compliant is not easy.
Even when you think your work is done, it’s not.

Monitoring. Code releases. Privacy and security. Continuous testing for DevOps.

To maintain an environment that performs continuously and doesn’t get slowed by attacks, old code,
or poor fundamentals, there’s one thing you can never stop doing.


We’ve compiled four essays from some of our best minds in testing, operations, and security. Because your
enterprise relies on a host of systems, tools, platforms, and architectures, we’ve focused on integration-based
collaboration. We hope these four essays on common testing mistakes and how to optimize your testing
processes will be helpful.

The four essays in this eBook are Why Service Monitoring and Testing Matter by VP of Support Services
Craig Gulliver, Test in Production to Make Code Releases Safer by Director of Operations Adam Serediuk,
DevOps is Failing These Three Tenets of Privacy Compliance by Ops Security Manager Bob Hawk, and
Continuous Testing Is Crucial for DevOps But Not Easy by QA Architect Deepa Guna.

The goal of this introduction is to get you into the essays, not to keep you here. So please read on!

Why Service Monitoring and Testing Matter

A major part of running a cloud service is seeing whether the system is healthy and performing as expected. Good monitoring should provide the necessary transparency across all aspects of your infrastructure. These might include operating systems, applications, build and deployment pipelines, web traffic, sales pipeline, and so on. Monitoring services allow teams to understand the health of all the components required to deliver your service to clients.

As a cloud service provider, you learn quickly that service downtime impacts your business in many ways. Without the right level of transparency across your technology stack, troubleshooting and investigation during an incident eat up valuable time and resources. That’s why it’s important to employ different techniques and use different levels of monitoring that will not only detect but also prevent issues and help you solve problems faster before they impact your clients.

Usually, monitoring is based on the premise that the application will detect when an error has occurred, and generate a message that can be acted on. While this works on a basic level, you’re waiting until something goes wrong before you can act, at which point the error may have already impacted your clients. You need to consider different ways of testing your service in conjunction with the monitors you have in place. There are many forms of service testing such as integration testing, component testing, black-box testing, system testing, etc. For example, white-box testing can be useful to help monitor the internal structures or architecture of your service.

Automation is king!
Many repeatable processes can be automated with the right tools. At xMatters, we prefer automation over manually driven processes. Automation affords operators and support agents the ability to focus on higher-level tasks instead of running or coordinating groups of commands and ensuring that they worked as expected. Like all code, automations must be maintained and tested constantly to ensure they are reliable and correct when needed.

While automation is not a silver bullet, it can certainly change the way your teams do business when it matters most. The focus it provides is imperative to an incident team when resolving incidents or restoring service levels to normal.

Understanding the variables
At xMatters, our testing must account for different environment variables for notification delivery. For instance, email notifications rely on internet connectivity and email relays, while SMS messages rely on the availability of mobile networks. Obviously, these are outside our private cloud infrastructure and out of our control.

Compounding these issues, testing in a non-production or staging environment is entirely different from production. Even when the infrastructure is an exact mirror of production, we have found most clients cannot duplicate the same traffic and the transaction volume as found in production. This makes each production environment unique, which affects the baseline benchmark for tests.

A simplified example would be testing notification delivery and user responses. In a quiet system, the response can be quick as there is little activity, and more resources available for processing. However, in a busy production system the response time may be longer depending on the levels of traffic. This is tricky for monitoring and testing since production systems always have heavier traffic than testing or staging environments.
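
Because each environment has its own baseline, one option is to derive the alert threshold for a response-time check from that environment’s own recent history rather than from a single shared number. The following is a minimal sketch under stated assumptions: the function names, the 95th-percentile choice, and the 1.5x headroom are illustrative and are not described by xMatters.

```python
from statistics import quantiles


def latency_threshold(samples, percentile=95, headroom=1.5):
    """Derive an alert threshold from one environment's own response times.

    samples: recent response times in seconds for a single environment.
    percentile and headroom are assumed tuning knobs, not recommended values.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cut_points = quantiles(samples, n=100, method="inclusive")
    baseline = cut_points[percentile - 1]
    return baseline * headroom


def is_degraded(observed_seconds, samples, **kwargs):
    """Report whether an observed response time exceeds the environment's threshold."""
    return observed_seconds > latency_threshold(samples, **kwargs)
```

With this approach the same 2.5-second response can count as degraded in a quiet staging system and as normal in a busy production system, which is the behavior the example above calls for.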


MONITORING: Employ different techniques to detect and prevent issues

TESTING: Test your service in conjunction with the monitors you have in place

AUTOMATION: Automate repeatable processes

DELIVERY: Account for environment variables for notification delivery

SERVICE HEALTH: Exercise your services in different ways to gain a holistic view

TRANSPARENCY: Be transparent and honest with your customers


Incorporating system testing
This is not groundbreaking, as many other cloud services have provided insight into how they test their services, for example the Chaos Monkey at Netflix. At xMatters, we’ve learned over time that you need to exercise your services in different ways to have a holistic view of service health. We have incorporated many tools to exercise our service in different ways to help us know about issues before they have any business impact on our clients.

For SMS delivery, we’ve implemented a testing solution with a service vendor that provides a global network of real SMS devices. This allows us to test SMS delivery across various regions, and even different carriers within a single region, across the world. As part of this testing, we can measure when a message was sent to the vendor, how long it took to reach the device, and whether the content sent matches the original message, among other things.

This information is fed back into our monitoring system, which is then configured to detect various failure conditions such as internal component failures, performance issues, or upstream carrier issues. Moreover, this kind of testing not only exercises our own cloud infrastructure and components, but also the communication networks required to reach end user devices: full stack testing.

Being transparent
When things go sideways, it’s important to be transparent and honest. At xMatters, we strive to make sure that we provide the details that matter to our clients, that we are learning from these unfortunate events, and that we can demonstrate we are working towards improving our services for everyone. We understand that clients want to know the details and they deserve to know.

Being honest is much easier when you can demonstrate that you responded to an issue quickly and responsibly. Responding appropriately requires planning and processes. Demonstrating that you responded appropriately requires preserving issues and conversations for post mortems.

Doing so manually is time-consuming and error-prone. That’s why at xMatters, we integrate directly with systems like Zendesk, Splunk, Pingdom, JIRA, and StatusPage.

In a 2017 survey of more than 1,000 DevOps organizations, half of all respondents said they lack a consistent process for responding to a major incident. The greatest delay is the time a ticket sits in the queue before an engineer touches it. You want to make sure you resolve the issue before a customer reports it. This is the essence of proactive customer service.
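
As a rough illustration of the SMS delivery test described under "Incorporating system testing", the sketch below evaluates one probe message for the three measurements mentioned there: whether it arrived, how long it took, and whether the content matches. The SmsProbeResult fields, the evaluate_probe function, and the 60-second limit are all assumptions made for this example; they do not represent the vendor’s API or the xMatters implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SmsProbeResult:
    """One test message sent to a real device (hypothetical record shape)."""
    region: str
    carrier: str
    sent_at: float                  # epoch seconds when handed to the vendor
    received_at: Optional[float]    # epoch seconds on the device, None if never seen
    sent_body: str
    received_body: Optional[str]


def evaluate_probe(result: SmsProbeResult, max_delay_seconds: float = 60.0) -> List[str]:
    """Return a list of failure conditions for the monitoring system to act on."""
    if result.received_at is None:
        return ["not delivered"]
    problems = []
    delay = result.received_at - result.sent_at
    if delay > max_delay_seconds:
        problems.append(f"slow delivery: {delay:.1f}s")
    if result.received_body != result.sent_body:
        problems.append("content mismatch")
    return problems
```

An empty list means the probe passed. Tagging each result with region and carrier lets the monitoring side separate an internal failure from an upstream carrier issue.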

DevOps Is Failing These Three Tenets of Privacy Compliance

Think your automated tests will catch your security or privacy vulnerabilities? I’ll bet you’re wrong.

I know, data is streaming from multiple sources into your SIEM systems, and you’ve configured triggers for your reporting. You’re watching results from automated tests on software running in production. All your monitoring tools say your code is running flawlessly and there are no errors. You’re running automated tests, just as the DevOps playbook suggests, but they won’t catch security or privacy compliance vulnerabilities. Why not? DevOps is falling behind.

The hard truth about DevOps
DevOps is falling behind because privacy is a different matter. It is a matter of complying with laws. There are three main tenets to privacy compliance: Is the privacy policy in alignment with the current laws and has it been fully documented, have people seen the documented privacy policy, and have people consented to their data being used according to the declaration of the privacy policy?

It’s a lot to contend with. To address privacy and security appropriately, you have to embed privacy by design from the beginning. It can’t be bolted on.

In fact, properly patched code is 80% of security. The firewalls, antivirus software, and other additional elements are backup measures in case the fundamentals don’t work. Think of proper code as the moat and the drawbridge, while the guards are the firewall. If a product or service is at the highest quality possible, privacy and security will be embedded and seamless.

Organizations have gravitated toward DevOps because of its emphasis on process, collaboration, and automation.


ALIGNMENT: Align your privacy policy with current laws and fully document it

VISIBILITY: Make sure people read your privacy policy

CONSENT: Gain consent for use of user data in accordance with the privacy policy


Unfortunately, automation has come at the expense of other things like privacy and security.

Are privacy and security real?
Your privacy and security are only as meaningful as the alignment between your security implementation and your need for privacy and security. If those things are not aligned, privacy and security are just academic concepts.

When we run tests, we’re testing the code to see whether it works. Security may not be in the testing scope. For many companies, security is tough to test for because it’s in the firewall and antivirus software. Let’s look at another industry:

Henry Ford introduced the Model T in 1908 and went on to revolutionize production with the moving assembly line. The Model T gave people what they wanted: fast, reliable transportation at an affordable price.

Security features? None. Later, of course, Ford introduced an electric starter, a foot accelerator, a foot brake, dashboard gauges, seat belts, air bags, crumple zones, firewall, and more. Security and safety are built in. Having those features built in from the start helps to ensure quality, and in the modern era these points are regulated by lawmakers.

Today’s cars have hundreds of sophisticated safety features. Rear-view cameras, fluid level sensors, tire pressure sensors, nearby car detectors, auto correction technology for staying in your lane, and more features are built in from the start.

Back to our story: The same thing is happening to information technologies and information systems. Security is an artifact of the youth of the industry. Innovations come out of immature industries, and security is fully integrated when the industries become mature.

Security requires three things: safety from things that don’t work right, safety from malicious activity, and privacy protection. Security regarding things not working and defense against malicious activity have advanced since the software industry has started to mature. Privacy compliance is still growing and evolving.



Privacy compliance is an evolutionary arms race. Social and business factors increase risks. Laws change to enforce behaviors that will mitigate those risks. And the security features built into your code comply with the laws and offset the behavior of both well-meaning and nefarious people who interact with your company’s code and infrastructure.

Next steps in security and compliance
There are three things you must do to ensure security and compliance given the current state of business.

First, validate code as part of development. There are a few ways to do this. You can have it validated by other human beings, but of course human beings make mistakes. You can also use automated scanners against known vulnerabilities. A third option is a function map or workflow vetted by someone who knows privacy. That person could be a lawyer with cross-disciplinary technology experience, or it could be a privacy expert on staff.

Second, make sure you’re not breaking any laws when you’re coding and writing processes for automation. There is no magic way of knowing whether you are in legal compliance. In adherence with DevOps best practices, you must map function and workflows from a legal perspective.

Third, document the function and workflow. Function and workflow should not live in people’s heads! When documented and shared, workflow helps to support the collaboration that is the heart of DevOps. When you integrate privacy and security into your product by design, you put your organization on the road to providing effective, safe, and secure software to your customers.

We are always improving
At xMatters we strive toward integrating privacy compliance into our products and services. We do this by constantly improving our understanding of privacy compliance requirements and applying a risk-based approach to introducing security controls.

Security controls span physical, technical, and administrative domains. Using a tiered approach, we use technical controls first. Where technical controls are not available or have failed, we use administrative controls such as awareness and training. Administrative controls are focused on individuals in regard to necessary information. In other words, we use context and situational awareness to get the right information into the right hands. We build processes to avoid overwhelming people with unnecessary information or keeping people from the information they need to do their jobs.

There is a misconception that privacy and security require technical solutions. And to an extent, that’s true. But really, it’s a people issue and needs to be solved through training, awareness, and the flow of information.

We use these fundamentals not only to protect ourselves, but to protect our valued customers. By understanding the unique requirements of each business, we can help our clients understand how their data is being used and help them stay compliant with the law, and safe from untrustworthy hands.

We are in a time of great change, and some situations have no precedent to guide us. Changes to Safe Harbor are a good example. And now, regulators are building teeth into laws by applying enormous penalties for running afoul of them. We are confident that our proactive stance on compliance and safety will continue to serve our customers well.
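
One way to keep function and workflow out of people’s heads, as the third step under "Next steps in security and compliance" asks, is to record the workflow map as data and check it automatically during development. The sketch below is an illustration under loud assumptions: the WorkflowStep fields and the undocumented_steps helper are hypothetical, and a passing check only shows that every step touching personal data has a documented purpose and a named reviewer. It does not show that the step is legally compliant.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkflowStep:
    """One documented step in a function or workflow map (hypothetical shape)."""
    name: str
    personal_data: List[str] = field(default_factory=list)  # personal data fields the step touches
    purpose: str = ""       # documented reason, as stated in the privacy policy
    reviewed_by: str = ""   # person with privacy knowledge who vetted the step


def undocumented_steps(steps: List[WorkflowStep]) -> List[str]:
    """Return the names of steps that touch personal data without a purpose or reviewer."""
    return [
        step.name
        for step in steps
        if step.personal_data and not (step.purpose and step.reviewed_by)
    ]
```

A build could fail when this list is not empty, which pushes the documentation and the human review to happen before release rather than after.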

Test in Production to Make Code Releases Safer

As software releases graduate from development to test, staging, and production environments, they undergo various stages of testing. A release candidate from the development environment may undergo daily regression testing. Perhaps in test, functionality and usability testing is performed. But as software and its user interactions become more complicated and time sensitive, the real rubber meets the road in only one place: testing in production.

There are many types of testing: feature verification (did we build what we said we would?), integration testing (did the automated tests pass?), usability testing, reliability testing, and of course performance testing. These tests can include taking servers offline, introducing errors, and other anomalies to see how the software behaves. However, no matter how closely your testing environments mimic production, there is no greater test than doing it live.

Replicating real-world conditions
These tests in development environments do a great job of assessing the usability and general function of software, but they don’t do a great job of assessing performance in real-world conditions. Traffic and users alike can behave in unexpected ways. Finding out that your software doesn’t behave as expected where it normally lives, in production, is never fun. Capturing, sanitizing, and replaying production traffic is often a non-trivial affair, especially in complex systems with many interactions.
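
To make the sanitizing part of capturing, sanitizing, and replaying production traffic concrete, here is a minimal sketch of one possible approach: replacing sensitive values with stable, salted hash tokens so that a replayed request keeps its shape without exposing real data. The SENSITIVE_FIELDS set, the record layout, and the salt value are assumptions for illustration, not a description of any particular capture tool.

```python
import hashlib

# Hypothetical field names; a real system would take these from its own data map.
SENSITIVE_FIELDS = {"email", "phone", "name"}


def sanitize(record, salt="replay-salt"):
    """Return a copy of a captured request record with sensitive values tokenized.

    The same input value always maps to the same token, so relationships
    between requests (for example, one user sending many requests) survive.
    """
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            clean[key] = f"{key}-{digest[:12]}"
        else:
            clean[key] = value
    return clean
```

Stable tokens preserve traffic patterns but not the original values themselves, so a problem triggered by specific real-world input may not reproduce from sanitized data.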


FEATURE VERIFICATION: Did we build what we said we would?

INTEGRATION: Testing software modules as a group

RELIABILITY: Repeating results to increase likelihood of success

PERFORMANCE: Evaluating product quality


Most organizations are comfortable talking about features and usability, but uptime and performance have been part of other departmental concerns. In a DevOps environment, that simply will not do. Uptime and performance are no longer the responsibility of Operations alone.

But how do we resolve this? Capturing traffic and replaying it in test environments is non-trivial, and sanitized data can often remove the exact insanity that you’re trying to introduce. This isn’t to suggest that you shouldn’t do these things. You absolutely should. The longer it takes to detect a problem, the more expensive it is to resolve.

By defining SLAs for your software and testing them as part of the release process, you can catch these problems in your common scenarios, including capturing the supporting data like metrics, performance statistics, and error rates. Testing as part of the release process should be a challenge to break the software, not just to validate that it still behaves. Inject errors. Take systems offline. Introduce chaos testing to randomly shut off components, inject network latency, or cause other unforeseen anomalies. Because sooner or later, they’re going to happen in prod.

Verifying that your software is meeting its SLAs prior to release in production builds the confidence to go beyond, to test in production.

Production testing to increase safety
This doesn’t mean skipping testing and letting production find the problems (known as unintentional testing in production), but rather using production to increase the safety of your release through proven strategies. Red/black deployment and slow rollouts (canary releases) can reduce risk by allowing you to test with real users and real data. If you see an increase in errors, you can immediately roll back. Good monitoring and metrics are key. You can let software age for a few days to see how it performs over time, before exposing more users to it.

These strategies further validate the viability of a release in production, and are extremely important when making large architectural changes where the normal characteristics have changed, and ‘gut feel’ or other fuzzy acceptance measures are clearly not good enough.

A purposeful approach to testing in production reduces risk and instills the confidence to make changes, with the ultimate goal being to find problems before your customers do, no matter the circumstance or the change.
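
The rollback decision in a canary release can be made explicit instead of relying on gut feel. The sketch below compares the canary’s error rate against both an SLA ceiling and the current baseline. Every name and number in it (canary_decision, the 1% SLA, the 1.5x tolerance, the noise floor, the minimum request count) is an assumption chosen for illustration, not a recommended setting.

```python
def canary_decision(
    baseline_errors,
    baseline_requests,
    canary_errors,
    canary_requests,
    sla_max_error_rate=0.01,   # assumed SLA: at most 1% of requests may fail
    tolerance=1.5,             # canary may be up to 1.5x the baseline error rate
    noise_floor=0.001,         # ignore differences smaller than 0.1 percentage points
    min_requests=100,          # do not judge the canary on too little traffic
):
    """Return "wait", "rollback", or "promote" for a canary release."""
    if canary_requests < min_requests:
        return "wait"
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    if canary_rate > sla_max_error_rate:
        return "rollback"  # the release breaks its SLA outright
    if canary_rate > baseline_rate * tolerance + noise_floor:
        return "rollback"  # within the SLA, but clearly worse than what is running now
    return "promote"
```

Returning "wait" for low traffic reflects the idea of letting software age before exposing more users to it. A "promote" result would normally widen the rollout one step rather than release to everyone at once.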

Continuous Testing Is Crucial for DevOps, But Not Easy

As software transitions from a monolithic to a microservice architecture, organizations are adopting DevOps practices to accelerate delivery of features to customers and improve their [...]

Jumping into continuous testing without the right infrastructure, tools, and processes can be a disaster. Continuous testing plays an important role in bringing quality to market as fast as possible. Continuous testing requires several levels of monitoring with automated triggers, collaboration, and actions. Here’s what is required:

1. Automatic Test Triggers to execute tests as software transitions through its stages: development / test / staging / production
2. Service Health Monitoring to automate feedback on failures
3. Test Result Monitoring to automate feedback on failures
4. Identifying Root Cause of Failure and analyzing test results

1. Automated Test Triggers:
To enable faster feedback, tests need to be classified in various layers:
• Health check: These tests ensure that the services are up and running. Such checks are triggered by various monitoring applications.
• Smoke test: These tests verify that key business features are operational and functional. Such tests should have a short test cycle, typically less than 15 minutes, and be executed on a continuous basis.
• Intelligent regression: A subset of the regression test scenarios is triggered with deployments, based on the code changes, and a full regression is triggered on a nightly basis.
• Benchmark/Load test: These tests measure the performance of each service and are triggered on a nightly basis.
• Reliability/Chaos testing: These tests measure system behavior while failures are deliberately injected into services. Such tests are triggered on a weekly basis to identify key infrastructural and operational issues.
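
The layers above can be written down as a small trigger table so that each event runs only the tests that belong to it. This is a sketch under assumptions: the layer keys, the trigger names ("monitor", "continuous", "deployment", "nightly", "weekly"), and both helper functions are invented for illustration and do not correspond to a specific CI tool.

```python
# Which event starts which layer, following the classification in the text.
TEST_LAYERS = {
    "health_check": {"trigger": "monitor"},
    "smoke": {"trigger": "continuous", "max_minutes": 15},
    "intelligent_regression": {"trigger": "deployment"},
    "full_regression": {"trigger": "nightly"},
    "benchmark_load": {"trigger": "nightly"},
    "reliability_chaos": {"trigger": "weekly"},
}


def layers_for(trigger):
    """Return the test layers that should run for a given trigger event."""
    return sorted(name for name, cfg in TEST_LAYERS.items() if cfg["trigger"] == trigger)


def regression_subset(changed_components, scenarios):
    """Pick regression scenarios that cover at least one changed component.

    scenarios maps a scenario name to the set of components it exercises.
    """
    changed = set(changed_components)
    return sorted(name for name, covered in scenarios.items() if changed & set(covered))
```

For example, a deployment that touches only one component would run the scenarios returned by regression_subset for that component, while the nightly trigger would still run the full regression and the benchmark/load tests.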


2. Service Health Monitoring
Maintaining the health of services requires:
• Automated Alerts: Automation notifies the relevant service team to take the appropriate action required by the failure.

3. Test Results Monitoring
As tests are triggered, it’s essential to monitor results and take the required steps when there are failures:
• Automated Notifications: Notify the appropriate service development teams to take necessary actions, such as blocking a release from going to production if a critical defect is introduced.

4. Identifying Root Cause of the Failure
Set up a framework to track every request made for automated test runs and how those requests traverse the various distributed services:
• Identifying Information: Every request made for automation test runs includes a custom header that carries information like the Test Run ID or Test Case ID. After the request is submitted, the response will contain application trace IDs, which you can track in the test results log.

Every test failure triggers an automated investigation process which does the following:
• Retrieve the test case details to identify the list of components tested in the respective tests
• Identify whether any of the components has changed since the last successful test run
• Identify the list of changes and retrieve the metrics for each of those components
• Correlate the test results based on the changes to identify a pattern
• Once the problem is identified, update the service team owners to take the necessary action to fix the problem
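
The identifying header and the first steps of the investigation process above can be sketched as follows. This is a hedged illustration, not the xMatters framework: the header names X-Test-Run-Id and X-Test-Case-Id, the shape of the change records, and all three function names are assumptions.

```python
from collections import Counter


def trace_headers(run_id, case_id):
    """Build the custom headers attached to every automated test request.

    The header names are hypothetical; use whatever your services agree on.
    """
    return {"X-Test-Run-Id": run_id, "X-Test-Case-Id": case_id}


def suspect_components(tested_components, last_success_at, changes):
    """Find tested components that changed since the last successful run.

    changes: list of dicts with "component", "changed_at", and "description".
    Returns a mapping of component name to the descriptions of its changes.
    """
    tested = set(tested_components)
    suspects = {}
    for change in changes:
        if change["component"] in tested and change["changed_at"] > last_success_at:
            suspects.setdefault(change["component"], []).append(change["description"])
    return suspects


def rank_suspects(per_failure_suspects):
    """Correlate several failures: count how often each component is a suspect."""
    counts = Counter()
    for suspects in per_failure_suspects:
        counts.update(suspects.keys())
    return counts.most_common()
```

A component that appears as a suspect across many failed tests is a reasonable first place to look, and its owners are the team to notify in the final step.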


You Made It!
The speed and complexity of a DevOps environment provide a zillion (literally) opportunities for mistakes
every day. Between product development, code release, security, and monitoring, transactions take place
at an amazing rate. Preventing incidents completely is virtually impossible, but there is one way to maintain
quality assurance.

Testing thoroughly and often is the best way to prevent issues that can put your business at risk. We hope
you have found the content in this paper valuable. Please visit us at xMatters.com/solutions/devops/
for more information.

Thanks! And good luck.





