
Mastering Cloud Operations
Optimizing the Enterprise for Speed and Agility
Michael Kavis & Ken Corless

Preview Edition
Mastering Cloud Operations
Optimizing the Enterprise for Speed and Agility

This Preview Edition of Mastering Cloud Operations, Chapters 1 and 2, is a work in
progress. The final book is currently scheduled
for release in December 2020 and will be
available through O’Reilly Online Learning and
other retailers once it is published.

Michael Kavis and Ken Corless


Mastering Cloud Operations
by Michael Kavis and Ken Corless

Copyright © 2020 Michael J. Kavis and Ken Corless. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Kathleen Carr
Development Editor: Sarah Grey
Production Editor: Deborah Baker
Interior Designer: Monica Kamsvaag
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

December 2020: First Edition

Revision History for the Early Release


2020-02-07: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492055952 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Cloud Operations,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author(s), and do not represent the publish-
er’s views. While the publisher and the author(s) have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and the
author(s) disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the infor-
mation and instructions contained in this work is at your own risk. If any code samples or other
technology this work contains or describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that your use thereof complies with
such licenses and/or rights.

This work is part of a collaboration between O’Reilly and F5 Networks. See our statement of edi-
torial independence.

978-1-492-05588-4

[LSI]
Contents

Introduction: Why Do We Need a New Operating Model for Cloud?

1 | The Shift to the Cloud

2 | What Needs to Change?
Introduction: Why Do We Need a New Operating Model for Cloud?

In 2006, Amazon launched Amazon Web Services (AWS), which provided com-
pute, storage, and queuing services on demand, giving birth to the term cloud
computing. While timesharing of computers had existed for decades, Amazon’s
“software-defined everything” approach was groundbreaking. For the first time,
enterprises could deploy software on virtual data centers without ever having to
purchase or lease physical infrastructure on-premises or at a hosted facility. In
fact, they didn’t have to talk to a single human being. Enterprises could now run
their software on Amazon’s data centers and forgo the process of procuring, rack-
ing and stacking, and managing physical infrastructure.
We have all heard how developers flocked to the cloud because they were
freed from long procurement cycles. AWS provided capabilities that allowed
developers to self-provision, manage, and change compute, network, and storage
resources on demand from a self-service portal. Not only was the infrastructure
self-service, it was scriptable—developers could now write code to provision
infrastructure instead of relying on another group to procure, install, and alter
hardware and its configurations. The result? Drastically improved time to mar-
ket. Almost overnight, the line between development (Dev) and operations (Ops) began to blur.
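To make scriptable infrastructure concrete, here is a minimal sketch that provisions a virtual machine in code, assuming AWS and the boto3 Python SDK (our illustrative choice; the AMI ID and tag values are placeholders, and every major CSP offers an equivalent SDK):

    import boto3

    # A few lines of Python replace a procurement ticket: request a virtual
    # machine, tag it to an owning team, and get its ID back in seconds.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "team", "Value": "payments"}],
        }],
    )
    print("Provisioned:", response["Instances"][0]["InstanceId"])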
The term “shift left” refers to roles and responsibilities moving from down-
stream processes (e.g., testing or operations) and organizations to the develop-
ment process and early planning phases. As domain areas like quality assurance (QA), security, networking, storage, and others shift left into the development organizations, existing organizational structures, roles, and responsibilities become disrupted. This disruption leads to a variety of challenges like political
infighting, increased risk, legacy process bottlenecks, skill gap challenges, and
much more. All of these challenges make cloud adoption extremely hard within
large organizations and can lead to failed cloud adoption, poor return on invest-
ment (ROI), and soaring costs. Cloud adoption is fundamentally hard; big IT
organizations are generally resistant to change given the risk-averse nature of
most IT shops. We have seen many of our clients take on an odd strategy—adopt
the cloud, but try to change as little as possible. As we will cover throughout the
book, this approach has rarely led to the desired outcomes.
One of the fastest ways for a CIO to get fired is to lead an unsuccessful cloud
transformation. Too often, cloud leaders focus a disproportionate amount of time
and money on cloud technologies with little to no time and money focused on
organizational change management and process reengineering. It takes invest-
ments in all three levers: people, process, and technology to succeed in the cloud.
Transformation does not happen by buying a bunch of new tools. This book will
focus on the people and process part of the equation, with the lens on cloud oper-
ations.
Throughout this book, we will discuss the common challenges that large
organizations face when trying to operate in the cloud at scale. We will share our
experiences as practitioners who have built and operated born in the cloud com-
panies, as IT leaders in Fortune 500 companies, and as long-time consultants
with over 100 enterprise clients across the globe.
Before we get started, let’s clearly define what a cloud operating model is. An
operating model is a visual representation of how an organization delivers value
to its internal and external customers. A cloud operating model is a visual repre-
sentation of how an organization delivers value from cloud services to its custom-
ers. The cloud operating model encompasses both organizational structure and a
process flow model. A cloud operating model is a subset of an overall IT operat-
ing model.
Another term we will discuss throughout this book is cloud operations or
CloudOps for short. In this book, CloudOps refers specifically to the people, pro-
cess, and technology used to manage the operations of cloud software and cloud
platforms. CloudOps focuses on operating production workloads in the cloud
across any cloud service model: Infrastructure as a Service (IaaS), Platform as a
Service (PaaS), and Software as a Service (SaaS).
What you should take away from this book is that building any old piece of
software in the cloud is relatively easy, but building and operating software in the
cloud at scale that is secure, compliant, highly available, and resilient is a very
complex and challenging undertaking. If you expect (or your vendor is telling
you) that by moving to the cloud these things become easy, you are in for a rude
awakening. Traditional organizational structures and processes designed for
deploying software on physical infrastructure in a data center do not translate
well to the new world of cloud computing. This book will share best practices,
antipatterns, and lessons learned from over 10 years of experience and tons of
battle scars.
Now let’s dive in!
1 | The Shift to the Cloud

The rate of innovation is faster today than ever before. CIOs have an extremely
tough job balancing “keeping the lights on” with delivering new features to mar-
ket and keeping current with investments in new technologies. Pick up any trade
magazine and you will see success stories of large companies adopting emerging
technologies such as cloud computing, machine learning, artificial intelligence,
blockchain, and many others. At the same time, companies are trying to adopt
new ways to improve agility and quality. Whether it’s DevOps, Scaled Agile
Framework (SAFe), Site Reliability Engineering (SRE), or another favorite buzz-
word, there is so much change coming at CIOs that it’s a full-time job just keep-
ing up. Each new trend is designed to solve a specific set of problems, but it often
takes a combination of tools, trends, methodologies, and best practices to deliver
cloud at scale.

The Journey to the Cloud Is a Long Hard Road


Even as CIOs embrace cloud computing, they have to follow the policies of corpo-
rate governance, risk, compliance (GRC), and security teams. In many compa-
nies, those teams don’t welcome change; they usually have a strong incentive to
make sure “we never end up in the Wall Street Journal” for any kind of breach or
system failure. CEOs, however, also worry about appearing in the Harvard Busi-
ness Review when the company goes bankrupt due to its dated, conservative view
on technology adoption. CIOs are being asked to deliver more value faster in this
very strained environment. The fear of risk and the need to develop new capabili-
ties are competing priorities. Many organizations are responding by shifting tra-
ditional domain-specific functions like testing, security, and operations to the
software engineering teams and even to Agile teams that are more aligned to
business functions or business units.

The challenge this presents is that many IT people are not sufficiently skilled
to take on these new roles effectively. It only takes one server with an open port
to the web to cause your chief information security officer to lock everything
down to the point where nobody can get timely work done in the cloud. When
domain expertise shifts without corresponding changes in existing organiza-
tional structures, roles, responsibilities, and processes, the result is usually unde-
sirable—and sometimes even catastrophic. IT leaders need to step back and
redesign their organizations to build and run software in the cloud at scale. That
means rethinking the entire software development value stream, from business
value ideation to business value running ongoing in production. Additionally,
since most journeys to the cloud take years, organizations must harmonize their
processes across their legacy environments and new cloud environments.

Cloud Transformation
When we talk to companies about cloud transformation, we first try to establish
what we mean by the term. Foundationally, a cloud transformation is the journey
of a company to achieve two separate but related goals: to create a faster, cheaper,
safer IT function, and to use those capabilities to build its future. They can ach-
ieve the first goal by leveraging the wonderful array of new technologies available
on demand in the cloud, from Infrastructure as a Service (IaaS) to artificial
intelligence (AI), machine learning, the Internet of Things (IoT), Big Data, and
other exciting new building blocks of digital business. Today, every company is a
technology company. Those who use the best technology to build digital prod-
ucts, services, and customer experiences are best positioned in the global market-
place. A cloud transformation is truly a path to the future for any CIO.
No two cloud transformations are the same, but the patterns for success and
failure are very common. Many companies that succeed in the cloud have
learned tough lessons about the flaws in their initial thoughts and strategies.
Nobody gets it right at the beginning, but if you start your transformation expect-
ing some bumps and bruises along the way, you can start making progress. “Pro-
gress over perfection” is a principle from the Agile Manifesto that is very
appropriate for the cloud. In fact, many “born in the cloud” tech companies, like
Netflix and Etsy, are on their second or even third generation of cloud architec-
tures and approaches. The pace of change continues to increase once you arrive
in the cloud.
Your company’s culture must embrace transparency and continuous learn-
ing; expect to adjust constantly and improve your cloud transformation over time.

Your authors, Mike and Ken, attend a few conferences each year, such as AWS
re:Invent, Google Cloud Next, and DOES (DevOps Enterprise Summit), and we
hear great customer success stories. If you’re working at a company that hasn’t
achieved that level of success, you can get disheartened at these conferences
because it can seem like all the other companies are getting it right. However,
many of those companies are battling with the same issues. Don’t be fooled;
most of the presenters’ success stories only represent a single team (or product
line or business unit) within a very large organization, not the entire organiza-
tion. They started their journey with challenges similar to those your company is
facing, and much of their organization may still be in the very early stages. Keep
your chin up. This book will share a lot of lessons learned for what to do and,
more importantly, what not to do as you embark on your cloud journey. Getting
started can be hard, even daunting, but remember the words of Chinese philoso-
pher Lao Tzu: “A journey of a thousand miles begins with a single step.”
Cloud transformations are a multiyear journey that is never really complete.
The term cloud may be dropped one day (just like we really don’t say “client-
server” systems anymore), but operating (and optimizing) in the cloud is a never-
ending journey. What’s more important than getting it right at the beginning is
actually starting. Too many organizations get caught up in trying to create the
perfect strategy with low tolerance for risks and failures. These companies often
have only years of strategy documents, PowerPoint decks, and a few consulting
bills to show for their efforts, even as their competitors keep advancing. Why
have their efforts created so little business value?
The reason some companies don’t get too far is that they don’t see the cloud as a
transformation. They only see it as a technology project, like adopting a new tool.
Being too conservative to allow the company to move forward and do anything
significant in the cloud might be more of a failure than moving to the cloud and
running into problems with availability and resiliency. At least with the latter the
company is getting experience with the cloud and increasing maturity. Slowing
or stopping a cloud transformation “done wrong” is very similar to what we saw
in the 1990s when companies stopped moving from mainframes to client-server architectures due to “client-server done wrong.”
When companies don’t recognize the need to transform their organization to
build, operate, and think differently about software, they take their old business
processes, tooling, and operating model with them to the cloud—which almost
always results in failure. Even worse, sometimes they then declare victory—“Yes,
we have a cloud!”—thereby further inhibiting the business.

The Evolution of Cloud Adoption


Coauthor Mike created this maturity curve based on customer requests he’d been
receiving since 2013.

Note
This image will be available in the final release of the book.

He explains:

When I first started consulting in the cloud computing space, most of the
client requests were for either a TCO (total cost of ownership) or an ROI
analysis, for a cloud initiative or overall cloud strategy. Many leaders had a
hard sell to convince their CEO and board that cloud computing was the
way forward. The cloud, like virtualization before it, was viewed as a cost-
saving measure achieved by better utilization of resources. At that time,
about 80% of the requests were focusing on private cloud while only 20%
were for the public cloud, almost exclusively AWS. In November 2013, at
the annual re:Invent conference, AWS announced a wide variety of
enterprise-grade security features. Almost immediately, our phone rang
off the hook with clients asking for public cloud implementations. A year
later our work requests were completely flipped, with over 80% for public
cloud and 20% for private cloud.

Why the Private Cloud Isn’t


From 2005 to 2012, many large enterprises focused their cloud efforts on
building a private cloud. Security and regulatory uncertainty made them
believe they needed to retain complete control of their computing envi-
ronments. The traditional hardware vendors were more than happy to
condone this point of view: “Yes, buy all of our latest stuff and you will be
able to gain all the advantages you would get by going to a public
cloud!” Quite a few Fortune 500 companies invested hundreds of mil-
lions of dollars to build world-class data centers they promised would,
through the power of virtualization and automation, deliver the same
benefits as going to the cloud. While such efforts were often declared
successes (they did tend to save money on hardware), they fell well short
of turning the company into the next Uber or Amazon.

We have seen some companies jump all-in to the public cloud with positive
results. As the adoption of public cloud increased, companies moved or built
new workloads in the cloud at rates much faster than they had traditionally
deployed software. However, two common anti-patterns emerged.

Wild West anti-pattern


Developers, business units, and product teams now had access to on-
demand infrastructure and leveraged the cloud to get product out the door
faster than ever. Since cloud was new to the organization, there was no set
of guidelines or best practices. Development teams were now taking on
many responsibilities that they never had before. They were delivering
value to their customers faster than ever before, but often exposing the
organization to more security and governance risks than before and deliv-
ering less resilient products. Another issue was that each business unit or
product team was reinventing the wheel: buying their favorite logging,
monitoring, and security third-party tools. They each took a different
approach to designing and securing the environment and often imple-
mented their own CI/CD toolchains, with very different processes.
Command and control anti-pattern
Management, infrastructure, security, and/or GRC teams put the brakes
on access to the public cloud. They built heavily locked down cloud services
and processes that made developing software in the cloud cumbersome,
destroying one of the key value propositions of the cloud—agility. We have
seen companies take three to six months to provision a virtual machine in
the cloud, something that should only take five to ten minutes. The
command-and-control cops would force cloud developers to go through the
same ticketing and approval processes required in the data center. These
processes were often decades old, designed when deployments occurred
two or three times a year and all infrastructure was physical machines
owned by a separate team.

Ken relates this story from his very early experiences with cloud adoption:

The company had decided on an aggressive cloud adoption plan. Being a


large SAP shop, they were very excited about the elastic nature of the
cloud. SAP environments across testing, staging, development, etc. can
be very expensive, so there are rarely as many as teams would like.
Shortly after the non-production environments moved to the cloud, I had the cloud infrastructure team show me how they had automated the provisioning of an entire SAP environment. What had previously taken
months could now be done in a few hours! With excitement in my stride, I
strolled over to the testing team.

“You must be thrilled that we can now provision a whole new SAP line
in a few hours!” I exclaimed.

“What are you talking about?” they asked. I went on to explain what I
had just learned about the automated provisioning of the SAP environ-
ment.

“Well, that’s not how it really works,” they told me. “If we want a new
environment, we have to create a Remedy ticket requesting it. Our envi-
ronment manager has five business days to review it. If he approves it
without questioning it (he never does), it then goes to finance for approval.
They meet monthly to review budget change requests. If the budget is
approved, we then need to submit it to the architecture review board.
They meet the third Friday of every month. That process typically requires
at least two cycles of reviews. So I’m thrilled that someone can do it in a
few hours, but I’m still looking at several months.”

Totally deflated, I realized the truth in the old saying: “I have met the
enemy, and the enemy is us.”

The two anti-patterns mentioned here drove a lot of work over the next few
years. In the Wild West pattern, production environments became unpredictable
and unmanageable due to a lack of rigor and governance. There were regular
security breaches because teams did not understand the fundamentally different
security postures of the public cloud, believed that security was “somebody else’s
job,” or both. The command-and-control pattern created very little value while
requiring large amounts of money in ongoing strategy and policy work, building
internal platforms that did not meet developers’ needs. Worse yet, this created a surge of shadow IT: groups or teams running their own mini IT organizations because their needs were not being met through normal channels.
All of these issues have created an awareness of the need for a strong focus
on cloud operations and a new cloud operating model. Since 2018, one of our cli-
ents’ most frequent requests is for help modernizing operations and designing
new operating models.

Many of the companies we work with are two or three years or more into
their journey. In the inaugural years, they pay a lot of attention to cloud technolo-
gies. They improve their technical skills for building software and guardrails in
the cloud. They often start at the IaaS layer because they are comfortable dealing
with infrastructure. As their cloud experience matures, they realize that the true
value of cloud is higher up in the cloud stack, and they look into PaaS and SaaS
services.
Development shops have been embracing automation and leveraging con-
cepts like continuous integration (CI) and continuous delivery (CD). The rest of
this book will focus on the impact of concepts like DevOps, cloud native architec-
ture, and cloud computing on traditional operations.

From Hardened Data Center to Blank Canvas


When you start building in the public cloud, you are basically starting from
scratch. You have no existing cloud data center, no guardrails, no financial man-
agement tools and processes, no disaster recovery or business continuity plan—
just a blank canvas. Conventional wisdom is to just apply all the tools, processes,
and organizational structures from the data center to the cloud. That’s a recipe
for disaster.
Regardless of how well or badly an organization manages its data center,
people are accustomed to the existing policies and processes and generally know:

• What to do when incidents, events, or outages arise
• What processes to follow to deploy software
• What the technology stack is for the products they are building and managing
• What process to follow to introduce new technology

When applications are moved, refactored, or built new on the cloud, they are
being deployed to a brand new virtual environment that is radically different than
the environments that people are used to in the existing data centers. The pro-
cesses and policies governing how work gets done in a data center have typically
evolved from many years of change across numerous shifts in technology from
mainframes, client-server architectures, internet-enabled applications, and
today’s modern architectures. Many of these processes were defined in a differ-
ent era: a gigabyte of storage was hundreds of thousands of dollars in the 1980s.
In the cloud, it’s about two cents per month. Human labor was the cheap
component. Along with these legacy processes come a whole host of tools, many
of which are legacy themselves and were never intended to support software that
runs in the cloud.
Too often, teams from infrastructure, security, GRC, and other domain-
specific areas insist on sticking to their existing tools and processes. If these tools
are not cloud native or at least cloud friendly, a painful integration must take
place to make them work effectively.
This creates unnecessary friction for getting software out the door. It can
also create complexity, which can increase costs, reduce performance, and even
reduce resiliency. Another issue is that often these legacy tools are also tied to
legacy processes, which makes it challenging and sometimes impossible to auto-
mate the end-to-end software build and release processes.
Another common anti-pattern is the desire to keep existing on-premises log-
ging solutions in place and not go to a cloud-native solution. When you do this,
all logs must be sent from the public cloud back to the data center through a pri-
vate channel, incurring data transfer costs and creating an unnecessary depend-
ency on the data center. These legacy logging solutions often have dependencies
on other software solutions as well as processes that create dependencies
between the cloud and data center. This means that a change in the data center
can cause an outage in the cloud because nobody knew of the dependency. These
issues are very hard to debug and fix quickly.
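By contrast, a cloud-native logging path is a handful of API calls made from inside the cloud itself. Here is a minimal sketch, assuming AWS CloudWatch Logs and the boto3 SDK (the log group and stream names are illustrative):

    import json
    import time
    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    # Create the group and stream once; both calls raise
    # ResourceAlreadyExistsException on reruns, so guard accordingly.
    logs.create_log_group(logGroupName="/acme/orders-service")
    logs.create_log_stream(logGroupName="/acme/orders-service",
                           logStreamName="web-1")

    # Ship a structured event; timestamps are epoch milliseconds.
    logs.put_log_events(
        logGroupName="/acme/orders-service",
        logStreamName="web-1",
        logEvents=[{
            "timestamp": int(time.time() * 1000),
            "message": json.dumps({"level": "INFO", "event": "order_created"}),
        }],
    )

The logs never leave the provider, so there is no private channel back to the data center and no hidden dependency to debug at 2 a.m.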
Here is another example. We did an assessment of a client’s tools, recom-
mended tools that would work well in the cloud, and advised them on which
ones should be replaced by a more cloud-suitable solution. One we recom-
mended replacing dealt with monitoring incoming network traffic. The group
that managed the tool refused to look into a new tool because they were comfort-
able with the existing tools and didn’t want to have to manage two tools. This cre-
ated a single point of failure for all of the applications and services running in
their public cloud. One day the tool failed and no traffic was allowed to flow to
the public cloud, thus taking down all cloud applications.
The lesson here is to evaluate replacements for tools that are not well suited
for the cloud. We often see resistance to process change lessen when new tools
are brought in; tools are often tightly tied to the processes they are used in. Try to
reduce the number of dependencies that the applications running in the public
cloud have on the data center and have a plan to mitigate any failures on a data
center dependency.

The new cloud operating model that we are advocating brings domain
experts closer together. We hope it will reduce these avoidable incidents as com-
panies rethink their approach to the cloud.

Shared Responsibility: The Data Center Mindset Versus the Cloud Mindset
A common mistake that companies make is that they treat the cloud just like a data
center. They think in terms of physical infrastructure instead of leveraging cloud
infrastructure as a utility, like electricity or water. There are two major capabili-
ties that are different in the cloud when compared to most on-premises infra-
structures. First, in the public cloud, everything is software defined and software
addressable (that is, it has an API). This creates an incredible opportunity to auto-
mate, streamline, and secure the systems. While software-defined everything has
made significant strides in the data center in the last decade, most of our clients
still have major components that must be configured and cared for manually.
The second major difference in the public cloud is the inherent design for multi-
tenancy. This “from the ground up” view of multi-tenancy has driven a great level
of isolated configuration in the cloud.
Here’s an example. In most companies, there are one or two engineers who
are allowed to make DNS changes. Why is that? Because the tooling we often use
on-premises does not isolate workloads (or teams) from each other. This means
that if we let Joe manage his own DNS, he might accidentally change Sue’s DNS,
causing disruption. So we have made sure that only David and Enrique are
allowed to change DNS for everyone in the whole company. In contrast, in the
cloud, everyone’s accounts are naturally isolated from each other. Joe can have
full authority over his DNS entries while he might not even be able to browse, let
alone change, Sue’s entries. This core difference is often overlooked and is one of
the key facets that allow for self-service capability in the public cloud.
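To illustrate, here is a minimal sketch of Joe making his own DNS change, assuming AWS Route 53 and boto3 (the hosted zone ID, record name, and IP address are hypothetical). Because his credentials are scoped to his own account’s hosted zone, Sue’s records are out of reach:

    import boto3

    route53 = boto3.client("route53")

    # Joe upserts a record in the hosted zone that belongs to his account.
    # His credentials cannot even list Sue's zone, let alone change it.
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",  # hypothetical zone ID
        ChangeBatch={
            "Comment": "Self-service DNS change by the owning team",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.joes-service.example.com",
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }],
        },
    )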
Enterprises who have been building and running data centers for many years
often have a challenge shifting their mindset from procuring, installing, main-
taining, and operating physical infrastructure to a cloud mindset where infra-
structure is consumed as a service. (Randy Bias has memorably described the
difference between cloud and physical servers as being like the difference
between livestock and pets: livestock are numbered and replaceable, while pets are named and cared for personally.)
You might also think of an analogy we like to use: buying a house versus
renting a house. The analogy really boils down to assets that are purchased versus assets that are rented and the responsibilities that go along with each. When
you buy a house, you are investing in both property and physical structure(s) on
that property. You are responsible for maintaining the house, landscaping, clean-
ing, and everything else that comes with home ownership. When you rent, you
are paying for the time that you inhabit the rental property. It is the landlord’s
responsibility to maintain it. The biggest difference between renting and buying
is what you, as the occupant of the house, have control over. (And just as people
get more emotionally attached to their owned homes than to their rented apart-
ments, plenty of infrastructure engineers have true emotional attachments to
their servers and storage arrays.)
When you leverage the cloud, you are renting time in the cloud provider’s
“house.” What you control is very different than what you control in your own
data center. For people who have spent a lot of their career defining, designing,
and implementing processes and technologies for the controls they are responsi-
ble for in their data center, shifting some of those controls to a third party can be
extremely challenging.
The two groups who probably struggle the most to grasp the cloud shared-
responsibility model are auditors and GRC teams. These teams have processes
and controls for physically auditing data centers. When you pause to think about
it, physically evaluating a data center is a bit of a vestigial process. Sure, 50 years
ago nearly all IT processes (including application development) probably hap-
pened in the data center building, but today, many data centers run with skeleton
crews. IT processes are distributed in many locations, often globally. But the
auditors expect to be able to apply these exact processes and controls in the cloud.
The problem is, they can’t. Why? Because these data centers belong to the cloud
service providers (CSPs), who have a duty to make sure your data is safe from
their other clients’ data. Would you want your competitor walking on the raised
floor at Google where your software is running? Of course not. That’s just one
simple example.
At one meeting we attended, a representative of one of the CSPs was explain-
ing how they handle live migrations of servers that they can run at any time dur-
ing the day with no impact to the customers. The client was adamant about
getting all of the CSP’s logs to feed into their company’s central logging solution.
With the shared-responsibility model, the CSP is responsible for logging and
auditing the infrastructure layer, not the client. The client was so used to being
required to store this type of information for audits that they simply would not
budge. We finally had to explain that in the new shared responsibility model, that data would no longer be available to them. We asked where they stored the logs
for failed sectors in their disk array and how they logged the CRC (cyclic redundancy check) events in their CPU. Of course, they didn’t.
We explained to the client that they would have to educate their audit team
and adjust their processes. To be clear, the policy that required the client to store
those logs is still valid. How you satisfy that policy in the cloud is completely dif-
ferent. If the auditors or GRC teams cannot change their mindset and come up
with new ways to satisfy their policy requirements, they might as well not go the
public cloud. But does an auditor or a GRC team really want to hold an entire
company back from leveraging cloud computing? Should the auditor be making
technology decisions at all? A key task in the cloud modernization journey is the
education of these third-party groups that have great influence in the enterprise. As
technology becomes more capable and automated, the things that we have to
monitor will change—because the risk profile has changed fundamentally.
In the data center world, teams are traditionally organized around skill
domains as they relate to infrastructure. It is common to find teams responsible
for storage, for network, for servers, for operating systems, for security, and so
forth. In the cloud, much of this infrastructure is abstracted and available to the
developers as an API call. The need to create tickets to send off to another team
to perform a variety of tasks to stand up physical infrastructure like a SAN (stor-
age area network) simply does not exist in the public cloud. Developers have
access to storage as a service and can simply write code to provision the neces-
sary storage. This self-service ability is crucial to enabling one of the prizes of
cloud transformation: higher-velocity IT.
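As a minimal sketch of storage as a service, assuming AWS and boto3 (the bucket and volume parameters are illustrative), what once required a SAN ticket becomes two API calls:

    import boto3

    # Object storage: one call, no capacity planning, globally unique name.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="acme-orders-archive-example")

    # Block storage: a 100 GiB volume, attachable to a VM moments later.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=100,
        VolumeType="gp3",
    )
    print("Volume ready to attach:", volume["VolumeId"])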
Networking teams in the data center leverage third-party vendors who pro-
vide appliances, routers, gateways, and many other important tools required to
build a secure, compliant, and resilient network. Many of these features are avail-
able as a service in the cloud. For areas where the cloud providers don’t provide
the necessary network security functionality, there are many third-party SaaS or
pay-as-you-go solutions available, either directly from the vendor or from the
CSP’s marketplace. Procuring these solutions in the cloud when they are con-
sumed as SaaS, PaaS, or IaaS is different than how similar tools in the data cen-
ter are procured. In the public cloud, there are usually no physical assets being
purchased. Gone are the days of buying software and paying 20-25% of the pur-
chase price for annual maintenance. In the cloud you pay for what you use, and
pricing is usually consumption based.

Use What You Have Versus Use What You Need


Before cloud computing was an option, almost all of the development we were
involved in was deployed within the data centers that our employers and clients
owned. Each piece of the technology stack was owned by specialists for that given
technology. For databases, there was a team of DBAs (database administrators)
who installed and managed software from vendors like Oracle, Microsoft,
Netezza, and others. For middleware, there were system administrators who installed and managed software like IBM’s WebSphere, Oracle’s WebLogic,
Apache Tomcat, and others. The security team owned various third-party soft-
ware solutions and appliances. The network team owned a number of both physi-
cal solutions and software solutions and so forth. Whenever development wanted
to leverage a different solution from what was offered in the standard stack, it
took a significant amount of justification for the following reasons:

• The solution had to be purchased up front.
• The appropriate hardware had to be procured and implemented.
• Contractual terms had to be agreed upon with the vendor.
• Annual maintenance fees had to be budgeted for.
• Employees and/or consultants needed to be trained or hired to implement and manage the new stack component.

Adopting new stack components in the cloud, if not constrained by legacy thinking or processes, can be accomplished much more quickly, especially when these
stack components are native to the CSP. Here are some reasons why:

• No procurement is necessary if the solution is available as a service.
• No hardware purchase and implementation are necessary if the service is managed by the CSP.
• No additional contract terms should be required if the proper master agreement is set up with the CSP.
• There are no annual maintenance fees in the pay-as-you-go model.
• The underlying technology is abstracted and managed by the CSP, so the new skills are only needed at the software level (how to consume the API, for example).

Let’s say that Acme Retail, a fictitious big box retailer, has standardized on
Oracle for all of its OLTP (online transaction processing) database needs and Ter-
adata for its data warehouse and NoSQL needs. Now a new business requirement
comes in and Acme needs a document store database to address it. In the non-
cloud model, adopting a document store database would require new hardware, new soft-
ware licensing, database administrators (DBAs) trained in the new database
technology, new disk storage, and many other stack components. The process
required to get all of the relevant hardware and software approved, procured,
implemented, and secured would be significant. In addition, Acme would need
to hire or train DBAs to manage the database technology.
Now let’s look at how much simpler this can be in the public cloud. If
Acme’s CSP of choice offers a managed service for a document store database,
most of these steps are totally eliminated. They no longer need hardware, software licensing, additional DBAs to manage the database service, or new disk storage devices—no procurement at all. All Acme would need is to learn
how to use the API for the document store database to start building its solution.
Of course they would still need to work with the security team, the network
team, the governance team, and others to get approvals, but much of the time it takes
to adopt the new technology is reduced. In Chapter 2, we will discuss shifting
responsibilities closer to development (shifting left). But for the sake of this
example, let’s assume that we still have silos for each area of technology expertise
(network, storage, server, security, etc.).
Let’s say we have 120 days to deliver the new feature to the business. The best solution for the requirement would have been a document store database like MongoDB. However, we estimate that the effort required to get approvals, procure all of the hardware and software, train or hire DBAs, and implement the database would exceed our 120-day window. Therefore, we decide to leverage Oracle, our relational database engine, to solve the problem. This is suboptimal,
but at least we can meet our date. And remember, this is just the burden to get
started; we are not yet factoring in the burden that comes from day-to-day change
management as we are defining and building the solution.
This decision process repeats itself over and over from project to project,
which results in a ton of technical debt because we keep settling for suboptimal solutions due to our constraints. Now let’s see how different this can all be if we
are allowed to embrace a database-as-a-service solution in the public cloud.
After doing some testing in our sandbox environment in the cloud, we deter-
mine that the document store managed service on our favorite CSP’s platform is
a perfect solution for our requirements. We can essentially start building our sol-
ution right away because the database is already available in a pay-as-you-go
model complete with autoscaling. Leveraging stack components as a service can
reduce months of time from a project. It also allows you to embrace new technol-
ogies with a lot less risk. But most importantly, you no longer have to make tech-
nology compromises because of the legacy challenges of adopting new stack
components. Unfortunately, we do see some companies, especially early in their
cloud journeys, destroy this model by using the new technology with their old
processes (approvals, hand-offs, and SLAs). This change-resistant approach is
why we are so adamant about linking new process with new technology.
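To make the Acme scenario concrete, here is a minimal sketch assuming the CSP is AWS and the managed document store is DynamoDB (our illustrative choice; the scenario names no specific provider). Pay-per-request billing means there is no capacity to plan up front:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Create the table in pay-as-you-go mode: no servers, no DBAs, no license.
    dynamodb.create_table(
        TableName="acme-product-catalog",
        AttributeDefinitions=[{"AttributeName": "sku", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "sku", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )
    dynamodb.get_waiter("table_exists").wait(TableName="acme-product-catalog")

    # Start storing documents immediately.
    dynamodb.put_item(
        TableName="acme-product-catalog",
        Item={
            "sku": {"S": "SKU-1001"},
            "name": {"S": "Blue widget"},
            "attributes": {"M": {"color": {"S": "blue"}}},
        },
    )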
IMMUTABLE INFRASTRUCTURE

Earlier we discussed the livestock/pet analogy. Let’s expand on that thinking here. In the cloud, a best practice is to treat infrastructure as expendable (cattle).
We use the term immutable to describe the processes of destroying and rebuild-
ing a virtual machine (VM), network configuration, database, etc. in the cloud. In
the legacy model, servers exist continuously and a lot of time and effort goes into
making sure they are healthy. Release planning in the legacy model usually
requires that we have a backout or rollback strategy. Even regular security patch-
ing is a headache that all teams dread. Removing software updates from a pro-
duction system can often create even more issues than what an unsuccessful
release may have originally caused. Rollbacks can be extremely risky, especially
in a complex system.
In the cloud, when a virtual machine is unhealthy, we can simply shut it
down and create a new one. This allows us to focus our immediate attention on
SLAs in the areas of availability, performance, reliability, etc. instead of spending
time trying to determine what caused the issue. Once the system is back to being
stable and meeting its SLAs, we can then perform forensics on the data we cap-
tured from the terminated infrastructure. A best practice is to take a snapshot of
the machine image which can be used in conjunction with logging and monitor-
ing tools to triage the problem.
Treating virtual machines in the cloud as immutable gives us many advan-
tages for releasing software as well. Instead of designing complex rollback processes, we can implement a solution like the following process that is designed
for web servers sitting behind a load balancer.
In this example, we have three web servers sitting behind a load balancer.
The approach is to deploy the new software to three brand new VMs and to take a
snapshot of the current VMs in case we need to back out changes. Then we
attach the new VMs to the load balancer and start routing traffic to them. We can
either take the older VMs out of rotation at this point or simply make sure no
traffic gets routed to them. Once the new release looks stable, we can shut down
and detach the old VMs from the load balancer to complete the deployment. If
there are any issues, we can simply reroute traffic to the older VMs, and then
shut down the new VMs. This allows us to stabilize the system without introduc-
ing the risk of backing out the software. Once the system is stable, we can fix the
issues without worrying about creating any new issues from a complex rollback
scheme.
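Here is a minimal sketch of that traffic flip, assuming an AWS Application Load Balancer and boto3 (all ARNs and instance IDs are placeholders). Rollback is the same call pointed back at the old target group:

    import boto3

    elb = boto3.client("elbv2", region_name="us-east-1")
    ec2 = boto3.client("ec2", region_name="us-east-1")

    LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/web/abc"    # placeholder
    BLUE_TG = "arn:aws:elasticloadbalancing:...:targetgroup/web-blue/abc"     # old VMs
    GREEN_TG = "arn:aws:elasticloadbalancing:...:targetgroup/web-green/abc"   # new VMs

    # Snapshot the current VMs first, in case forensics are needed later.
    for instance_id in ["i-0aaa", "i-0bbb", "i-0ccc"]:  # placeholder IDs
        ec2.create_image(InstanceId=instance_id,
                         Name=f"pre-release-{instance_id}",
                         NoReboot=True)

    # Flip all traffic from the old fleet to the new one in a single call.
    elb.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": GREEN_TG}],
    )
    # If the release misbehaves, rerun modify_listener pointing at BLUE_TG.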
This is just one of many ways to embrace the concepts of immutable infra-
structure for deployments. Companies with deep experience in this area can
design advanced processes to create highly resilient systems, all of which can be
fully automated.
MICROSERVICES VERSUS MONOLITHS

Historically, applications were built, packaged, and deployed as a large unit of work made up of thousands or even millions of lines of code. These large sys-
tems are known as monoliths. Monoliths have many disadvantages. First, they
tend to be fragile. Any change to a line of code could impact the entire system.
Because of the large number of dependencies in these systems, monoliths are
changed infrequently and often scheduled as quarterly, biannual, or even annual
releases. These dependencies often create huge costs in regression testing (the
enemy of high velocity delivery). Infrequent changes reduce agility and create
long wait times for customers to gain access to new features and products.
Many companies have moved to or are experimenting with microservices. A
microservices architecture is “an approach to developing a single application as a
suite of small services, each running in its own process and communicating with
lightweight mechanisms, often an HTTP resource API,” according to Martin
Fowler and James Lewis. Each service can be run on its own infrastructure which
can be a server, a virtual machine, or a container (or now, a serverless function).
This style of architecture is often referred to as loosely coupled because the serv-
ices are independent of each other and not hard-coded to the underlying infra-
structure.

Figure 1-1. Microservices architecture. Source: Martin Fowler, https://martinfowler.com/articles/microservices.html
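As a minimal sketch of the definition above, here is a single service built with Python’s standard library (the resource and data are illustrative): one small process exposing one HTTP resource API, deployable and testable on its own.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class OrderHandler(BaseHTTPRequestHandler):
        """One independently deployable service owning one resource: orders."""

        def do_GET(self):
            if self.path.startswith("/orders/"):
                order_id = self.path.rsplit("/", 1)[-1]
                body = json.dumps({"id": order_id, "status": "shipped"}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # Each service runs in its own process behind its own port.
        HTTPServer(("0.0.0.0", 8080), OrderHandler).serve_forever()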

The advantage of microservices is that each service can be deployed separately (because they can be tested separately), resulting in much more frequent
deployments. Developers can make small changes or add new services without
being dependent on any other changes to the overall product. When you hear of
companies deploying multiple times a day, the company typically has a microser-
vices architecture or an architecture made up of many individual components.
The disadvantage is that managing a system made up of many individual parts can be challenging. Traditional monolithic systems are made up of three major
parts: a web front end, a database, and backend processes where most of the
business logic is. In a microservices architecture, the system is made up of many
independent services. Operating a service-based product requires new tooling,
new processes (especially for building and deploying software), and new skills.
We will discuss microservices in more detail in Chapter 2.
THE NEED FOR SPEED

We now build and deploy software faster than has ever been done before. Cloud
computing is one of the strongest contributors to that speed of deployment.
The cloud providers offer developers a robust service catalog that abstracts
the underlying infrastructure and a variety of platform services, allowing developers to focus more on business requirements and features. When cloud comput-
ing first became popular in the mid to late 2000s, most people used the
Infrastructure as a Service cloud business model. As developers became more
experienced, they started leveraging higher levels of abstraction. Platform as a
Service abstracts away both the infrastructure and the application stack (operat-
ing system, databases, middleware, etc.). Software as a Service vendors provide
full-fledged software solutions where the enterprises only need to make configu-
ration changes to meet their requirements and manage user access.
Each one of these three cloud service models (IaaS, PaaS, SaaS) can be huge
accelerators for the business. Businesses used to have to wait for their IT depart-
ments to procure and install all of the underlying infrastructure, application
stack, and build and maintain the entire solution. Depending on the size of the
application, this could take several months or even years.
In addition to the cloud service models, technologies like serverless comput-
ing, containers, and fully managed services (for example, databases as a service,
blockchain as a service, and streaming as a service) are providing capabilities for
developers to build systems much faster. We will discuss each of these concepts
in more detail in Chapter 2.
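As one small illustration, here is a minimal sketch of a serverless function, assuming AWS Lambda’s Python handler convention (the event shape is illustrative). Everything beneath the function, including servers, scaling, and patching, is the platform’s responsibility:

    import json

    def handler(event, context):
        """Lambda-style entry point: runs on demand, no servers to manage."""
        order_id = event.get("order_id", "unknown")
        return {
            "statusCode": 200,
            "body": json.dumps({"order_id": order_id, "status": "received"}),
        }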
All of these new ideas challenge our traditional operating models and pro-
cesses. Adopting them in a silo without addressing the impacts on people, pro-
cesses, and technology across all of the stakeholders involved in the entire SDLC
is a recipe for failure. We will discuss some of the patterns and anti-patterns for
operating models later in this book.
2 | What Needs to Change?

One of the key messages of this book is that success in the cloud cannot be
achieved by only focusing on cloud technology. To succeed at scale in the cloud,
enterprises must make changes not only to the technology, but to the organiza-
tional structures and the legacy processes that are used to deliver and operate
software. Embracing DevOps is a key ingredient to successfully transforming the
organization as it adopts cloud computing. But what is DevOps, really?

Defining DevOps
One of the biggest misperceptions about the term DevOps is that it is a set of
technologies and tools that developers and operators use to automate “all the
things.” DevOps is much more than tools and technologies, and it takes more
than just developers and operators to successfully embrace DevOps in any enter-
prise. Many people will shrug off this debate as nothing more than semantics,
but understanding DevOps is critical for an organization so that they can put
together a good strategy to drive their desired outcomes. If all that DevOps is to
an organization is automation of CI/CD pipelines, they will likely leave out many
important steps required to deliver in the cloud at scale.
There is no official definition of DevOps. The definition Mike came up with in an article back in 2014 is “DevOps is a culture shift or a movement that encourages great communication and collaboration (aka teamwork) to foster building better-quality software more quickly with more reliability.” He went on to add, “DevOps is
the progression of the software development lifecycle (SDLC) from Waterfall to
Agile to Lean and focuses on removing waste from the SDLC.”
But don’t take our word for it; look at the work of the leading DevOps
authors, thought leaders, and evangelists. Gene Kim, co-author of popular DevOps books such as The Phoenix Project, The DevOps Handbook, and The Unicorn Project, stated in his keynote presentation at the DevOps Enterprise Summit in 2017:

“This is my personal definition. I would define DevOps by the outcomes. In my mind, DevOps is those set of cultural norms and technology practices
that enable the fast flow of planned work into operations while preserving
world class reliability, operation, and security.

DevOps is not about what you do, but what your outcomes are. So
many things that we associate with DevOps fits underneath this very
broad umbrella of beliefs and practices—which of course communication
and culture are a part of.”

In their book The IT Manager’s Guide to DevOps, authors Buntel and Shroud
define DevOps as “... a set of cultural philosophies, processes, practices, and tools
that radically removes waste from your software production process.”
The DevOps Handbook opens with “Imagine where the product owners,
Development, QA, IT Operations, and Infosec work together, not only to help
each other, but also to ensure that the overall organization succeeds.”
And finally, the popular and influential book Accelerate, by Forsgren, Hum-
ble, and Kim, describes the basis of their research:

“We wanted to investigate the new ways, methods, and paradigms that
organizations were using to develop software, with a focus on Agile and
Lean processes that extended downstream from development and priori-
tized a culture of trust and information flow, with small cross-functional
teams creating software. At the beginning of the project in 2014, this
development and delivery methodology was widely known as “DevOps,”
and so this was the term we used.”

As you read through these definitions, what are some of the key terms you
see?

• Culture
• Cross-functional teams
• Processes
• Remove waste
• Reliable
• Quality
• Trust
• Sharing
• Collaboration
• Outcomes
• Flow

If DevOps embraces all of these terms, one has to wonder why so many
organizations create a new silo called DevOps and focus on writing automation
scripts without collaborating with the product owners and developers. We see
many companies take their existing operations teams, adopt a few new tools, and
call it DevOps. While these steps are not bad, and can even be viewed as pro-
gress, a non-holistic approach to DevOps will not deliver its promise. DevOps
silos often lead to even more waste in the SDLC because the focus is usually
exclusively on the tools and scripting, and not on trust, culture, collaboration,
removing waste, outcomes, etc.

Why DevOps?
What about DevSecOps? NoOps? AutoOps? AIOps? Aren’t those things
all better than DevOps? Should I be adopting those instead? Well, the answers to those questions all get back to having an outcome-focused definition of DevOps that has tools and techniques as enablers of the
culture and process, not as the goal. To most mature DevOps practition-
ers, DevOps is all of those things. When you are focused on outcomes, you
use the techniques that best help you deliver those outcomes. These
new terms get created to focus on some aspect of the system that a pro-
ponent feels is under-represented (security, automation, AI, etc.). It helps
professionals and consultants feel more leading edge, and their clients/
organization feel current. But make no mistake, there is little truly new in these next-generation buzzwords beyond the old-fashioned (circa 2009) term DevOps. So at the end of the day, you should follow the principles
we discuss in this chapter. Then call it whatever you want.

To understand DevOps, it is critical that we understand its roots. Its evolution started back in 2008. At the Agile 2008 Toronto conference,
Patrick Debois, a data center consultant, and Andrew Shafer, an Agile
developer, met shortly after Debois was the only attendee for Shafer’s
session entitled “Agile Infrastructure.” The two discussed how to use
Agile infrastructure to resolve the bottlenecks and conflicts between
development and operations. They created a group called the Agile Sys-
tems Administration Group to try to improve life in IT. In 2009 at the
O’Reilly Velocity Conference, a presentation given by John Allspaw, Head
of Operations at Flickr, and Paul Hammond, Head of Engineering at Flickr,
titled “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr,” got the
IT community buzzing, including Debois. Deploying multiple times a day
back then was almost unheard of. Inspired by the presentation,
Debois set up a meeting in Belgium where he lived and invited his net-
work on Twitter. He named the conference DevOpsDays and used the
hashtag #DevOps when promoting it on Twitter. Were it not for Twitter’s
140-character limit, we would likely have had a less succinct movement in
the industry.

From this history, it is easy to see why many people think that DevOps is just
about developers and operators working together. However, the DevOps net has
since been cast much more broadly than during its founding days and touches all
areas of the business. Many companies start their DevOps journey looking only
at automating the infrastructure or the application build process, but high-
performing companies that have been on their DevOps journey for several years
are redesigning their organizations and processes both inside and outside of IT.

Typical Maturity Progression


Many companies that are enjoying success in their DevOps journey share these
common beliefs:

• Changing the culture and mindset is a critical success factor.
• Removing waste/bottlenecks from the SDLC helps drive business value.
• Shifting from reactive to proactive operations improves reliability.
• Start somewhere and then continuously learn and improve.
• DevOps and cloud require a new operating model (organization change).

These companies look very different three to four years into their journey as they
continue to embrace change and continuous learning. Table 2-1 shows a typical
order in which companies address bottlenecks to improve delivery and business
outcomes. Usually as they make strides removing one bottleneck (for example, inconsistent environments), they then progress to resolve their next biggest bottleneck (for example, security).

Table 2-1. Bottlenecks and pain points

Pattern | Bottleneck/Pain Point | Solution
1 | Nonrepeatable, error-prone build process | Continuous integration (CI)
2 | Slow and inconsistent environment provisioning | Continuous delivery (CD)
3 | Inefficient testing processes/handoffs | Shift testing left/test automation
4 | Resistance from security team, long and painful approval processes | Shift security left, DevSecOps
5 | Painful handoff to operations teams, forced to use legacy tools/processes, poor MTTR | Shift ops left, new operating models (platform teams, SRE, etc.)
6 | Slow SLAs from Tier 1-3 support | Shift support left, new operating models
7 | Slow and painful approval processes for GRC (governance, risk, and compliance) | Shift GRC left, stand up cloud GRC body

This is just a subset of the types of problems and the corresponding changes
that are often implemented to remove the bottlenecks. There are also large initia-
tives for reskilling the workforce and rethinking the future of work. The way we
incent workers must change to achieve the desired outcomes. Procurement pro-
cesses must change as we shift from licensing and maintenance to pay-as-you-go
models. Basically every part of the organization is impacted.

A New Operating Model


As you can see from the list, there are a lot of changes that fundamentally chal-
lenge how IT has operated for years. This transformation is bigger than CI/CD
pipelines or Terraform templates. The organizational change, culture change,
thinking and acting differently, and modernizing how work gets done while leveraging new tools and technologies is indeed the heart and soul of DevOps. Sure,
there are teams within large organizations that are having great success focusing
mostly on the technology. But to scale DevOps across an organization, a new
operating model is required. A fundamental re-engineering of IT is now happen-
ing in many large companies. There is great irony that technology and IT were
often significant contributors to the waves of re-engineering we’ve seen in the
enterprise over the last several decades, while IT themselves were largely run-
ning the same processes and structure despite major changes in the underpin-
ning technology. Even fairly large advances in methods in IT, most notably Agile,
only changed processes in parts of silos, rather than holistically looking at the IT
function.
About the Authors
Mike Kavis has served in numerous technical roles, including CTO, chief architect, and VP positions, and has more than 30 years of experience in software development
and architecture. A pioneer in cloud computing, Kavis led a team that built the
world’s first high-speed transaction network in Amazon’s public cloud and won
the 2010 AWS Global Startup Challenge. Kavis is the author of Architecting the
Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and
IaaS).

Ken Corless is a principal in Deloitte’s Cloud practice, serving as the CTO. As CTO, he is responsible for evangelizing the use of Cloud at enterprise scale, pri-
oritizing Deloitte investment in cloud assets, and driving technology partnerships
in the ecosystem. Ken also remains very client-facing, responsible for solutioning
and delivery of some of our most complex client engagements. Ken’s experience
brings together digital, cloud, and emerging technologies to help clients create
breakaway products, services, and processes. He has received industry accolades
for his leadership, innovative solutions to business problems, and bold
approaches to disruption, including being named to Computerworld Premier
100 IT Leaders and CIO magazine’s Ones to Watch. Additionally, Ken was the
lead author for several Deloitte Tech Trends (Inevitable Architecture, Reengin-
eering Technology) and has written several articles for the Wall Street Journal
CIO Edition. Prior to his current role, Ken spent 28 years at Accenture perform-
ing both consulting and internal IT leadership roles.
