Cloud Operations
Optimizing the Enterprise for Speed and Agility
Michael Kavis & Ken Corless
Preview Edition
Compliments of
WHAT LIES BENEATH: A PODCAST
FOR CURIOUS IT ARCHITECTS
Our new podcast offers a deep dive into the technology and trends that drive modern IT.
Each episode features a conversation with an IT expert about the technologies, processes,
and cultural trends that impact today's dynamic IT organizations.
https://whatliesbeneath.fireside.fm/
Mastering Cloud Operations
Optimizing the Enterprise for Speed and Agility
Copyright © 2020 Michael J. Kavis and Ken Corless. All rights reserved.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or [email protected].
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Cloud Operations,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author(s), and do not represent the publish-
er’s views. While the publisher and the author(s) have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and the
author(s) disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the infor-
mation and instructions contained in this work is at your own risk. If any code samples or other
technology this work contains or describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that your use thereof complies with
such licenses and/or rights.
This work is part of a collaboration between O’Reilly and F5 Networks. See our statement of edi-
torial independence.
978-1-492-05588-4
Introduction: Why Do We Need a New Operating Model for Cloud?
In 2006, Amazon launched Amazon Web Services (AWS), which provided compute, storage, and queuing services on demand, giving birth to the term cloud computing. While timesharing of computers had existed for decades, Amazon’s “software-defined everything” approach was groundbreaking. For the first time, enterprises could deploy software on virtual data centers without ever having to purchase or lease physical infrastructure on-premises or at a hosted facility. In fact, they didn’t have to talk to a single human being. Enterprises could now run their software on Amazon’s data centers and forgo the process of procuring, racking and stacking, and managing physical infrastructure.
We have all heard how developers flocked to the cloud because they were freed from long procurement cycles. AWS provided capabilities that allowed developers to self-provision, manage, and change compute, network, and storage resources on demand from a self-service portal. Not only was the infrastructure self-service, it was scriptable—developers could now write code to provision infrastructure instead of relying on another group to procure, install, and alter hardware and its configurations. The result? Drastically improved time to market. Almost overnight, the line between development (Dev) and operations (Ops) began to blur.
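To make the idea of scriptable infrastructure concrete, here is a minimal sketch using the AWS SDK for Python (boto3); the machine image ID, instance type, and tag values are illustrative assumptions, not a prescription:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Provision a virtual server with a single API call instead of a
    # procurement ticket and a rack-and-stack work order.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical machine image ID
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "team", "Value": "web"}],
        }],
    )
    print(response["Instances"][0]["InstanceId"])

Because this is just code, it can be versioned, reviewed, and rerun on demand: exactly the properties that collapsed time to market.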
The term “shift left” refers to roles and responsibilities moving from downstream processes (e.g., testing or operations) and organizations into the development process and early planning phases. As domain areas like quality assurance (QA), security, networking, storage, and others shift left into development organizations, existing organizational structures, roles, and responsibilities become disrupted. This disruption leads to a variety of challenges: political infighting, increased risk, legacy process bottlenecks, skill gaps, and much more. All of these challenges make cloud adoption extremely hard within large organizations and can lead to failed cloud adoption, poor return on investment (ROI), and soaring costs. Cloud adoption is fundamentally hard; big IT organizations are generally resistant to change given the risk-averse nature of most IT shops. We have seen many of our clients take on an odd strategy: adopt the cloud, but try to change as little as possible. As we will cover throughout the book, this approach has rarely led to the desired outcomes.
One of the fastest ways for a CIO to get fired is to lead an unsuccessful cloud transformation. Too often, cloud leaders focus a disproportionate amount of time and money on cloud technologies, with little to none focused on organizational change management and process reengineering. It takes investment in all three levers (people, process, and technology) to succeed in the cloud. Transformation does not happen by buying a bunch of new tools. This book will focus on the people and process parts of the equation, with the lens on cloud operations.
Throughout this book, we will discuss the common challenges that large organizations face when trying to operate in the cloud at scale. We will share our experiences as practitioners who have built and operated born-in-the-cloud companies, as IT leaders in Fortune 500 companies, and as long-time consultants with over 100 enterprise clients across the globe.
Before we get started, let’s clearly define what a cloud operating model is. An
operating model is a visual representation of how an organization delivers value
to its internal and external customers. A cloud operating model is a visual repre-
sentation of how an organization delivers value from cloud services to its custom-
ers. The cloud operating model encompasses both organizational structure and a
process flow model. A cloud operating model is a subset of an overall IT operat-
ing model.
Another term we will discuss throughout this book is cloud operations or
CloudOps for short. In this book, CloudOps refers specifically to the people, pro-
cess, and technology used to manage the operations of cloud software and cloud
platforms. CloudOps focuses on operating production workloads in the cloud
across any cloud service model: Infrastructure as a Service (IaaS), Platform as a
Service (PaaS), and Software as a Service (SaaS).
What you should take away from this book is that building any old piece of
software in the cloud is relatively easy, but building and operating software in the
cloud at scale that is secure, compliant, highly available, and resilient is a very
complex and challenging undertaking. If you expect (or your vendor is telling
you) that by moving to the cloud these things become easy, you are in for a rude
awakening. Traditional organizational structures and processes designed for
deploying software on physical infrastructure in a data center do not translate
well to the new world of cloud computing. This book will share best practices,
antipatterns, and lessons learned from over 10 years of experience and tons of
battle scars.
Now let’s dive in!
The Shift to the Cloud
The rate of innovation is faster today than ever before. CIOs have an extremely
tough job balancing “keeping the lights on” with delivering new features to mar-
ket and keeping current with investments in new technologies. Pick up any trade
magazine and you will see success stories of large companies adopting emerging
technologies such as cloud computing, machine learning, artificial intelligence,
blockchain, and many others. At the same time, companies are trying to adopt
new ways to improve agility and quality. Whether it’s DevOps, Scaled Agile
Framework (SAFe), Site Reliability Engineering (SRE), or another favorite buzz-
word, there is so much change coming at CIOs that it’s a full-time job just keep-
ing up. Each new trend is designed to solve a specific set of problems, but it often
takes a combination of tools, trends, methodologies, and best practices to deliver
cloud at scale.
The challenge this presents is that many IT people are not sufficiently skilled
to take on these new roles effectively. It only takes one server with an open port
to the web to cause your chief information security officer to lock everything
down to the point where nobody can get timely work done in the cloud. When
domain expertise shifts without corresponding changes in existing organiza-
tional structures, roles, responsibilities, and processes, the result is usually unde-
sirable—and sometimes even catastrophic. IT leaders need to step back and
redesign their organizations to build and run software in the cloud at scale. That
means rethinking the entire software development value stream, from business value ideation to business value running in production. Additionally,
since most journeys to the cloud take years, organizations must harmonize their
processes across their legacy environments and new cloud environments.
Cloud Transformation
When we talk to companies about cloud transformation, we first try to establish what we mean by the term. Foundationally, a cloud transformation is the journey of a company to achieve two separate but related goals: to create a faster, cheaper, safer IT function, and to use its capabilities to build its future. Companies can achieve the first goal by leveraging the wonderful array of new technologies available on demand in the cloud, from Infrastructure as a Service (IaaS) offerings to artificial intelligence (AI), machine learning, the Internet of Things (IoT), big data, and other exciting new building blocks of digital business. Today, every company is a technology company. Those who use the best technology to build digital products, services, and customer experiences are best positioned in the global marketplace. A cloud transformation is truly a path to the future for any CIO.
No two cloud transformations are the same, but the patterns for success and
failure are very common. Many companies that succeed in the cloud have
learned tough lessons about the flaws in their initial thoughts and strategies.
Nobody gets it right at the beginning, but if you start your transformation expecting some bumps and bruises along the way, you can start making progress. “Progress over perfection” is a principle in the spirit of the Agile Manifesto that is very appropriate for the cloud. In fact, many born-in-the-cloud tech companies, like Netflix and Etsy, are on their second or even third generation of cloud architectures and approaches. The pace of change continues to increase once you arrive in the cloud.
Your company’s culture must embrace transparency and continuous learn-
ing; expect to adjust constantly and improve your cloud transformation over time.
Your authors, Mike and Ken, attend a few conferences each year, such as AWS
re:Invent, Google Cloud Next, and DOES (DevOps Enterprise Summit), and we
hear great customer success stories. If you’re working at a company that hasn’t
achieved that level of success, you can get disheartened at these conferences
because it can seem like all the other companies are getting it right. However,
many of those companies are battling with the same issues. Don’t be fooled;
most of the presenters’ success stories only represent a single team (or product
line or business unit) within a very large organization, not the entire organiza-
tion. They started their journey with challenges similar to those your company is
facing, and much of their organization may still be in the very early stages. Keep
your chin up. This book will share a lot of lessons learned for what to do and,
more importantly, what not to do as you embark on your cloud journey. Getting
started can be hard, even daunting, but remember the words of Chinese philoso-
pher Lao Tzu: “A journey of a thousand miles begins with a single step.”
A cloud transformation is a multiyear journey that is never really complete. The term cloud may be dropped one day (just as we no longer really say “client-server” systems), but operating (and optimizing) in the cloud is a never-ending journey. What’s more important than getting it right at the beginning is actually starting. Too many organizations get caught up in trying to create the perfect strategy with low tolerance for risks and failures. These companies often have only years of strategy documents, PowerPoint decks, and a few consulting bills to show for their efforts, even as their competitors keep advancing. Why have their efforts created so little business value?
The reason some companies don’t get very far is that they don’t see the cloud as a transformation. They see it only as a technology project, like adopting a new tool. Being too conservative to let the company move forward and do anything significant in the cloud might be more of a failure than moving to the cloud and running into problems with availability and resiliency. At least with the latter the company is gaining experience with the cloud and increasing its maturity. Slowing or stopping a cloud transformation “done wrong” is very similar to what we saw in the 1990s, when companies stopped moving from mainframes to client-server architectures due to “client-server done wrong.”
When companies don’t recognize the need to transform their organization to
build, operate, and think differently about software, they take their old business
processes, tooling, and operating model with them to the cloud—which almost
always results in failure. Even worse, sometimes they then declare victory—“Yes,
we have a cloud!”—thereby further inhibiting the business.
Note
This image will be available in the final release of the book.
Mike explains:
When I first started consulting in the cloud computing space, most of the client requests were for either a TCO (total cost of ownership) or an ROI analysis for a cloud initiative or an overall cloud strategy. Many leaders had a hard sell to convince their CEO and board that cloud computing was the way forward. The cloud, like virtualization before it, was viewed as a cost-saving measure achieved by better utilization of resources. At that time, about 80% of the requests focused on private cloud, while only 20% were for the public cloud, almost exclusively AWS. In November 2013, at the annual re:Invent conference, AWS announced a wide variety of enterprise-grade security features. Almost immediately, our phone rang off the hook with clients asking for public cloud implementations. A year later our work requests had completely flipped, with over 80% for public cloud and 20% for private cloud.
We have seen some companies jump all-in to the public cloud with positive results. As the adoption of public cloud increased, companies moved or built new workloads in the cloud at rates much faster than they had traditionally deployed software. However, two common antipatterns emerged.
Ken relates this story from his very early experiences with cloud adoption:
The cloud infrastructure team showed me how they had automated the provisioning of an entire SAP environment. What had previously taken months could now be done in a few hours! With excitement in my stride, I strolled over to the testing team.
“You must be thrilled that we can now provision a whole new SAP line
in a few hours!” I exclaimed.
“What are you talking about?” they asked. I went on to explain what I
had just learned about the automated provisioning of the SAP environ-
ment.
“Well, that’s not how it really works,” they told me. “If we want a new
environment, we have to create a Remedy ticket requesting it. Our envi-
ronment manager has five business days to review it. If he approves it
without questioning it (he never does), it then goes to finance for approval.
They meet monthly to review budget change requests. If the budget is
approved, we then need to submit it to the architecture review board.
They meet the third Friday of every month. That process typically requires
at least two cycles of reviews. So I’m thrilled that someone can do it in a
few hours, but I’m still looking at several months.”
Totally deflated, I realized the truth in the old saying: “We have met the enemy, and he is us.”
The two antipatterns mentioned here drove a lot of work over the next few years. In the wild west pattern, production environments became unpredictable and unmanageable due to a lack of rigor and governance. There were regular security breaches because teams did not understand the fundamentally different security postures of the public cloud, believed that security was “somebody else’s job,” or both. The command-and-control pattern created very little value while requiring large amounts of money for ongoing strategy and policy work, building internal platforms that did not meet developers’ needs. Worse yet, this created an insurgency of shadow IT: groups or teams running their own mini IT organizations because their needs were not being met through normal channels.
All of these issues have created an awareness of the need for a strong focus on cloud operations and a new cloud operating model. Since 2018, one of our clients’ most frequent requests has been for help modernizing operations and designing new operating models.
Many of the companies we work with are two or three years or more into their journey. In the initial years, they pay a lot of attention to cloud technologies. They improve their technical skills for building software and guardrails in the cloud. They often start at the IaaS layer because they are comfortable dealing with infrastructure. As their cloud experience matures, they realize that the true value of cloud is higher up in the cloud stack, and they look into PaaS and SaaS services.
Development shops have been embracing automation and leveraging concepts like continuous integration (CI) and continuous delivery (CD). The rest of this book will focus on the impact of concepts like DevOps, cloud-native architecture, and cloud computing on traditional operations.
When applications are moved, refactored, or built new on the cloud, they are being deployed to a brand new virtual environment that is radically different from the environments that people are used to in their existing data centers. The processes and policies governing how work gets done in a data center have typically evolved over many years of change, across numerous shifts in technology: mainframes, client-server architectures, internet-enabled applications, and today’s modern architectures. Many of these processes were defined in a different era: in the 1980s, a gigabyte of storage cost hundreds of thousands of dollars. In the cloud, it’s about two cents per month. Human labor was the cheap component. Along with these legacy processes come a whole host of tools, many of which are legacy themselves and were never intended to support software that runs in the cloud.
Too often, teams from infrastructure, security, GRC (governance, risk, and compliance), and other domain-specific areas insist on sticking to their existing tools and processes. If these tools are not cloud native, or at least cloud friendly, a painful integration must take place to make them work effectively.
This creates unnecessary friction for getting software out the door. It can also create complexity, which can increase costs, reduce performance, and even reduce resiliency. Another issue is that these legacy tools are often tied to legacy processes, which makes it challenging, and sometimes impossible, to automate the end-to-end software build and release processes.
Another common antipattern is the desire to keep existing on-premises logging solutions in place rather than moving to a cloud-native solution. When you do this, all logs must be sent from the public cloud back to the data center through a private channel, incurring data transfer costs and creating an unnecessary dependency on the data center. These legacy logging solutions often have dependencies on other software solutions, as well as on processes that create dependencies between the cloud and the data center. This means that a change in the data center can cause an outage in the cloud because nobody knew of the dependency. These issues are very hard to debug and fix quickly.
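As a hedged illustration of one cloud-native alternative (Amazon CloudWatch Logs, via boto3), here is a minimal sketch that writes application logs directly to the CSP’s logging service; the log group and stream names are hypothetical:

    import time
    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    group, stream = "/acme/webapp", "web-1"  # hypothetical names
    logs.create_log_group(logGroupName=group)  # one-time setup
    logs.create_log_stream(logGroupName=group, logStreamName=stream)

    # Logs stay in the cloud provider's managed service; nothing is
    # shipped back to the data center over a private channel.
    logs.put_log_events(
        logGroupName=group,
        logStreamName=stream,
        logEvents=[{"timestamp": int(time.time() * 1000),
                    "message": "user login succeeded"}],
    )

In practice an agent or logging library would batch these calls, but the dependency story is the point: the logging path never leaves the cloud.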
Here is another example. We assessed a client’s tools, identified which would work well in the cloud, and advised them on which should be replaced by a more cloud-suitable solution. One tool we recommended replacing dealt with monitoring incoming network traffic. The group that managed it refused to look into a replacement because they were comfortable with the existing tool and didn’t want to manage two tools. This created a single point of failure for all of the applications and services running in their public cloud. One day the tool failed, no traffic was allowed to flow to the public cloud, and all cloud applications went down.
The lesson here is to evaluate replacements for tools that are not well suited for the cloud. We often see resistance to process change lessen when new tools are brought in; tools are often tightly tied to the processes they are used in. Try to reduce the number of dependencies that applications running in the public cloud have on the data center, and have a plan to mitigate the failure of any remaining data center dependency.
The new cloud operating model that we are advocating brings domain
experts closer together. We hope it will reduce these avoidable incidents as com-
panies rethink their approach to the cloud.
Consider the difference between assets that are owned versus assets that are rented, and the responsibilities that go along with each. When you buy a house, you are investing in both the property and the physical structure(s) on that property. You are responsible for maintaining the house, landscaping, cleaning, and everything else that comes with home ownership. When you rent, you are paying for the time that you inhabit the rental property; it is the landlord’s responsibility to maintain it. The biggest difference between renting and buying is what you, as the occupant of the house, have control over. (And just as people get more emotionally attached to homes they own than to rented apartments, plenty of infrastructure engineers have true emotional attachments to their servers and storage arrays.)
When you leverage the cloud, you are renting time in the cloud provider’s “house.” What you control is very different from what you control in your own data center. For people who have spent much of their careers defining, designing, and implementing processes and technologies for the controls they are responsible for in their data center, shifting some of those controls to a third party can be extremely challenging.
The two groups who probably struggle the most to grasp the cloud shared-responsibility model are auditors and GRC teams. These teams have processes and controls for physically auditing data centers. When you pause to think about it, physically evaluating a data center is a bit of a vestigial process. Sure, 50 years ago nearly all IT processes (including application development) probably happened in the data center building, but today many data centers run with skeleton crews, and IT processes are distributed across many locations, often globally. Yet auditors expect to be able to apply these exact processes and controls in the cloud. The problem is, they can’t. Why? Because these data centers belong to the cloud service providers (CSPs), who have a duty to keep your data isolated from their other clients’ data. Would you want your competitor walking on the raised floor at Google where your software is running? Of course not. That’s just one simple example.
At one meeting we attended, a representative of one of the CSPs was explaining how they handle live migrations of servers, which they can run at any time during the day with no impact on customers. The client was adamant about getting all of the CSP’s logs to feed into the company’s central logging solution. Under the shared-responsibility model, the CSP is responsible for logging and auditing the infrastructure layer, not the client. The client was so used to being required to store this type of information for audits that they simply would not budge. We finally had to explain that under the new shared-responsibility model, that data would no longer be available to them. We asked where they stored the logs for failed sectors in their disk arrays and how they logged CRC (cyclic redundancy check) events in their CPUs. Of course, they didn’t.
We explained to the client that they would have to educate their audit team and adjust their processes. To be clear, the policy that required the client to store those logs is still valid; how you satisfy that policy in the cloud is completely different. If the auditors or GRC teams cannot change their mindset and come up with new ways to satisfy their policy requirements, they might as well not go to the public cloud. But does an auditor or a GRC team really want to hold an entire company back from leveraging cloud computing? Should the auditor be making technology decisions at all? A key task in the cloud modernization journey is the education of these third-party groups that have great influence in the enterprise. As technology becomes more capable and automated, the things that we have to monitor will change, because the risk profile is changing fundamentally.
In the data center world, teams are traditionally organized around skill domains as they relate to infrastructure. It is common to find teams responsible for storage, for network, for servers, for operating systems, for security, and so forth. In the cloud, much of this infrastructure is abstracted and available to developers as an API call. The need to create tickets asking another team to perform a variety of tasks to stand up physical infrastructure, like a SAN (storage area network), simply does not exist in the public cloud. Developers have access to storage as a service and can simply write code to provision the necessary storage. This self-service ability is crucial to enabling one of the prizes of cloud transformation: higher-velocity IT.
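As a small, hedged example of what “storage as a service” looks like in practice, the following sketch provisions an object storage bucket and writes to it in a few lines of Python (boto3 against Amazon S3 here; the bucket and key names are illustrative):

    import boto3

    s3 = boto3.client("s3")  # assumes a default region of us-east-1

    # No ticket, no storage team, no SAN: the "provisioning" is an API call.
    s3.create_bucket(Bucket="acme-app-data-example")  # illustrative name
    s3.put_object(
        Bucket="acme-app-data-example",
        Key="orders/2020-01-01.json",
        Body=b'{"order_id": 1}',
    )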
Networking teams in the data center leverage third-party vendors who provide appliances, routers, gateways, and many other important tools required to build a secure, compliant, and resilient network. Many of these features are available as a service in the cloud. For areas where the cloud providers don’t provide the necessary network security functionality, there are many third-party SaaS or pay-as-you-go solutions available, either directly from the vendor or from the CSP’s marketplace. Procuring these solutions in the cloud, where they are consumed as SaaS, PaaS, or IaaS, is different from how similar tools are procured for the data center. In the public cloud, there are usually no physical assets being purchased. Gone are the days of buying software and paying 20-25% of the purchase price for annual maintenance. In the cloud you pay for what you use, and pricing is usually consumption based.
12 | MASTERING CLOUD OPERATIONS
Let’s say that Acme Retail, a fictitious big-box retailer, has standardized on Oracle for all of its OLTP (online transaction processing) database needs and Teradata for its data warehouse and NoSQL needs. Now a new business requirement comes in, and Acme needs a document store database to address it. In the non-cloud model, adopting a document store database would require new hardware, new software licensing, database administrators (DBAs) trained in the new database technology, new disk storage, and many other stack components. The process required to get all of the relevant hardware and software approved, procured, implemented, and secured would be significant. In addition, Acme would need to hire or train DBAs to manage the database technology.
Now let’s look at how much simpler this can be in the public cloud. If Acme’s CSP of choice offers a managed service for a document store database, most of these steps are eliminated entirely. Acme no longer needs hardware, software licensing, additional DBAs to manage the database service, or new disk storage devices; in fact, it needs no procurement services at all. All Acme needs to do is learn how to use the API for the document store database to start building its solution. Of course, Acme would still need to work with the security team, the network team, the governance team, and others to get approvals, but much of the time it takes to adopt the new technology is reduced. In Chapter 2, we will discuss shifting responsibilities closer to development (shifting left). But for the sake of this example, let’s assume that we still have silos for each area of technology expertise (network, storage, server, security, etc.).
Let’s say I have 120 days to deliver the new feature to the business. The best solution for the requirement would be a document store database like MongoDB. However, we estimate that the effort required to get approvals, procure all of the hardware and software, train or hire DBAs, and implement the database would exceed our 120-day window. Therefore, we decide to leverage Oracle, our relational database engine, to solve the problem. This is suboptimal, but at least we can meet our date. And remember, this is just the burden to get started; we are not yet factoring in the burden that comes from day-to-day change management as we define and build the solution.
This decision process repeats itself over and over from project to project,
which results in a ton of technical debt because we keep settling for suboptimal
solutions due to our constraints. Now let’s see how different this can all be if we
are allowed to embrace a database-as-a-service solution in the public cloud.
After doing some testing in our sandbox environment in the cloud, we determine that the document store managed service on our favorite CSP’s platform is a perfect solution for our requirements. We can essentially start building our solution right away because the database is already available in a pay-as-you-go model, complete with autoscaling. Leveraging stack components as a service can cut months from a project. It also allows you to embrace new technologies with a lot less risk. But most importantly, you no longer have to make technology compromises because of the legacy challenges of adopting new stack components. Unfortunately, we do see some companies, especially early in their cloud journeys, destroy this model by using the new technology with their old processes (approvals, hand-offs, and SLAs). This change-resistant approach is why we are so adamant about linking new process with new technology.
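To show how little stands between the team and a working solution, here is a minimal sketch against a managed document store (Amazon DynamoDB via boto3, as one plausible choice); the table name and item shape are hypothetical, and the table is assumed to already exist in pay-as-you-go mode with "sku" as its key:

    import boto3

    ddb = boto3.resource("dynamodb")
    table = ddb.Table("AcmeProducts")  # hypothetical table, assumed to exist

    # Store and fetch a JSON-like document through the service's API:
    # no servers, licenses, or DBAs needed to get started.
    table.put_item(Item={"sku": "1234", "name": "anvil",
                         "tags": ["hardware"]})
    item = table.get_item(Key={"sku": "1234"})["Item"]
    print(item["name"])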
IMMUTABLE INFRASTRUCTURE
Rather than changing servers in place through manual patching and configuration processes, we can implement a solution like the following process, designed for web servers sitting behind a load balancer.
In this example, we have three web servers sitting behind a load balancer.
The approach is to deploy the new software to three brand new VMs and to take a
snapshot of the current VMs in case we need to back out changes. Then we
attach the new VMs to the load balancer and start routing traffic to them. We can
either take the older VMs out of rotation at this point or simply make sure no
traffic gets routed to them. Once the new release looks stable, we can shut down
and detach the old VMs from the load balancer to complete the deployment. If
there are any issues, we can simply reroute traffic to the older VMs, and then
shut down the new VMs. This allows us to stabilize the system without introduc-
ing the risk of backing out the software. Once the system is stable, we can fix the
issues without worrying about creating any new issues from a complex rollback
scheme.
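As a sketch of how this swap might be automated (using boto3 against an AWS load balancer target group; the ARN and instance IDs are placeholders, and health checks and error handling are omitted):

    import boto3

    elb = boto3.client("elbv2")

    TARGET_GROUP = "arn:aws:elasticloadbalancing:..."  # placeholder ARN
    new_vms = ["i-new1", "i-new2", "i-new3"]  # built from the new release
    old_vms = ["i-old1", "i-old2", "i-old3"]  # current release, kept for rollback

    # Attach the new VMs so the load balancer starts routing traffic to them.
    elb.register_targets(TargetGroupArn=TARGET_GROUP,
                         Targets=[{"Id": i} for i in new_vms])

    # Take the old VMs out of rotation; they keep running until the new
    # release looks stable, so a rollback is simply re-registering them.
    elb.deregister_targets(TargetGroupArn=TARGET_GROUP,
                           Targets=[{"Id": i} for i in old_vms])

Only after the release is judged stable would the old VMs be shut down, completing the deployment.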
This is just one of many ways to embrace the concepts of immutable infra-
structure for deployments. Companies with deep experience in this area can
design advanced processes to create highly resilient systems, all of which can be
fully automated.
MICROSERVICES MONOLITHS
We now build and deploy software faster than has ever been done before. Cloud
computing is one of the strongest contributors to that speed of deployment.
The cloud providers offer developers a robust service catalog that abstracts the underlying infrastructure and a variety of platform services, allowing developers to focus more on business requirements and features. When cloud computing first became popular in the mid-to-late 2000s, most people used the Infrastructure as a Service model. As developers became more experienced, they started leveraging higher levels of abstraction. Platform as a Service abstracts away both the infrastructure and the application stack (operating system, databases, middleware, etc.). Software as a Service vendors provide full-fledged software solutions, where enterprises only need to make configuration changes to meet their requirements and manage user access.
Each of these three cloud service models (IaaS, PaaS, SaaS) can be a huge accelerator for the business. Businesses used to have to wait for their IT departments to procure and install all of the underlying infrastructure and application stack, and then build and maintain the entire solution. Depending on the size of the application, this could take several months or even years.
In addition to the cloud service models, technologies like serverless computing, containers, and fully managed services (for example, databases as a service, blockchain as a service, and streaming as a service) provide capabilities for developers to build systems much faster. We will discuss each of these concepts in more detail in Chapter 2.
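As one tiny, hedged illustration of the serverless idea, here is a complete function in the shape of an AWS Lambda handler; the event field is invented for the example:

    import json

    def handler(event, context):
        # The provider runs, scales, and patches the compute; the developer
        # supplies only this function.
        name = event.get("name", "world")  # "name" is an invented event field
        return {"statusCode": 200,
                "body": json.dumps({"greeting": "Hello, " + name})}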
All of these new ideas challenge our traditional operating models and processes. Adopting them in a silo, without addressing the impacts on people, processes, and technology across all of the stakeholders involved in the entire SDLC, is a recipe for failure. We will discuss some of the patterns and antipatterns for operating models later in this book.
What Needs to Change?
One of the key messages of this book is that success in the cloud cannot be
achieved by only focusing on cloud technology. To succeed at scale in the cloud,
enterprises must make changes not only to the technology, but to the organiza-
tional structures and the legacy processes that are used to deliver and operate
software. Embracing DevOps is a key ingredient to successfully transforming the
organization as it adopts cloud computing. But what is DevOps, really?
Defining DevOps
One of the biggest misperceptions about the term DevOps is that it is a set of
technologies and tools that developers and operators use to automate “all the
things.” DevOps is much more than tools and technologies, and it takes more
than just developers and operators to successfully embrace DevOps in any enter-
prise. Many people will shrug off this debate as nothing more than semantics, but understanding DevOps is critical for an organization so that it can put together a good strategy to drive its desired outcomes. If all that DevOps is to an organization is automation of CI/CD pipelines, it will likely leave out many important steps required to deliver in the cloud at scale.
There is no official definition of DevOps. The definition Mike came up with in an article back in 2014 is “DevOps is a culture shift or a movement that encourages great communication and collaboration (aka teamwork) to foster building better-quality software more quickly with more reliability.” He went on to add, “DevOps is the progression of the software development lifecycle (SDLC) from Waterfall to Agile to Lean and focuses on removing waste from the SDLC.”
But don’t take our word for it; look at the work of the leading DevOps
authors, thought leaders, and evangelists. Gene Kim, coauthor of popular DevOps books such as The Phoenix Project, The DevOps Handbook, and The Unicorn Project, stated in his 2017 keynote presentation at the DevOps Enterprise Summit:

DevOps is not about what you do, but what your outcomes are. So many things that we associate with DevOps fits underneath this very broad umbrella of beliefs and practices—which of course communication and culture are a part of.
In their book The IT Manager’s Guide to DevOps, authors Buntel and Shroud define DevOps as “... a set of cultural philosophies, processes, practices, and tools that radically removes waste from your software production process.”

The DevOps Handbook opens with: “Imagine a world where product owners, Development, QA, IT Operations, and Infosec work together, not only to help each other, but also to ensure that the overall organization succeeds.”
And finally, the popular and influential book Accelerate, by Forsgren, Humble, and Kim, describes the basis of their research:

We wanted to investigate the new ways, methods, and paradigms that organizations were using to develop software, with a focus on Agile and Lean processes that extended downstream from development and prioritized a culture of trust and information flow, with small cross-functional teams creating software. At the beginning of the project in 2014, this development and delivery methodology was widely known as “DevOps,” and so this was the term we used.
As you read through these definitions, what are some of the key terms you
see?
• Culture
• Cross-functional teams
• Processes
• Remove waste
• Reliable
• Quality
• Trust
• Sharing
• Collaboration
• Outcomes
• Flow
If DevOps embraces all of these terms, one has to wonder why so many organizations create a new silo called DevOps and focus on writing automation scripts without collaborating with product owners and developers. We see many companies take their existing operations teams, adopt a few new tools, and call it DevOps. While these steps are not bad, and can even be viewed as progress, a non-holistic approach to DevOps will not deliver on its promise. DevOps silos often lead to even more waste in the SDLC because the focus is usually exclusively on the tools and scripting, and not on trust, culture, collaboration, removing waste, outcomes, etc.
Why DevOps?
What about DevSecOps? NoOps? AutoOps? AIOps? Aren’t those things all better than DevOps? Should I be adopting those instead? Well, the answers to those questions all get back to having an outcome-focused definition of DevOps that treats tools and techniques as enablers of the culture and process, not as the goal. To most mature DevOps practitioners, DevOps is all of those things. When you are focused on outcomes, you use the techniques that best help you deliver those outcomes. These new terms get created to focus on some aspect of the system that a proponent feels is underrepresented (security, automation, AI, etc.). They help professionals and consultants feel more leading edge, and their clients or organizations feel current. But make no mistake: there is little truly new in these next-generation buzzwords compared to the old-fashioned (circa 2009) term DevOps. So at the end of the day, follow the principles we discuss in this chapter. Then call it whatever you want.
From this history, it is easy to see why many people think that DevOps is just about developers and operators working together. However, the DevOps net has since been cast much more broadly than during its founding days and touches all areas of the business. Many companies start their DevOps journey looking only at automating the infrastructure or the application build process, but high-performing companies that have been on their DevOps journey for several years are redesigning their organizations and processes both inside and outside of IT.

These companies look very different three to four years into their journey as they continue to embrace change and continuous learning. Table 2-1 shows a typical order in which companies address bottlenecks to improve delivery and business outcomes. Usually, as they make strides removing one bottleneck (for example, inconsistent environments), they progress to resolving their next biggest bottleneck (for example, security).
This is just a subset of the types of problems and the corresponding changes that are often implemented to remove the bottlenecks. There are also large initiatives for reskilling the workforce and rethinking the future of work. The way we incentivize workers must change to achieve the desired outcomes. Procurement processes must change as we shift from licensing and maintenance to pay-as-you-go models. Basically, every part of the organization is impacted.
Some believe that simply leveraging new tools and technologies is the heart and soul of DevOps. Sure, there are teams within large organizations that are having great success focusing mostly on the technology. But to scale DevOps across an organization, a new operating model is required. A fundamental re-engineering of IT is now happening in many large companies. There is great irony in the fact that technology and IT were often significant contributors to the waves of re-engineering we’ve seen in the enterprise over the last several decades, while IT itself largely kept running the same processes and structures despite major changes in the underpinning technology. Even fairly large advances in IT methods, most notably Agile, only changed processes in parts of silos, rather than holistically looking at the IT function.
About the Authors
Mike Kavis has served in numerous technical roles, including CTO, chief architect, and VP positions, and has more than 30 years of experience in software development and architecture. A pioneer in cloud computing, Kavis led a team that built the world’s first high-speed transaction network in Amazon’s public cloud and won the 2010 AWS Global Startup Challenge. Kavis is the author of Architecting the Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and IaaS).