Patterns of DevOps Culture
eMag Issue 36 - November 2015
Manuel Pais is InfoQ’s DevOps Lead Editor and an enthusiast of Continuous Delivery and Agile practices. Manuel Pais tweets @manupaisable.
A LETTER FROM THE EDITOR
DevOps is a movement. DevOps is a mindset. DevOps is devs and ops working together. DevOps is a way of organizing. DevOps is continuous learning.

The intangibility of DevOps makes it hard for leaders to come up with a clear-cut roadmap for adopting DevOps in their organizations. It requires grass-roots adoption and top-down support. The meaning of DevOps is highly contextual as there are no popular methodologies with prescribed practices to abide by.

However, healthy organizations exhibit similar patterns of behavior, structure, and improvement effort. In this e-mag, we explore some of those patterns through testimonies from their practitioners and through analysis by consultants in the field who have been exposed to multiple DevOps adoption initiatives.

First, Daniel Schauenberg takes a look at Etsy’s blameless postmortems in terms of philosophy, process, and practical measures and guidance to avoid blame and better prepare for the next outage. Because failures are inevitable in complex socio-technical systems, it’s the failure handling and resolution that can improve by learning from postmortems.

Matthew Skelton explains how there are many different team topologies that can work for DevOps. Each topology comes with a slightly different culture, and a team topology suitable for one organization may not be suited to another, even in a similar sector. Skelton explores the cultural differences between team topologies for DevOps to help you choose a suitable DevOps topology for your organization.

Michael Biven, team lead at Ticketmaster, shares lessons he learned as a firefighter on leadership, mentoring, and team chemistry. Biven successfully applied these lessons to the IT world. Leaders need to nurture their teams by complaining up and praising down. They need to let team members take the lead when appropriate. Biven mentions how team chemistry amplifies individuals’ talents as well as the team’s talent. This question sums up the mindset everyone should strive for: “What did you do today that you would be proud of?”

As infrastructure becomes code, review and testing provide the confidence necessary for refactoring and fixing systems. Code reviews are not just for software, Chris Burroughs explains. In fact, they are well suited for infrastructure automation given that some scenarios might be difficult to test (they can be too expensive or take too much time), but changes can be immediately reviewed. Artifacts that should be reviewed include configuration management, deployment scripts, provisioning manifests, packages, and runbooks. Reviews also help spread consistent best practices throughout an organization, Burroughs adds.

Finally, we finish with an insightful interview with Steve Smith and Matthew Skelton, authors of Build Quality In, a collection of experience reports (including their own) on Continuous Delivery (CD) and DevOps initiatives. Report authors range from in-house technical and process leaders to external consultants. Many stories talk about organizational improvements as a pre-condition for success. A key takeaway is how contextual CD and DevOps are, which means that there are no silver bullets but also that different organizations and people can make it work if they keep a focus on people as well as on process and tools.
In the best case, you start out with a Web front end and a database representing your initial feature. And then you add things: features, billing, a background queue, another handful of servers, another database, image uploads, etc. And of course you also hire more people to work on all of this.

At this point, you realize you work in a complex socio-technical system. You did from the beginning, but it gets a little more obvious here. As interactions between components become more complex, understanding of how the whole system works shrinks, and you start to witness emergent behavior that you weren’t aware of before: behavior that can’t be explained by single components but that is a result of the interplay among them. Things start to break a lot more often.

Traditionally, it has never been great to have things break. You are suddenly confronted with a situation you thought wasn’t possible, or you would have put guardrails and protection in place to prevent it. Everybody is caught by surprise. You want the business to run, you want the site to be up. You are working hard to fix the problem while your amygdala is hijacking your brain. Maybe people are running around the office or are asking questions in your chat system. How could this have happened? Why didn’t we have tests for the code path that broke? Why didn’t you know about that failure case? When
as possible. We have to make sure we learn

discussion following the timeline as they provide another set of facts on which to base improvements and remediations.

Once we have arrived at the

for all emergent behavior. Ideally, specialists for the system that was broken as well as specialists for your alerting system will both be present. This is a great opportunity
Matthew Skelton has been building, deploying, and operating commercial software systems
since 1998. Co-founder and principal consultant at Skelton Thatcher Consulting, he specialises in
helping organisations to adopt and sustain good practices for building and operating software
systems: continuous delivery, DevOps, aspects of ITIL, and software operability. He is co-editor
of Build Quality In, a book of Continuous Delivery and DevOps experience reports.
Since then, I have worked on several software systems for organisations with differing team configurations and team cultures, and the relationship between team configuration (“topology”, as I call it) and organisational capability has become something of a fascination for me.

In my talk at QCon London in March 2015 on continuous delivery, I described how we’ve found through working with our clients that the choice of tooling for DevOps should really be informed by Conway’s law to produce the best outcomes for organisations. Conway’s law, brilliantly described by Rachel Laycock in the book Build Quality In, is the weirdly baffling-but-believable observation by Mel Conway in 1968 that “organisations which design systems… are constrained to produce designs which are copies of the communication structures of these organisations”. Given that an IT organisation itself is a kind of system, it follows from Conway’s law that the topology of that system will be shaped by the kinds of communication that we allow or encourage to take place. There is increasing evidence that Conway’s law is hard to bypass.

Since 2013, a few others and I have been collecting and documenting different team topologies
Fig. 4: Type 5 can function as a precursor to Type 2 or 3 topologies, but beware the danger of a permanent DevOps team preventing communication between dev and ops.

ity needs to be consumed “as a service” by application-development teams; to prevent Conway’s law from driving too much coupling between dev teams and the infrastructure team, we have recommended limiting communication between the dev teams and the infrastructure team.

In this case, we anticipate that the infrastructure team members are somewhat deliberately isolated from the application-dev teams and would share less with them than internally amongst themselves, at least on the level of code and tooling. (We’d still want shared lunchtime pizza sessions across the teams to advocate for role rotation and learn about new approaches.)

The culture here would see sharing of some infrastructure-level metrics between the

Site reliability engineering

The culture is again different for organisations that use an SRE team in the Google model (sometimes called WebOps). The SRE team is willing to take on all production responsibility (on call, incident response, etc.), as long as the software produced by the dev team meets stringent operational criteria.

The dev team’s collaboration with the product development team is chiefly limited to helping meet the operational criteria and providing feedback on run-time behaviour via metrics, incident reports, etc. The dev team is deliberately isolated
Michael Biven is a lead systems engineer at Ticketmaster and a former firefighter who has taken
his experience in leadership and emergency management from the fire service and put it to use
in technology since 2000. Before Ticketmaster, he helped build amazing experiences for anyone
and everyone at Ticketfly, Change.org, Beats Music, and Apple.
Before servers and services became my job I worked in the fire service.
During that time, I received one of the best pieces of advice ever given
to me even though it wasn’t offered as such. It was a challenge from
one of the captains in the firehouse: “What did you do today that you
would be proud of?”
The fire service succeeds because it has a culture that’s balanced between pulling your own weight and being team oriented. There is a natural pressure to not let the rest of your people down, and you do that by holding yourself to the highest standard possible.

Compare that with our profession, where we place the focus on technology and process. You may think that attention makes sense as it’s focused on the tools we use to do our job, but there will always be some new technology to learn around the corner.

DevOps focuses our attention on the differences between the teams that make up an organization. The name itself highlights the tension that has existed between development and operations. It encourages empathy by forcing us to address communication and understanding the needs between the various teams that at times appear to be at odds with each other.

Military strategist John Boyd held that you consider “people, ideas, technology, in that order.” Every few decades or so, some new technology or process is brought along as a way to reduce the number of firefighters needed to protect a community. But in the end, it always comes down to the same fact: to do the job, you need the people. And when we need people, we have
It also used gossip to detect failed nodes and to ensure that client requests went to nodes that were up. When no nodes responsible for a particular range of data were available, the database could not satisfy the request and instead returned an error to the client.

This system was working well for us until nodes started going haywire, returning a stream of availability errors on restart (the cause was eventually tracked down to a side effect of a new feature). Getting spooky errors on restart is particularly nerve-wracking because routine operations like upgrades or changing settings to help debug the problem become high-stress affairs.

Fortunately, we found a report of a similar problem with this open-source product and a solution that someone had sketched out. I cleaned up and rebased the patch and, after some internal review and testing, pushed it to our production clusters. The patch worked (!), fixing a vexing production problem before a long holiday. However, after posting the patch with the works-in-production seal of approval, someone pointed out that it would prevent the cluster from even starting in certain configurations. Simply reading the code and reasoning about the system had uncovered a fundamental flaw we had not discovered
by Manuel Pais
Steve Smith is an agile consultant and Continuous Delivery specialist at Always Agile
Consulting Ltd. Steve is a co-author of the continuous delivery and DevOps book Build Quality
In, a co-organizer of the monthly London Continuous Delivery Meetup group, a co-organizer
of the annual PIPELINE conference, and a regular conference speaker. Steve blogs at www.alwaysagileconsulting.com/blog.