DevOps Culture and Practice With OpenShift
Tim Beattie
Mike Hepburn
Noel O'Connor
Donal Spring
BIRMINGHAM—MUMBAI
DevOps Culture and Practice with OpenShift
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of
the publisher, except in the case of brief quotations embedded in critical articles or
reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of
the information presented. However, the information contained in this book is sold
without warranty, either express or implied. Neither the author(s), nor Packt Publishing
or its dealers and distributors, will be held liable for any damages caused or alleged to
have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Authors: Tim Beattie, Mike Hepburn, Noel O'Connor, and Donal Spring
Illustrator: Ilaria Doria
Technical Reviewer: Ben Silverman
Managing Editors: Aditya Datar and Siddhant Jain
Acquisitions Editor: Ben Renow-Clarke
Production Editor: Deepak Chavan
Editorial Board: Vishal Bodwani, Ben Renow-Clarke, Edward Doxey, Alex Patterson,
Arijit Sarkar, Jake Smith, and Lucy Wan
www.packt.com
Praise for DevOps Culture and
Practice with OpenShift
"Creating successful, high-performing teams is no easy feat. DevOps Culture and
Practice with OpenShift provides a step-by-step, practical guide to unleash
the power of open processes and technology working together."
—Jim Whitehurst, President, IBM
"This book is packed with wisdom from Tim, Mike, Noel, and Donal and lovingly illustrated
by Ilaria. Every principle and practice in this book is backed by wonderful stories of the
people who were part of their learning journey. The authors are passionate about visualizing
everything and every chapter is filled with powerful visual examples. There is something for
every reader and you will find yourself coming back to the examples time and again."
—Jeremy Brown, Chief Technology Officer/Chief Product Officer at Traveldoo,
an Expedia Company
"This book describes well what it means to work with Red Hat Open Innovation Labs,
implementing industrial DevOps and achieving business agility by listening to the team. I have
experienced this first hand. Using the approach explained in this book, we have achieved a level
of collaboration and engagement in the team we had not experienced before, the results didn't
take long and success is inevitable. What I have seen to be the main success factor is the change
in mindset among team members and in management, which this approach helped us drive."
—Michael Denecke, Head of Test Technology at Volkswagen AG
"This book is crammed full to the brim with experience, fun, passion, and great practice. It
contains all the ingredients needed to create a high performance DevOps culture...it's awesome!"
—John Faulkner-Willcocks, Head of Coaching and Delivery Culture, JUST
"DevOps has the opportunity to transform the way software teams work and the products they
deliver. In order to deliver on this promise, your DevOps program must be rooted in people. This
book helps you explore the mindsets, principles, and practices that will drive real outcomes."
—Douglas Ferguson, Voltage Control Founder, Author of Magical Meetings
and Beyond the Prototype
"Innovation requires more than ideas and technology. It needs people being well led and the
'Open Leadership' concepts and instructions in DevOps Practice and Culture with OpenShift
should be required reading for anyone trying to innovate, in any environment, with any team."
—Patrick Heffernan, Practice Manager and Principal Analyst,
Technology Business Research Inc.
"Whoa! This has to be the best non-fiction DevOps book I've ever read. I cannot believe how
well the team has captured the essence of what the Open Innovation Labs residency is all
about. After reading, you will have a solid toolbox of different principles and concrete practices
for building the DevOps culture, team, and people-first processes to transform how you use
technology to act as a force multiplier inside your organization."
—Antti Jaakkonen, Lean Agile Coach, DNA Plc
"Fascinating! This book is a must-read for all tech entrepreneurs who want to build scalable
and sustainable companies. Success is now handed to you."
—Jeep Kline, Venture Capitalist, Entrepreneur
"DevOps Culture and Practice with OpenShift is a distillation of years of experience into
a wonderful resource that can be used as a recipe book for teams as they form and develop,
or as a reference guide for mature teams as they continue to evolve."
—David Worthington, Agile Transformation Coach, DBS Bank, Singapore
Table of Contents
Foreword
Preface
Acknowledgements
Index
Foreword
Over the past two decades, as the popularity of Agile and Lean approaches hit the
mainstream, many new frameworks have emerged, each promising to solve your
problems if you just bought their secret sauce. Yet the pioneers in the early days
didn't get the answers handed to them; instead, they had to figure out the recipes
through trial and error. This relentless discovery and invention process led to great
leaps forward; it drove the most innovative companies in the world. So the question
is, why did we stop reinventing? When was it enough to follow guidelines rather than
constantly evolve and grow?
A common problem for organizations is to stay competitive in a constantly evolving
market. Competitors emerge fast and disrupt the playing field. To deal with this
challenge, organizations hire expensive creative agencies to run ideation workshops
in the hope that they can spark new ideas and future proof their organization. But it
doesn't stick. Bringing in someone else's creative talent that leaves when the workshops
are over doesn't breed an innovation culture.
Red Hat recognized that to help their clients innovate, a model was needed that
could be fully customized and adapted to their needs. A model that would help
organizations build their own innovation culture. To help people learn to fish rather
than fishing for them. By blending Mobius, an open innovation model, with Red Hat's
open culture, organizations can build up their innovation muscle: creating their own process, developing their own people, and applying technology in novel ways to achieve their desired outcomes faster.
With pragmatic stories from the trenches, the team at Red Hat Open Innovation Labs
has created an essential handbook. It takes you on a journey from Day One, from setting
up the workspace to practical tips on getting a team to gel and collaborate on their
real-world challenges. We get to see under the covers how the DevOps culture emerges
through stories and photos. Rather than keeping their secret sauce recipe to themselves, Red Hat is following its own principles of being open and sharing its knowledge in a pragmatic, easy-to-follow way.
This book brings together the key ingredients: the people, processes, and technology.
It's like having a great travel guide that gives you the tips you need when you need
them. I also love that the authors speak with candor and share their real-world war
stories, including the mistakes and pitfalls.
The last thing I will say is that the idea of fun is integral to the book, from the simple
how-to guides to the engaging illustrations and photos. This book is the culmination
of the learning collected along the way and I hope this book brings forth great ideas
that can help shape the future and create not only awesome products, but awesome
organizations.
Gabrielle Benefield
Founder, Mobius Loop
Preface
About
This section briefly introduces the authors, the coverage of this book, the skills you'll need to get
started, and the hardware and software needed to complete all of the technical topics.
Noel O'Connor is a Senior Principal Architect in Red Hat's EMEA Solutions Practice
specializing in cloud native application and integration architectures. He has worked
with many of Red Hat's global enterprise customers across Europe, the Middle East, and Asia.
He co-authored the book "DevOps with OpenShift" and he constantly tries to learn new
things to varying degrees of success. Noel prefers dogs over cats but got overruled by
the rest of the team.
Donal Spring is a Senior Architect for Red Hat Open Innovation Labs. He works in
the delivery teams with his sleeves rolled up tackling anything that's needed - from
coaching and mentoring the team members, setting the technical direction, to coding
and writing tests. He loves technology and getting his hands dirty exploring new tech,
frameworks, and patterns. He can often be found on weekends coding away on personal
projects and automating all the things. Cats or Dogs? He likes both :)
Learning Objectives
• Implement successful DevOps practices and in turn OpenShift within your
organization
• Deal with segregation of duties in a continuous delivery world
• Understand automation and its significance through an application-centric view
• Manage continuous deployment strategies, such as A/B, rolling, canary, and
blue-green
• Leverage OpenShift’s Jenkins capability to execute continuous integration
pipelines
• Manage and separate configuration from static runtime software
• Master communication and collaboration enabling delivery of superior software
products at scale through continuous discovery and continuous delivery
Audience
This book is for anyone with an interest in DevOps practices with OpenShift or other
Kubernetes platforms.
This DevOps book gives software architects, developers, and infra-ops engineers
a practical understanding of OpenShift, how to use it efficiently for the effective
deployment of application architectures, and how to collaborate with users and
stakeholders to deliver business-impacting outcomes.
Approach
This book blends to-the-point theoretical explanations with real-world examples to
enable you to develop your skills as a DevOps practitioner or advocate.
We recommend all readers, regardless of their technical skill, explore the concepts
explained in these chapters. Optionally, you may wish to try some of the technical
practices yourself. These chapters provide guidance on how to do that.
The OpenShift Sizing requirements for running these exercises are outlined in
Appendix A.
Conventions
Code words in the text, database names, folder names, filenames, and file extensions
are shown as follows:
We are going to cover the basics of component testing the PetBattle user interface
using Jest. The user interface is made of several components. The first one you see
when landing on the application is the home page. For the home page component, the
test class is called home.component.spec.ts:
describe('HomeComponent', () => {
let component: HomeComponent;
let fixture: ComponentFixture<HomeComponent>;
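For orientation, here is a minimal sketch of how such a Jest-based Angular component spec typically looks. The import path, the TestBed configuration, and the single assertion shown here are illustrative assumptions; the actual home.component.spec.ts in the PetBattle repository may be configured differently:

import { ComponentFixture, TestBed } from '@angular/core/testing';
import { HomeComponent } from './home.component';

describe('HomeComponent', () => {
  let component: HomeComponent;
  let fixture: ComponentFixture<HomeComponent>;

  beforeEach(async () => {
    // Declare only the component under test in the testing module
    await TestBed.configureTestingModule({
      declarations: [HomeComponent],
    }).compileComponents();

    // Create the component and trigger initial data binding
    fixture = TestBed.createComponent(HomeComponent);
    component = fixture.componentInstance;
    fixture.detectChanges();
  });

  it('should create the home page component', () => {
    expect(component).toBeTruthy();
  });
});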
Downloading resources
All of the technology artifacts are available in this book's GitHub repository at https://
github.com/PacktPublishing/DevOps-Culture-and-Practice-with-OpenShift/
High resolution versions of all of the visuals including photographs, diagrams and
digital artifact templates used are available at https://github.com/PacktPublishing/
DevOps-Culture-and-Practice-with-OpenShift/tree/master/figures
We also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
We are aware that technology will change over time and APIs will evolve. For the latest changes to technical content, have a look at the book's GitHub repository above. If you
want to contact us directly for any issue you've encountered, please raise an issue in
this repository.
Acknowledgements
First and foremost, we'd like to thank those on the front line who dealt with and are
dealing with COVID-19 and its impact. Their incredible contributions to maintaining
and strengthening our communities cannot be overstated.
We'd also like to thank those in the Open Source community who collaborate and
contribute to make all the products we use better every day. This includes the many
contributors to the Open Practice Library1, with special thanks to the individuals who
have driven the Open Practice Library from its initial idea to where it is now including
Justin Holmes, Ryan de Beasi, Matt Takane, Riley Ghiles, Jerry Becker, and Donna
Benjamin. We thank the Mobius Outcome Delivery community who have evolved an
inspiring mental model and navigators through complexity with extra special thanks to
the founder of this community, Gabrielle Benefield, for all her support in providing the
framework that we've anchored this book around.
We'd like to thank our internal and external reviewers for their effort in keeping us
on the straight and narrow, correcting us, challenging us, and driving continuous
improvement into this book - Chris Baynham-Hughes, Charley Beattie, Donna
Benjamin, Jeremy Brown, Margaret Dineen, Cansu Kavili Oernek, David Eva, Oli Gibson,
Kari Mayhew, Brid Mackey, Ed Seymour, Mike Walker, and the whole team at Packt
Publishing.
Thank you to our many colleagues and customers of Red Hat, Red Hat Open Innovation
Labs, and of our previous companies for providing us with our experiences, stories and
tips that we have shared in this book.
1 https://github.com/openpracticelibrary/openpracticelibrary/graphs/contributors
On a personal level:
In this section, we are going to introduce the book, where it came from, and how it's
organized.
Chapter 1, Introduction – Start with Why focuses on the book's purpose and the target
audience. Chapter 2, Introducing DevOps and Some Tools explains, in our words, what
DevOps is and how it helps speed up the value chain of product development. We'll
explore what this chain is and the bottlenecks that DevOps culture and practices
address. We'll introduce a couple of important tools that we'll use throughout the book
to navigate around the use of many different types of practices we're going to apply.
In Chapter 3, The Journey Ahead, we will introduce how we use real-world stories, introduce the case study we'll use throughout the book, and outline how the remaining six sections of the book are organized.
This will set us up to build a foundation and start a journey of continuous discovery,
options, and continuous delivery.
1
Introduction — Start with Why
You've picked up this book and have started reading it — thank you very much!
Perhaps you read the back cover and it gave you just enough information to be
inquisitive enough to open the book up and read some more. Maybe a friend or
colleague told you about it and recommended it to you. Maybe you have stumbled upon
it for another reason. Whatever the reason, we're very happy you've taken some time
out of your day to start reading this and we hope you get some value from it and want
to keep reading it.
Before going into any kind of detail regarding what this book is about and what it's
going to cover, we want to start with why. This is a practice we use to create a common
vision of purpose. Why have we written this book? What problems is it trying to solve
and who is the intended audience?
So, what is a practice? A practice is an activity that helps teams achieve specific goals.
It's not just an idea; it's something that you do repeatedly in order to hone or polish a
skill. Practices have the following attributes:
• Empowering: The practices in this book will help teams discover and deliver
iteratively.
• Concise: They can be read in a few minutes.
• Agnostic: Practices don't require the team to follow a specific framework.
• Proven: Practices have been tested in the real world.
• Repeatable: Practices can be used more than once.
Hopefully, throughout this book, you'll see examples of us practicing what we preach
through the experiences, stories, and tips we will share from our real‑world delivery
experience, which includes stories such as these:
• The story about when we worked with an insurance company to rebuild one of
their applications using DevOps and OpenShift but had a stop the world moment
(a practice we'll talk about in the next section) when we realized we were
redeveloping an app that users did not want and were not using!
• The story of when we worked with a European automotive company and
kick‑started modern application development and agile practices with one of
their teams, only for the product owner to question how they were going to
prove to management that this was a better way of working when management
only work with spreadsheets and numbers.
• The story of the telecom company that suffered huge outages and
non-functional problems over a festive period and were keen to learn new
cultural and engineering practices to drive an auto-scaling and self-healing
approach to their infrastructure and applications.
Red Hat's culture is relatively unique as it is entirely based on open source culture and
open organizations (of which Red Hat is one of the largest examples). This means that
the Red Hat organization is run under a set of characteristics that are closely aligned
with open source culture and philosophy. They include collaboration, community,
inclusivity, adaptability, and transparency. We highly recommend learning more
about Red Hat's open organization philosophy by reading Jim Whitehurst's The Open
Organization1.
A lot of the experience that has informed this book and the stories and tips we will share
emanate from engagements led by Red Hat Open Innovation Labs (or Labs for short). Labs
provides an immersive and open approach to creating new ways of working that can help
our customers and their teams develop digital solutions and accelerate business value
using open technology and open culture. The main offering provided by Labs is called
the residency, which is a four- to twelve-week timeboxed engagement where clients'
engineers are matched one-on-one with Red Hat's technology and culture specialists.
Between the four authors, we've been involved in over 50 Open Innovation Labs'
residencies around the world, in addition to many other professional services
engagements. Due to the relatively short nature of Labs residencies, we get to learn
very quickly different techniques, different approaches, and different practices. We
get to see what works well and what doesn't work so well. We get to build up a huge
collection of stories and tips. This book is all about sharing those stories and tips.
1 https://www.redhat.com/en/explore/the-open-organization-book
One of the key success factors behind this was injecting lots of experience and real-
life stories into our delivery and using a lot of our practices on ourselves to deliver
the course. Every time we run the course, we use the definition of done2 practice to
explain to participants that every practice we are going to teach on the course will be
presented in a consistent way, following this process:
1. Introducing the practice with the theory and an overview of what it is, why you
should use it, and how to use it
2. A hands-on practical exercise so everyone participating can leave the course having
had a go at using the practice and having gained some learning and experience
from it
3. A real-world example of the practice being used in action on a real customer
delivery project or product development initiative
The core practices taught in this course vary from discovery practices, such as
impact mapping and event storming, to delivery practices, such as sprint planning
and retrospectives. They include a set of practices we've found to be very powerful in
establishing high-performing, long-lived product teams, such as social contracts, team
sentiment practices, and mob and pair programming. They include the engineering
practices that many coming to the course would have most strongly associated with
the term DevOps, such as continuous integration, continuous delivery, test-driven
development, and infrastructure as code.
One of the unique aspects of this course is its appeal to a broad audience. It is not
exclusively for technologists or designers. In fact, we embraced the idea of having
cross-functional groups of people spanning from engineers to project managers,
from infrastructure experts to user experience designers. We felt this course offered
the opportunity to break down silos. We intentionally do not run different tracks for
different types of people. The aim is for participants to have a shared understanding of
all of the practices that can be applied to truly appreciate and enable a DevOps culture.
Having run this course more than a hundred times globally, we've learned volumes from
it and have continuously improved it as we've gone along.
Faced with the opportunity to write a new book about DevOps with OpenShift and
to apply new learnings and more up-to-date technologies from Stefano Picozzi,
Mike Hepburn, and Noel O'Connor's existing book, DevOps with OpenShift – Cloud
Deployments Made Easy, we considered what the important ingredients are to
make DevOps with OpenShift a success for any organization choosing to adopt the
technology.
2 https://openpracticelibrary.com/practice/definition-of-done/
The success factors are all based on people, processes, and technology through the
application of the many practices we've used with our customers globally and, in
particular, the kinds of practices we have been introducing and enabling through our DevOps culture and practice enablement course.
This book's purpose is to enable you to understand and be ready to apply the many
different practices — some people-related, some process-related, some technology-
related — that will make DevOps culture and practice with OpenShift a success within your organization. We wrote this book with a number of different personas in mind:
• Caoimhe, a technical lead who looks after a team of people who develop
software. She wants to learn more about DevOps so she can help adopt great
DevOps practices.
• Fionn, a project manager who is responsible for a set of legacy software
applications and wants to modernize his team's approach to make use of this
DevOps thing he's heard lots of people talking about.
• Padraig, an Agile coach who is very experienced in applying Agile delivery
frameworks such as Scrum and wants to further his skills and experience with
DevOps. He feels that this will really add value to the teams he is coaching.
• Tadhg, a user experience designer who wants to better understand what other
people in the company's development team do with his designs and how he can
collaborate with them to deliver products.
• Séamus, who is an IT leader executing his company's technology strategy to
adopt containers and cloud-native technology across the company's entire IT
estate. He has chosen OpenShift Container Platform (OCP) as the strategic
product to support this. He wants to ensure that OCP generates a fast return
on investment and that there is a large uptake across all IT teams in his
organization.
• Aroha, the CIO of the organization. She wants to ensure that the company's
people are aligned with company strategy and getting the very best out of the
technology and the organizational decisions being made to drive the strategy.
She's motivated for the business to become more agile and adapt quickly if
and when market conditions change. She wants to read about what similarly
sized organizations in different industries (including in her own industry) have
successfully done and what they saw as being the critical success factors.
• Siobhán, an infrastructure engineer who has been using Kubernetes for many
years and is now part of a team introducing OCP to her organization. She wants
to ensure that the platform is configured to support her team's goals and wants
to know how she can best work with development teams so that they get the
maximum value out of the technology.
• Eimar, a project manager who has spent two decades delivering IT projects
through up-front planning, tracking deliverables against plans, and managing
risks, issues, and dependencies with strong project reporting and stakeholder
management skills. She gets frustrated by the amount of time it takes to ship
software and not being able to address user needs and fixes quickly. She sees
the benefit of moving to a more product-centric approach rather than a project-
centric one. She would like to re-skill herself to be a product manager. In doing
this, she wants to be able to test and adapt quickly, ship deliverables quicker,
adapt to changing market conditions, and also improve performance, uptime,
recovery times, and more.
• Finn, a system tester who takes great pride in quality assuring software before
it is shipped to customers. His business analysis background helps him develop
comprehensive testing approaches and scripts and, over the years, he's also led
performance testing, security testing, and operability testing. He's keen to learn
how he can introduce more automation to his work and branch out to other
forms of testing.
From I to T to M
With this book, we want people to move away from being I-shaped, where they are
a specialist in one skill or one field. We want them to become more T-shaped, where
they still have a depth of skill and experience in a particular field (such as infrastructure
or UX design), but they also have an appreciation and breadth of knowledge across all
the other skills that people bring to make up a cross-functional team. This could be a
frontend engineer, for example, who also works side by side with the API engineer.
A great cross-functional team is one where the full team holds all the skills and
experience they need. They are empowered to take a new requirement from a user
or business stakeholder through to production. A team could be made up of lots
of I-shaped people, but this type of team quickly becomes dependent on specific
individuals who can be a blocker when they are not available. For example, if a database
change is needed to expose a new API but only one team member has the knowledge
to be able to do this, the team can quickly become stuck. If the team is full of more
T-shaped members, there is a greater opportunity for collaboration, sharing, and
partnerships across the team and less reliance on individuals.
We want this book to help I-shaped people become more T-shaped and perhaps even
become M-shaped. M-shaped people are inspired to deepen their learning, take it into
other fields, and hold multiple skills, thereby building stronger cross-functional teams.
Conclusion
This chapter presented a brief overview of why we wrote this book and who it is
intended for.
We introduced ourselves and how we will be using our applied knowledge, experience,
and learnings to write this book full of stories and examples.
We examined the different personas we are targeting in this book and how we intend
to help move these focused I-shaped people toward being more T-shaped or M-shaped to build stronger cross-functional teams.
In the next chapter, we will introduce DevOps and some tools we will use during the
book to organize and explain DevOps practices.
2
Introducing DevOps and Some Tools
Throughout the 2010s, Agile was that buzzword—This is going to be an Agile project, or
We're going to use Agile to deliver this, or We're going to use the Agile methodology were
common phrases that many of us have heard. It was (and still is) often used incorrectly
about delivering software faster. In fact, Agile is focused more around delivering
business value earlier and more frequently and driving a culture of continuous learning.
Agile has now officially grown up—it had its 18th birthday in February 2019. Even after all
this time, we still love to use the values and principles of the Agile Manifesto1 created
back in 2001.
Containers is another buzzword these days. We see it being used by individuals without
them necessarily understanding the full meaning of what a container is and why people,
teams, and organizations would benefit by utilizing them.
So, with this book being about DevOps and OpenShift (a container management
platform), we're going to de-buzzify these terms and talk about very practical,
real world experience and examples of the real value behind DevOps and OpenShift
containers.
Let's take a look back in time and see where we believe these phenomena came from.
We all have worked in IT for a number of decades (some more decades than others!).
While chatting over a beer and looking back at our experiences of delivering IT projects,
we recognized some common characteristics in all our IT projects that have been
constant. We also identified a set of gaps in the value chain of delivering IT projects
that, for us, seemed to slow things down.
1 www.agilemanifesto.org
Between us, we have helped many organizations, from the public sector and finance
to retail and charities. We've seen it all! As we reminisced, we discussed the end result
of some of our projects; we thought about our why — there was almost always some
kind of monetary value aspect associated with the reason for us being there. There
were other motivations, too, such as increased customer satisfaction, reduced risk, and
improved security and performance, but the bottom line is that an essential part of any
of our commercial customers' business is to make money and reduce costs.
So, in the end, value was often linked to money in some shape or form. Three of us
authors are Irish and the fourth is from New Zealand, so we felt it was appropriate to
reflect this as a pot of gold!
The 1990 book The Machine That Changed the World, written by James Womack, Daniel
Jones, and Daniel Roos, first introduced the term value stream. The idea was further
popularized by the book Lean Thinking, written by the same authors. According to
them, the value stream is the sequence of activities an organization undertakes to
deliver on a customer request. More broadly, a value stream is the sequence of activities
required to design, produce, and deliver a good or service to a customer, and it includes
the dual flows of information and material. Most value streams are highly cross-
functional: the transformation of a customer request to a good or service flows through
many functional departments or work teams within the organization:
Let's visualize this as our customers dreaming of that pot of gold. They're constantly
thinking about how they can get the most out of their products or ideas to generate the
most gold. So, how do they go about doing this?
The Gaps
We're going to explore the gaps in the value chain between customers and an organization's business people, between business people and development people, and between development people and operations people.
Our early projects often involved long phases of business analysts documenting
every possible requirement they could conceivably think of into epic volumes of business requirements documents. The goal was to pre-empt every conceivable
customer journey or scenario and to cover all the bases by building specifications that
included every possible eventuality. Sounds rigid, right? What if we made an incorrect
assumption?
This gap of not building the right thing has been plugged in recent years by the emergence
of human-centered design and design thinking. These are a set of practices that put the
end user at the center of capturing the needs and requirements of a product.
We gather the information by talking directly to users and forming greater empathy2
with them:
In this book, we'll explore how techniques such as impact mapping, event storming, and
human-centered design can aid the software development process. We'll also explore
other practices to help us define solutions and features and crucially ensure that the
solution is connected to business value. We'll show how the act of coupling research
activities such as user interface prototypes and technical spikes with experimentation
inform product backlogs that are well prioritized according to delivered business value.
We will show you how using just enough information can lead to a better-understood
product.
2 https://www.merriam-webster.com/dictionary/empath
A team of software designers and architects then gets to work, producing a high-level
design (HLD) that will deliver a solution or set of solutions according to the business
requirements specified. These requirements also go through a formal review process
by key project stakeholders and, once signed off, become the reference source for the
solution scope.
Often, different design documents are written in the next phase - detail design
documents, program specifications, data designs, logical architecture blueprints,
physical architecture solutions, and many more. Each of these is written to support a
defined, dated, and signed-off version of the HLD, which itself is signed off against a
defined set of business requirement specifications:
Any changes to the earlier documents have direct time and cost implications
for reassessing and updating each of the following design documents. Software
development teams may have been involved in the production or review of some of
these documents. However, they are often encouraged not to start any coding or
development activities until these designs have been locked down. Some organizations
reduce project costs by not onboarding development teams until this stage.
Development is often siloed by function and unaware of the big picture with limited
automated testing.
At a predefined point in the project plan, all developers are expected to have delivered
their coded components to a testing environment. Perhaps each developer manually
builds and deploys their own code to the testing environment. Some larger programs
seek economies of scale by setting up build infrastructure teams who do this on behalf
of all developers. Once all components have been delivered, a separate team of testers starts executing the hundreds of test scripts they have been writing in the preceding weeks and months to test the solution according to business requirements and HLD
documentation. This is the first time some components are integrated and tested
together. Of course, problems and bugs drive reworking by development teams and
designers to fix such issues.
Just as there are different levels of design documentation, there are often different levels of testing, with one starting when the previous phase is completed.
A test manager would sign off on a set of test results, signaling that the next level of
testing could start. Testing would range from component integration testing to
wider system integration testing, security and penetration testing, performance testing,
failover and operability testing, and finally, user acceptance testing!
The final stage before the big-bang go-live of a solution would often be user acceptance
testing, involving a set of focus users and the test system. In many cases, it could
often be months or years before this first user saw the implemented system. Once
user acceptance of the solution was signed off, the green light was given to deploy to
the production environment. Finally, with the software in the hands of real end users,
business revenue could hopefully be generated from all this work.
You're probably thinking that this process sounds long and drawn out — well in truth, it
was! Many programs hit delays at different points along the way and what started out as
a multi-month project plan ended up being years long. For the curious, there is even a
list of some epic failures on Wikipedia: https://en.wikipedia.org/wiki/List_of_failed_
and_overbudget_custom_software_projects.
Often, business conditions would change during the development period. New feature
requests would be generated. During testing, gaps in the requirements would emerge
that no one considered during the analysis and requirements capture. The market
didn't stand still during development and competitor companies may have started to
innovate quicker. The competition would even provide more feature requests, in a
process akin to a feature comparison war.
Of course, there was always some kind of change control procedure to handle new
scope like this. In a complex program of work, the lead time to get features added to the
work plan could range from months to years. In order to get something into production,
program executives would simply say no to any more change and just focus on getting
to the end of the project plan.
Figure 2.8: Traditional deliverables and their drawbacks failed to achieve customer satisfaction
Let's think about how we plug that second gap with more modern software
development processes. How do modern developers manage to translate user needs
into working software solutions much more quickly compared to previous ways of
working?
The formation of the Agile movement in 2001, led by the 17 IT individuals who wrote the
Agile Manifesto, has triggered alternative approaches and mindsets toward delivering
software. Many of the individuals involved in writing the Agile Manifesto had been
tackling many of the problems described by Waterfall development. Jeff Sutherland
and Ken Schwaber had created the Scrum framework for software development, which
included delivering small incremental releases of value much more frequently — they
used the term sprint, which was a fixed timebox ranging from one to four weeks
(usually being two weeks), during which a set of events and roles would work together
such that big solutions could be delivered iteratively and incrementally. Kent Beck
and Ron Jeffries led much of the eXtreme Programming (XP) movement, focusing on
delivering faster releases of value and working on key practices that helped drive more
efficiency into review, testing, and release processes, using better collaboration and
increased automation:
Figure 2.9: Implementation of DevOps practices leading to faster delivery and better products
In this book, we'll show you different software delivery practices and how our
experience using a mixture of different practices from Scrum, Kanban, XP, Lean, and
some scaling frameworks helps deliver value quicker. All the underlying practices are
simply tools to help close the gap between an idea or requirement being captured and
it being delivered. This has been an area we have sought to continuously improve to a
level where the gaps are minimized and we're working in a mode of continuous delivery.
Development to Operations
There is one more gap to plug in our efforts to optimize the software delivery process.
The third gap is the one between development teams and operations teams.
In our Waterfall process, we had reached the point where the signed-off solution exited
user acceptance testing and went through a big-bang go-live. So, what happened next?
Often, a whole new team responsible for maintenance and support would then pick
up the solution. The people working in this new team were not involved in any of the
design, development, or testing, so additional time would be built into the project plan
for knowledge transfer. The delivery team would write lengthy documentation in the
hope that this would be a useful resource for future operations teams.
At this point, the package of software would metaphorically be thrown over the wall
from the army of developers to the group of operation engineers. The operations
teams often had to learn about the software the hard way by investigating production
incidents, addressing bugs that were not found previously, and handling new scenarios
not considered during the requirement planning stage:
Figure 2.10: Aspiring to bring down the wall between development and operations teams
To plug this gap, we must bring development and operations teams together. Tear
down that wall and remove the silos! Bringing down the wall forms new teams that
are focused on development and operations activities. These teams are collectively
responsible for the whole solution and can design the solution according to each
others' needs.
The term DevOps came from the idea that we no longer have siloed development
and operations teams. In recent years, we've seen various other terms emerge from this
idea, such as DevSecOps, BizDevOps, DesOps and even BizDesDevSecOps!
Note
BizDesDevSecOps is a bit of a mouthful, so we're going to use the term product
team to describe it throughout this book. It addresses the ultimate goal of
plugging all gaps in the software development process and bringing down all the
walls.
Note that we will not use the term DevOps team — the idea of having a
team or even an individual purely focused on DevOps runs counter to what the DevOps
philosophy is all about — collaboration, cross-functionality, and the removal of silos.
How many times have you seen ads on LinkedIn or other sites looking for DevOps
engineers? The invention of the DevOps engineer or the DevOps team could be seen as
creating just another silo.
This book is about finding the right combination of people, process, and technology
changes needed to maximize business outcomes on a continuous basis. This requires
changes in mindset and changes in behavior. This book will look at the behavioral
change that we have seen be most effective with the hundreds of organizations we have
collectively worked with. We've observed that such mindset and behavioral change
is needed across all roles and that we need to break down the silos we see inside
organizations, which, as we saw previously, is what drives the gaps and inefficiencies in
software development:
3 https://github.com/rht-labs/enablement-docs
Mobius is a framework that connects discovery and delivery and can be used to connect
strategy to products to operations. The common denominator is measurable outcomes.
Mobius is used to understand, align, and share measurable target outcomes so they can
be tested and validated.
There are a number of principles that underpin the Mobius navigator:
• Outcomes over outputs: We focus on delivering tangible impacts or outcomes to
people as opposed to delivering lots of features that may not drive outcomes.
• Multi-options strategy (options pivot): We look to build a list of options, a list of
research initiatives, experiments, and implementation features that can be used
to test hypotheses about whether those research initiatives, experiments, and
implementation features will indeed drive the anticipated outcomes.
• Rapid delivery: We aim to use short iterations of delivery with regular feedback
and measurement as we strive toward the idea of continuous delivery.
• Continuous learning and improvement: This happens throughout the cycle so that our next set of options yields an even better impact on outcomes.
There are seven core elements to the Mobius approach across a continuous and
never‑ending flow. They can be visualized on a single canvas that is open source and
made available under a creative commons license at www.mobiusloop.com:
• Why describes the purpose. Why are we doing this? What is the problem we are
trying to solve? What is the idea we are trying to pursue?
• Who focuses on the end users. Who are we trying to solve the problem for?
• Outcomes are where we want to get to with these people, the changes in
their human behavior that influence big results, and how we will measure the
customer and business impacts delivered.
• Options are the potential solutions that could deliver these outcomes. They help
define the hypotheses we can go on to test and help us find the simplest way to
achieve the desired outcome with the least amount of effort or output.
• Deliver is the cycle where we run experiments to deliver a solution or set of
solutions to users so we can measure the impact.
• Measure is where we assess what happened as a result of delivering the solution
or set of solutions. We check whether the impact of the solution delivered the
desired outcomes and assess how much of an impact we achieved.
• Learn is the feedback loop that takes us back to the options pivot. We learn from
what we delivered and assess what to do next. Have we delivered enough to
make an assessment? Do we go right back around the delivery loop again? Have
we reached our target outcomes or invalidated assumptions from our learnings?
Do we return to the discovery loop?
Personas such as Tadhg, our user experience designer, would typically spend a lot
of time in the discovery loop. Personas such as Caoimhe, our technical lead, would
traditionally be focused on the delivery loop. Personas such as Fionn, our project
manager, would typically spend a lot of time here establishing outcomes and gathering
options. But, as we seek to move to cross-functional teams of T- or M-shaped people,
we really benefit from everyone being involved at every stage of the Mobius loop. And
Mobius creates a common language based on targeted measurable outcomes.
You can apply the same principles of outcome-driven thinking for strategy, product,
and services delivery to enabling business and technical operations — we'll return to
this idea later in the book.
Mobius is powerful because it's framework agnostic. It integrates with many existing
frameworks and methods you may already be familiar with — Scrum, Kanban, design
thinking, Lean UX, Business Model Generation, Lean startup, and many other great
frameworks that have surfaced during the last couple of decades. You don't have to
reinvent the wheel or replace everything you already like and that works for you.
You can capture key information on a discovery map, an options map, and a delivery
map — all of these are open source artifacts available under Creative Commons at
www.mobiusloop.com:
Figure 2.14: Using the Discovery, Options, and Delivery canvases of the Mobius loop
When Red Hat Open Innovation Labs started using Mobius, we placed all of our
practices around the Mobius loop. Some practices clearly aligned with the discovery
loop and, in particular, the Why & Who end of the discovery loop. Practices such as
impact mapping, start-at-the-end, and empathy mapping are great at uncovering the
answers posed in this section of the loop. We'll get into the detail of these practices in
subsequent chapters of this book.
Practices such as event storming and user story mapping were very helpful in
establishing and visualizing outcomes on the other side of the discovery loop. Again,
we'll look at these practices in detail and share some great examples of their effect.
Practices such as design sprints, how-might-we, and product backlog refinement would
help determine and organize the series of options available for driving toward
outcomes.
Practices such as sprint planning would help plan and execute the incremental delivery
of products toward outcomes. We'll explore these iterative delivery practices and how
different Agile frameworks can be used with Mobius.
Practices such as showcases and retrospectives would help with capturing measure-
and-learn data from incremental delivery.
We still had a large number of practices that we did not feel naturally fitted into one of
the loops or the options pivot. When we laid out all of the remaining practices that we
had all used with numerous customers very effectively, we found they fitted into one of
two areas. One set of practices were all focused on creating culture and collaboration.
The other practices were all technical engineering practices that supported the
concept of continuous delivery.
When explaining these practices to others, we talked about these being very important
practices to put in place, but not necessarily practices that you would schedule. For
example, you will learn that practices such as impact mapping on the discovery loop are
important scheduled workshops that you execute and occasionally revisit in the future.
Practices such as sprint planning, showcases, and retrospectives on the delivery loop
are also tightly scheduled when working in an iterative delivery framework. But the
practices associated with culture and collaboration or those associated with technical
engineering were more like practices that you use all the time, continuously.
Practices such as social contracts and definition of done are not one-time-only
practices where you bring the artifact out on a schedule. These are living and breathing
artifacts that teams use all the time in their day-to-day work. Likewise, continuous
integration, test automation, and infrastructure as code — these are not the types of
practices you schedule one or two times a week. These are practices that you do all the
time. They are practices in the foundation of where and how we're working. In order
to effectively practice continuous delivery and continuous discovery as presented by
the Mobius loop, we need to have a strong foundation of culture, collaboration, and
technical engineering practices.
To visualize this, we added the foundation to the Mobius loop:
This graphic has become a simple visualization tool that helps us navigate the
ever‑growing list of practices and techniques we use to achieve continuous discovery
and continuous delivery of digital products:
Figure 2.16: Practicing continuous discovery and delivery through the Mobius loop
Open Innovation Labs residencies involve traveling around the Mobius loop a few times, usually starting from discovery before proceeding to delivery and then pivoting a few times, either to more delivery or back to discovery, on a continuous basis. We
find that, in order for this to be sustainable, you must build a foundation of culture
and collaboration and you must build a strong foundation of technical engineering
practices.
Open Innovation Labs kick-started an open source, community-driven project called
the Open Practice Library. The Open Practice Library is a community-driven repository
of practices and tools. These are shared by people currently using them day-to-day, for people looking to be inspired by new ideas and experiences.
All of the practices you read about in this book have been contributed to the Open
Practice Library and, throughout the book, we will use the Mobius loop and the
foundation of culture, collaboration, and technical practices as a reference point to
determine where and how all our open practices fit together to deliver great DevOps
culture and practice with OpenShift.
An important characteristic of Mobius and the Open Practice Library is that they are not prescriptive. Neither is a methodology, and neither tells you exactly which practice to
use when and where. Think of the Open Practice Library as a box of tools — a really
well‑organized toolbox with lots of compartments and shelves. The practices have
been organized into compartments that help with discovery and, in particular, the why
and who, followed by deriving outcomes. There is a drawer containing all the tools
that help form, organize, and prioritize options and how to pivot later in the cycle.
There is a portion of the toolbox with all of the tools that help with delivery — whether
that be iterative and incremental delivery associated with agile practices or single
delivery associated with Waterfall. There are tools to help capture and understand the
measurements and learning from delivery. Finally, there is a huge drawer of tools used
to establish culture, collaboration, and technical engineering excellence. These are
often the first tools we go to grab when starting a piece of work.
Conclusion
In this chapter, we introduced the value chain in software product delivery and
explored how traditional ways of working brought inefficiencies, bottlenecks and gaps
between users, business stakeholders, development teams, and operational teams.
We explored some of the techniques that have been used to plug these gaps and how a
balanced focus on people, process, and technology is needed by all involved.
Finally, we introduced the open-source navigator tool called Mobius that connects
discovery and delivery in an infinite loop and can connect strategy to product to
operations with a common denominator of measurable outcomes. The Open Practice
Library uses Mobius on a foundation of culture and technology to navigate an
evolving number of open practices — many of which will be explained in subsequent
chapters.
In the next chapter, we're going to outline how we'll approach the rest of the book by
introducing our case study and the structure for the remaining sections.
3
The Journey Ahead
As we conclude the first section of this book, this chapter will explain the journey we
intend to take you through the remaining sections.
This will include how we intend to not just tell you about practices and techniques but
also show them in action and apply them. We'll introduce a fun case study and real-world stories to do this.
One of the challenges of writing a book intended to be read by a diverse group of
people with different skill sets and backgrounds is how to write it in such a way that
means it can be consumed, understood, and appreciated by all. From tech leads,
infrastructure engineers, and OpenShift specialists, to Agile coaches, user experience
designers, and project managers, to IT leaders and CXOs, we want you to grasp a
shared understanding of what's behind all the practices being taught and the principles
that underpin them.
The topics covered are going to range from how to capture behaviors in an empathy
map using human-centered design practices to considering observability within
applications using performance metrics. It will look at ways to help product owners
prioritize value versus risk while also addressing instrumentation for applications,
image tagging, and metadata!
Similar to the definition of done practice we use on our DevOps culture and practice
enablement course, we're going to use a few different approaches in this book to help
you with your journey:
1. Explaining the culture and practice
2. Showing the culture and practice
3. Applying the culture and practice
To explain the culture and practice, we will introduce what the practice is and why and
where we've chosen to use it, and give some guidance on how to use it. In some ways,
this is the easy part.
We have a saying among us that we prefer to show, not tell. It's easy to research and
write a load of words. It's far more compelling to visually show a practice in action and
the effect it is having. To show the culture and practice, we have a few techniques:
1. As much as possible, we'll aim to make use of visualization techniques such as
sketch notes, diagrams, and other charts. You will have seen a few of these,
beautifully drawn by Ilaria Doria, in this section already, and hopefully, they have
helped bring the words to life a little.
2. Where we can show you a practice in action through photographs or reproduced
artifacts, we will do so. Where possible, we have made the diagrams and other
visual artifacts open source, and they are available at https://github.com/
PacktPublishing/DevOps-Culture-and-Practice-with-OpenShift/.
3. We find stories and real-world examples the best way to explain a practice and the
value it brings. So, from time to time, we will break away and tell a story that one or
more of the authors have experienced connected with these practices. We'll visually
tell these stories on their own by having a box around the story. Let's start with
one now:
Finally, we want to apply some of the culture and practices for real. To help us do that,
we are going to use a simple, fun case study about a small start-up organization going
through some of the challenges and hurdles associated with creating a DevOps culture
and establishing DevOps practices. This story will represent an anonymized account of
some of the things we've seen in the field with our customers using these practices.
We'll regularly return to this story of applying DevOps culture and practices using
OpenShift in shaded boxes. Let's get this rolling with the backstory — we hope you're
ready for this!
PetBattle is a hobbyist app, started for fun, hacked around with so that
the authors can Cat versus Cat battle each other in a simple online forum.
A My cat is better than your cat type of thing. There are very few bells
and whistles to the initial architecture — there is a simple web-based user
interface and an API layer coupled with a NoSQL database.
1 https://en.wikipedia.org/wiki/Cats_and_the_Internet
PetBattle begins life deployed on a single virtual machine. It's online but not
attracting a lot of visitors. It's mainly frequented by the authors' friends and
family.
While on holiday in an exotic paradise, one of the authors happened to
meet an online influencer. They date, they have a holiday romance, and
PetBattle suddenly becomes Insta-famous! Nearly overnight, the number
of players increases drastically, the PetBattle server crashes,
and malicious pictures of not-cats start appearing on the child-friendly
application.
Back from holiday, the authors suddenly find themselves needing to
earn a living from PetBattle and decide that developing a business and a
production-ready version of the hobbyist app is now a viable thing to do.
The scene is set for the PetBattle founders to go on an exciting journey embracing
DevOps culture and practice with OpenShift.
Borrowing Brilliance
Before we start to dive deeper into the detail, we should take a moment to point out
that we did not write or dream up any of the practices in this book. The practices in
this book and in the Open Practice Library are a growing list of contributions of some
brilliant minds. We have borrowed that brilliance and will attribute it to the brilliant
minds that came up with it. We hope we have attributed everyone correctly and any
omissions are purely accidental.
What we have attempted to do with this book is show how these practices, when
connected together, have delivered some highly impactful outcomes for organizations
and show some of the underlying principles needed to enable those outcomes.
From a technical foundation perspective, we're going to share some of our most
successful approaches, including the visualization of technology through a big picture,
the green from go philosophy, and how we treat everything as code. We'll introduce
some of the baseline metrics we've used to measure the success and impact of DevOps
culture and practices. We'll even set the scene for some of the technical practice
trade-offs and approaches to consider when creating your foundation: GitFlow versus
trunk-based development, setting up development workflows, considering different
types of testing, and setting up an environment for pairing and mobbing.
To show and not tell, establishing the foundation is about turning the picture on the left
into the picture on the right:
Section 3 — Discover It
Here, we'll dive into the discovery loop of Mobius and look at some of the best ways
to use it. We'll share some of our favorite and most impactful practices from the Open
Practice Library that have helped us in the discovery loop, including impact mapping,
human-centered design, and event storming.
We'll look at how this relates to technology and the idea of emerging architecture and
enabling true continuous delivery.
From a business perspective, we'll explore the difference between outcomes and
outputs and how we're trying to move from the idea of more features being better to
creating powerful outcomes with fewer features. We'll explore some practices for how
we can continuously measure outcomes and how we can radiate information from the
entire discovery loop on open source canvases.
To show and not tell, we'll look at moving discovery from looking like what you see on
the left to what you see on the right:
Figure 3.3: Practicing discovery through impact mapping, human-centric design, and event storming
Section 4 — Prioritize It
Here, we'll dive into the options pivot of Mobius and see why living, breathing, and
always-changing options are important. We'll explore practices such as user story
mapping and value slicing that help us with this and share some of the gotcha stories
we have of where this has been misunderstood and misused. We'll look at how we go
about building that initial product backlog using discovery that leads to options pivot
practices. We'll look at different types of items that end up in product backlogs, which
range from research work to experimentation work and implementation work. We'll
look at some economic prioritization models and how to assess the trade-offs between
value and risk with the mindset of continuous experimentation and continuous
learning. We have lots of stories to share — some with a specific focus area and some
with a thin thread of learning across many areas.
To show and not tell, we'll see how prioritization can go from looking like what's on the
left to what's on the right:
Section 5 — Deliver It
In this section, we'll look at Agile delivery and where and when it is applicable according
to levels of complexity and simplicity. We'll also look at Waterfall and the relative
merits and where it might be appropriate. We'll explore different agile frameworks out
there and how all of them relate to the Open Practice Library and Mobius loop. We'll
explore the importance of visualization and of capturing measurements and learning.
Technology-wise, we'll look at how advanced deployment techniques now available help
underpin some of the experimentation and learning approaches being driven.
To show and not tell, we'll see about getting delivery from looking like the picture on
the left to something like the picture on the right:
To show and not tell, we'll explore taking building and running from being what you see
on the left to what you see on the right:
Like most organizations, we had to pivot our own work and modify our own ways of working with
our customers during the pandemic. Regardless of preference or prior thinking on
distributed working, we now had to explore and invest in distributed working practices.
For us personally, this meant launching the Open Innovation Labs Virtual Residency
and providing other services remotely to distributed teams.
When we released this first section of this book as an early preview for feedback, we
were strongly encouraged by our readers to explore distributed use of practices more
than we were originally planning to. So, we will. In each of the sections of this book,
we’ll include a section to share our experience of applying practices with distributed
teams. This will include stories and experiences from the field during 2020 when we
did this and the many learnings we got from doing so. Where relevant, we’ll also include
details and links to digital templates that have supported us.
A note of caution: just investing in a tool like Mural or Miro and having access to a
bunch of templates will not enable you to carry on as normal with all the practices and
techniques you may have used in a room. We've already outlined the importance of
getting the balance of people, process, and technology change right to achieve a successful
DevOps culture and practice. When switching to distributed mode, teams need extra
and continued focus on people, process, and technology.
We strongly believe that open culture and open practices using open technology make
the best cocktail for successful transformation.
Conclusion
In this chapter, we introduced PetBattle and the backstory of the hobbyist app that will
form the fun case study we'll use throughout this book.
We also introduced how we'll regularly break out into real stories and examples from
work we've done with our customers.
Finally, we set out the remaining sections of the book and what we'll explore in each of
those sections.
Our introduction is complete. Let's start working our way round the Mobius Loop and
exploring our most used practices. Before we get onto the loop, we're going to build the
foundation the loop will stand on. In the next chapter, we'll start by building the very
important foundation of culture.
Section 2: Establishing
the Foundation
In Section 1, Practices Make Perfect, we introduced DevOps and the practices and tools
we're going to use to navigate around the Mobius Loop, which we also introduced.
Before we get onto the loop, we're going to build a foundation for the loop to stand on.
This is a foundation focused on building culture and technology:
When you hear the word Foundation, what do you think of? A lot of people will think
of the foundations of the building you are currently sitting or standing in. As you
know, you cannot just turn up and start building a house on top of some land. I mean,
theoretically, you could. But not if you wanted something to stand the test of time and
more importantly meet building regulations! First, you need to build a solid foundation.
That means going beneath the surface of where you're going to build. The taller the
building, the deeper and stronger the foundations need to be.
When you think about it, this kind of applies to anything in life. When you lay a solid
foundation, incrementally building on top of it has a much higher chance of success.
Think about your education, your health, your fitness, your career, and your life. The
successes you enjoy are because of the foundations and what you have been able to
build on top of them.
What happens when you build on top of a weak foundation? Well, generally, it's not
good news:
When we are on the Discovery Loop, we identify and use practices that help us answer
the question of Why – why are we embarking on this journey? What problems are we
trying to solve? Who are we trying to solve them for and what do we know about them?
What is our great idea? We also use practices on the Discovery Loop to help us identify
and set target measurable outcomes for the business and their customers.
When we are at the Options Pivot, we use practices to identify how we are going to
achieve measurable outcomes. What are the ideas we have that we could implement to
help us get there? What are the hypotheses that have resulted from our discovery that
we can test, run experiments on, and conduct research? How can we prioritize these
options so we deliver value more quickly?
When we are on the Delivery Loop, we are using practices to do the work identified
on the Options Pivot – implementing the features, running the experiments, and
conducting the research. We crucially also use practices that allow us to take
measurements and capture learning about the impact of our delivery. And, as we
return back into the Options Pivot, we assess what we should do next based on those
measurements and learning.
The Mobius Loop is this infinite, continuous journey of continuous discovery, Options
Pivots, and continuous delivery of measurable business outcomes that matter. We use
practices on the Mobius Loop that typically have defined start and end points. The
practices are designed to help a team advance around the loops – for example, in their
Discovery process, to make their way to options and a decision point.
When creating the Open Practice Library – a toolbox to store all of these practices –
we found that many of them did not necessarily fit within the Discovery or Delivery
Loops. Their nature was ongoing, or continuous. For example, we use many practices
and techniques that help establish and drive a culture of collaboration. We use tools
that help increase the energy, autonomy, and empowerment of product teams. We use
practices that help drive an environment built on information radiation, transparency,
and continuous learning. The concept of the Foundation beneath the Mobius Loop
was added to explain these kinds of practices. These practices are designed to make
it easy for teams to do the right thing by default. They are practices that we use to
build a foundation of culture. There are also many technical practices that we use
on an ongoing basis. The first instinct might be to associate these practices with
delivery and argue that they should belong to the Delivery Loop. However, there is a
subtle difference given that practices sitting on the Delivery Loop tend to be time-
boxed, scheduled bursts of activity that help drive delivery. Practices such as Sprint
Planning, Showcases, and user acceptance testing events (all of which we'll explore in
detail in Section 5, Deliver It) tend to be scheduled at a particular time in the week or
iteration. There are a host of practices that you would not schedule in the same way
that you would, for example, a Sprint Planning session. These include practices such
The answer is no. The foundation of the Open Practice Library is a growing selection
of tools and practices that will help grow an open culture. You don't need to use them
all but you should start with at least a few. You may even have a few of your own! Ask
yourself whether you are achieving an open culture and whether the tools you are using
are working well for you.
4
Open Culture
Many development teams, operations teams, and even cross-functional product teams
have a tendency to jump straight into the technological aspect of the project. Let's set
up pipelines, let's automate our infrastructure, let's write some code! It's such a natural
temptation to dive straight in without investing at least some time in establishing a
cultural foundation. It can also lead to ineffectiveness and unnecessary complexity.
Don't worry, we will get to the technology in Chapter 6, Open Technical Practices –
Beginnings, Starting Right. But, before we do, we want to talk about open culture and
open leadership. In this chapter, we are going to explore what open culture is and why
it proves to be such a critical enabler for development and operations.
Open culture stems from a performance-oriented approach to collaborative work.
It provides motivated teams of individuals with an environment where they can be
continuously inspired to work together and feel a sense of autonomy, mastery, and
purpose. We are going to explore what these three elements entail and provide some
good and bad examples of them.
We're going to dive into our Open Practice Library and explore some of our favorite
practices that we have used to establish a foundation of culture, such as social
contracting, stop-the-world cords, and real-time retrospectives. These are practices to
help establish team identity, understand team sentiment, and visualize all of our work in
a very transparent manner.
We have an array of stories to share about using these and other practices and we'll
return to our PetBattle organization to see how they go about establishing an open
culture as they move from being part-time hobbyists to a high-performing
team ready to take over the pet battling world!
Why Is It Important?
Culture refers to shared customs or behaviors among members of a group of people.
Open culture refers to the custom of defaulting to open principles of transparency,
collaboration, community, inclusivity, and adaptability. Studies have shown that high-
performing teams need psychological safety, and open culture reinforces the behavior
patterns that allow for this.
What do we mean by psychological safety? William Kahn introduced this term in
1990,1 explaining that psychological safety was experienced as feeling able to show and
employ one's self without fear of negative consequences to self-image, status, or career.
Dr Amy Edmondson provided further explanation in her 1999 paper,2 where she stated,
Psychological safety is a belief that one will not be punished or humiliated for speaking
up with ideas, questions, concerns, or mistakes. Her paper got picked up by Google's
Project Aristotle in 2013. That Google study found that
psychological safety was (somewhat surprisingly) the most important factor in building
high-performing teams.
Tom Geraghty, Transformation Lead for Red Hat Open Innovation Labs, recently
redefined this as a part of his own research and transformation services work with
his customers. He explained, In a group context, psychological safety manifests as the
belief that other members value your contributions and concerns and will not harm you,
actively or passively, for expressing them. It creates space for group members to take
calculated risks, admit vulnerability and acknowledge mistakes without fear of negative
consequences.
Open culture and psychological safety are essential ingredients of any organization.
They are the enablers of an environment where people and teams feel a sense of trust,
with the freedom to explore new ideas, conduct experiments, learn by doing, and share
freely with colleagues, peers, leaders, customers, users… in fact, everyone!
2 Edmondson, A. (1999). Psychological Safety and Learning Behavior in Work Teams.
Administrative Science Quarterly, 44(2), pp. 350-383.
Conversely, a closed culture has blockers and barriers all over the place that prevent
these kinds of behaviors. Teams fall back into protective and secretive mode. When
there is a fear of sharing bad news, or the idea of asking for feedback fills someone
with dread, we end up in an environment where we protect what we're doing from the
outside world. We don't want to share our code with others in case they shoot us down
or ridicule us for not getting it right. We don't want to show our managers when we are
behind the plan for fear of being mentally beaten up and being told we have to work
nights and weekends to get back on track. We don't want to show features to business
people or end users because they might tell us that what we've done is not right, which
will trigger more re-work (and more angry managers).
By creating these types of environments, what we are really doing here is delaying the
inevitable. Think about how many IT programs have run into problems that have only
surfaced toward the end of the development timeline. Teams suddenly have to work
above and beyond to meet a deadline. What was missing was an open, honest, and
transparent view of work progress as the project developed.
Have you ever heard of the watermelon effect on IT projects? Nice and green and
healthy on the outside but when you start to peel back the skin and look inside, it's red
everywhere! An example we've seen is where team leads and project managers write
reports with their own RAG (Red, Amber, Green) status, and these all get passed up to
a more senior project manager who summarizes everyone's reports with their own
summary RAG status. That summary gets passed up again to even more senior executive
steering governance, where an even higher-level (and more diluted) set of information
with its own RAG status is presented to senior customer stakeholders.
Information Radiators
The best way to encourage an open culture is by having information openly available to
all team members and other interested stakeholders.
The term information radiator was coined by Alistair Cockburn for any of a number
of handwritten, drawn, printed, or electronic displays that a team places in a highly
visible location, so that all team members, as well as passers-by, can see the latest
information at a glance. In Cockburn's book Agile Software Development, written in
2001, the information radiator formed part of an extended metaphor that equated the
movement of information with the dispersion of heat and gas. We'll look at the many
different types of information radiator we use with our customers and teams – counts
of automated tests, velocity, incident reports, continuous integration status, and so on.
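As a small, concrete illustration, here is a minimal sketch of what an always-on continuous integration radiator might look like running on a wall-mounted display or spare monitor. It is written in Python, and the endpoint URL and the JSON fields it reads are illustrative assumptions only, not the API of any particular CI tool; substitute whatever your own pipeline actually exposes:

# A minimal information radiator sketch: poll a CI server for its latest
# build status and print a glanceable red/green summary on a loop.
# STATUS_URL and the JSON keys below are hypothetical, for illustration only.
import json
import time
import urllib.request

STATUS_URL = "https://ci.example.com/api/latest-build"  # hypothetical endpoint
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"   # terminal colors

def fetch_status():
    # Fetch the latest build summary; assumed keys: passing, tests_passed, tests_total.
    with urllib.request.urlopen(STATUS_URL, timeout=10) as response:
        return json.load(response)

def render(status):
    # One glanceable line: green when passing, red when failing.
    passing = status.get("passing", False)
    colour = GREEN if passing else RED
    print(f"{colour}Build {'PASSING' if passing else 'FAILING'} | "
          f"tests {status.get('tests_passed', '?')}/{status.get('tests_total', '?')}{RESET}")

if __name__ == "__main__":
    while True:  # keep the information warm by refreshing every minute
        try:
            render(fetch_status())
        except OSError as error:
            print(f"{RED}Radiator could not reach CI: {error}{RESET}")
        time.sleep(60)

The script itself is beside the point; the behavior is what matters. The latest status is pushed into the team's line of sight continuously, rather than waiting for somebody to go looking for it.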
We also sometimes use the term information refrigerator – where information
becomes cold and sometimes out of date. The information refrigerator is somewhere
you need to go looking for information rather than it being readily available and
accessible. The refrigerator may even be locked! Information stored in files on shared
drives, JIRA repositories, and other digital tooling all risks becoming refrigerated, so it's
up to us to keep it warm and radiating.
When we have an open culture, we have the opportunity to regularly inspect the
health of everything – the health of the product, the teams, the investment, the
stakeholders, the users, and the technology. An open culture means that when we do a
health inspection and it does not give us positive news, we welcome the bad news with
engagement and conversation. We do not berate people for giving bad news. Instead,
we talk about what we have learned and what we need to do to make things better.
What do we need to change? What should we adapt? Inspection and adaptation are
healthy indicators of an open culture. The more open the culture, the greater freedom
we have to inspect and adapt as we go along.
Before we get into a few specific practices that help build an open culture foundation,
let's share a few thoughts on what we mean by that word culture and how motivation
within teams can really drive culture.
Culture
Culture is a mysterious energy. You can't see it but you can certainly feel it. You can feel
when it's particularly strong within a team or in a physical space. If you've ever known
what great open culture feels like, you'll also quickly know when you can't feel it.
We often liken open culture to the force in Star Wars. The force was strong in young
Luke Skywalker in Star Wars IV – A New Hope: he wasn't wearing a t-shirt that said so,
but others could feel it when in his presence. Open culture is like that. You'll know it
when you've got it and you'll know when it's getting stronger. Your job is to regularly
assess, sense, and check how strong the open culture is and, if you feel it could be
stronger, explore more cultural foundation practices to strengthen it. We'll explore
different techniques and practices that help measure and learn from the impact of
these changes in Chapter 13, Measure and Learn.
In the remainder of this chapter, we're going to share a handful of practices we've had
the most success with in establishing a cultural foundation during Open Innovation
Labs residencies.
Motivation
According to Dan Pink, author of the number 1 New York Times bestseller Drive: The
Surprising Truth About What Motivates Us, people are not motivated in the ways we
expect. He argues that Organizations historically have incentivized employees the wrong
way by offering rewards (money) and creating a culture of fear and punishment for
underachieving. When work requires any cognitive skill or knowledge, then these methods
do not work.
Through his research, he shows there are three things that motivate people beyond
basic task completion:
• Autonomy: The desire to be self-directed and the freedom to take ownership
• Mastery: The desire to get better at something
• Purpose: The desire to do something that has meaning or is important to you
Creating an open culture in your organization should embody these principles. Open
source software development is built on the pillars of autonomy, mastery, and purpose.
Examples include using open source code bases, collaborating on public discussion
forums, and having transparent decision-making records; these practices make it
easier for people to find purpose, gain expertise, and take ownership of their work.
It eliminates dependencies on people and on scarce or hidden information. People
across the globe are willingly contributing to open source projects such as OKD (the
community distribution of Kubernetes that powers OpenShift), thus improving the
product for all. Sometimes they are contributing because it is part of their paid work,
but quite frequently they contribute because they want to. They are working on these
projects because of the deep personal fulfillment that they gain through helping a
community build something better that benefits everyone, being self-directed or
listened to, and building something for the future; they also want to become better
coders by implementing and testing something on a public backlog.
In order to build this kind of ethos in your organization, leadership needs to set the
environment for people to achieve these things. They should empower teams to make
decisions and adopt self-organization and self-correction, thus providing autonomy.
They should embrace and encourage continuous improvement and enable team
members to continuously learn and improve themselves and their team, providing
mastery. Leadership should remove obstacles and silos that prevent team members
from connecting with and appreciating business purpose. They should enable regular
and direct access to business stakeholders, customers, and leadership to drive up the
sense of purpose in their work. Allowing employees to focus on autonomy, mastery,
and purpose will have a twofold effect. Teams will have the freedom and drive to create
brilliant products. With greater responsibility for the product, people will start to work
harder because they truly believe they are listened to and are connected, and they will
want to work on making the product better.
Let's get back to our PetBattle case study and see how some of the employees felt
autonomy, mastery, and purpose. Specifically, let's look at a few events that happened in
the early days of establishing PetBattle that helped achieve this open culture.
switch on the feature. Aidan thinks the team should look to explore more
advanced deployment capabilities like this.
Dan Pink, author of Drive: The Surprising Truth About What Motivates Us, gave an
awesome talk at the Royal Society for Arts (RSA) in 2010 all about what really motivates
us. The RSA produced a short video3 from this talk, which we've shown
to many teams and leaders. It's openly available on YouTube and we really recommend
watching it at least once to explore this area of autonomy, mastery, and purpose
further.
At Red Hat, autonomy, mastery, and purpose are values that underpin much of our open
culture. We're going to explore tactical practices that help create this. As we move into
product discovery and product delivery, you will find that all practices here are also
helping realize this culture.
Social Contracts
We, the four authors, are in a very fortunate position where we get to help kick-start
new product teams. We use a range of practices to help bootstrap these teams and get
them off to the best start possible. We've come to realize that one practice in particular
is foundational – social contracts.
3 https://youtu.be/u6XAPnuFjJc
The social contract practice is a simple yet highly effective way to enable team autonomy
and self-accountability for engagements. A social contract is created by and for the team.
It looks to codify the behaviors and expectations of the team. It also provides a mechanism
for the team to visualize and share its desired behaviors with management and other
stakeholders.
A social contract is not some big expensive consulting engagement but a simple set of
guidelines and behaviors put forward by a team for how they would like to interact with
one another. It promotes autonomy and self-governance within a team.
One thing we've learned from previous engagements is that if we don't invest time
and use tools such as social contracts upfront with new teams, we risk running into all
sorts of cultural, teaming, and communication problems over time. Figure 4.3 shows an
example of a social contract.
Once some items are in the contract, the group signs it and hangs it somewhere high
and visible. It is now the responsibility of the team to abide by it and call out when others
do not.
The preceding example is from a cyber security company we worked with for 4 weeks.
This team had a mix of developers, designers, and site reliability engineers. Some of the
things they included were as follows:
• Core Hours (10.00 to 16.00): This is the team's agreed collaboration time. This
is not the hours they work, but more the time in which they will have sync ups,
meetings, or pairing and other collaboration. On this engagement, one of the
team members wanted to miss the early morning traffic, so if she came in a tiny
bit later her commute would take her a lot less time. Also, another member of
the team had childcare responsibilities so getting out at a reasonable hour would
have made his life easier.
• Mob to Learn / Pair to Build: This is a simple mantra that describes how the
team wanted to interact when writing code, tests, or even documentation. If
there are new things to be tackled, such as that scary new microservice in a
language or framework that's new, do it as a full team. Get everyone on the same
page from the beginning and ensure that you're not creating individual heroes
with all the knowledge in the team. Pair on implementing features to raise skills
across the board.
• Have a Weekly Social: Celebrating success is an important thing for lots of
teams. Taking the time as a group to get away from our desks, socialize together,
and eat together helps build positive relationships. These events can lead to
improving team morale and creating a positive culture around the people
working on a product.
Now we've considered the key tenets of social contracts, let's look at how to integrate
them with other workplace practices.
To support gaining consensus on ideas and behaviors, use grouping techniques such as
Affinity Mapping, the Fist of Five, and Dot Voting. These are simple yet very powerful
practices and open facilitation techniques that help drive alignment, consensus, and the
inclusion of all people involved. We will explore them in more detail in the next chapter
about open leadership.
When the social contract is first created, it represents a snapshot in time. It represents
a best guess of how a team should interact and is often written when we know the least
about each other or our habits. It is a useful tool to accelerate a team to a comfortable
position built on trust and psychological safety. However, it is not a fixed or static
thing. Items in there could be invalid or possibly missing. A great thing a Scrum Master
or Agile coach can do is bring the social contract along to a team's retrospective
(a practice we will explore in more detail in Section 5 - Deliver It). It provides a great
opportunity for the group to inspect and adapt it, possibly updating it with new ideas
or improvements.
It's OK to Be Wrong
When working with the cyber security company,
we found a great opportunity to update our
social contract. There were two developers
on that team who were quite hot-headed and
both always liked to be right. Often, they could
not agree on the approach to take for any
given solution, which was creating a toxic vibe
within the team and reducing morale. Through
a retrospective, the team identified this issue.
After some discussion and getting the issue out
into the open, we updated our social contract with one simple phrase: It's
OK to be wrong.
This had a profound effect on the team going forward. The competition
within the team started to evaporate as we could focus less on who was
right and more on does the solution work for our end users. We looked for
opportunities for both developers to have their voices heard by trying more
than one implementation of a feature. After building both solutions and evaluating
each, the two developers started to pair on building out each other's
solution, creating a better bond between them. This also had the
net benefit of producing a much better solution, often combining ideas from
both individuals.
Social contracts are a powerful and easy-to-implement practice – they promote team
autonomy, self-governance, and psychological safety. If your team does not have one,
it's not too late to build one. Most importantly, keep it visible to the team and don't
forget to revisit it when needed.
Consider having a pre-canned social contract that is good for remote meetings, then
allow people to add/change based on their thoughts:
• Everyone should sign their name with a sticky note or virtual pen.
• Add remote working elements to the social contract (for example, mute on entry
to calls, turn your webcam on, and so on).
• Establish global communication norms and individual teams' exceptions:
response time, writing style, tone, and so on.
Some good examples of communication norms we've seen in our virtual engagements
include the following:
• One conversation at a time.
• Assume positive intent.
• Do not delete or move others' contributions.
• Stay focused on our goal, working iteratively.
• Respect breaks.
• Avoid other distractions such as email and chat.
• A contract can be updated based on needs.
Figure 4.7 shows an example social contract created by a distributed team, with their
suggestions and signatures:
You can learn more about the social contract practice by going to the Open Practice
Library page at openpracticelibrary.com/practice/social-contract.
Creating a social contract over a video call with everyone remotely takes an extra level
of facilitation skill. There are some great tips for facilitating this and other practices
remotely in a blog post contributed by Ilaria Doria and Marcus Maestri at https://
openpracticelibrary.com/blog/facilitation-tips-for-remote-sessions/.
4 https://itrevolution.com/kata/
Stop the World
The process of stopping a system when a defect is suspected goes back to the original
Toyota Systems Corporation and something called Jidoka. The idea behind Jidoka is
that by stopping the system, you get an immediate opportunity for improvement, or to
find the root cause, as opposed to letting the defect move further down the line and
be left unresolved. The Jidoka concept was pioneered by Sakichi Toyoda, known as the
father of the Japanese industrial revolution and founder of the original Toyota Systems
Corporation (before it manufactured automobiles).
We use the Stop the World event during our engagements with customers. When
somebody on the team identifies that the engagement is off-course, the full team works
together to find a solution that is actionable and acceptable to all and progress can then
resume.
There are a lot of hidden benefits to using Stop the World events within your team. In
the Toyota factory, the line manager would physically go and see the station where the
cord had been pulled and ask: how can I help you? Immediately, the issue is treated as a
priority and, by going straight to the area where the problem was first raised, the process
becomes evidential and fact-based. By thanking the team member who pulled the cord,
factory management encouraged a safety culture; they were effectively saying, You have
saved a customer from receiving a defect. Whenever your application tests fail the pipeline,
think of this! At its core, Toyota believed that failure created learning opportunities and
was therefore a good thing.
Riley pulled the Andon Cord, which, in this space, was a big bell. He stopped
the world. It was loud and disruptive and intended to interrupt everyone
and grab their immediate attention. Everyone stopped what they were doing
and gathered in a circle. Riley explained his concerns and everyone agreed
that this approach was not going to help much with the target outcome
of improving the customers' overall user experience if they just designed
exactly the same application again.
So, as a team, they agreed that they would use the Event Storm to capture
assumptions that needed to be tested, ideas for fresh experiments, and lots
of questions where there were unknowns that they should explore with end
users.
The world was started again and the six weeks that followed took a
hypothesis- and experiment-based approach.
That story showed the user-centered motivations behind stopping the world.
A second story explains how a more technical team member stopped the world.
The team were all familiar with the practice but slightly alarmed to hear
the bell for real. They gathered around and the team had an open, honest
conversation. Tom shared his concerns and frustrations. The team re-visited
their current commitments and reminded themselves of some of the
Figure 4.9: Team gathers after the world has stopped to fix the problem
Within just 15 minutes, the team had re-aligned, re-focused their priorities,
and were renewed with energy and commitment.
One wonders if Tom had not aired his concerns and frustrations and just let
them sit in the back of his head, how long would this have gone on? What
ripple effects might there have been? What would the overall impact have
been?
You can learn more and collaborate about the Stop the World practice by going to the
Open Practice Library page at https://openpracticelibrary.com/practice/stop-the-
world-event/.
Real-Time Retrospective
In seeking to establish an open culture and kick-start an ethos of autonomy, mastery,
and purpose across the spaces we work in, we need to have feedback loops in place so
we can sense if we're moving in the right direction.
So far, we've just introduced a couple of very simple practices that help to build the
foundational culture. But will they work? How will you know? What if the new team
is muttering bad things under their breath about how the social contract was fluffy
and that Stop the World thing will never work in a place like this? Hopefully, this isn't
happening, but if it is, you're probably heading in the opposite direction to an open
culture, creating new silos and more closed behaviors and thinking.
We need to create a channel for folks to feed back on their own use of the practices,
even first impressions. Do they like these things and think they're helping to create the
foundation desired? Are there things they already foresee as barriers and blockers?
We've mentioned the practice of the retrospective a couple of times already and
we'll be coming back to it several times during this book, as the concept behind the
practice embodies the most important philosophy of continuous learning and continuous
improvement.
But the practice being introduced here is a little different. Rather than scheduling
retrospectives to start capturing feedback on our early foundation activities, could we
have a more continuous, always-on, self-service approach to capturing feedback?
The real-time retrospective technique shared by Emily Webber provides a simple, visual
tool for anyone to add retrospective feedback on any item at any time. Rather than
waiting for a formal feedback event or some kind of survey (which, let's be honest, is
not everyone's cup of tea), the real-time retrospective enables faster, more accurate,
and more direct feedback from everyone involved.
To make use of this technique, find a wall or surface that is long enough to handle the
amount of feedback you are expecting. Make a long line to represent time.
Draw a happy face above the line, a sad face below the line, and a surprised face along
the line (this represents the type of feedback that you wish to receive – positive,
negative, and surprised).
Explain to your participants your goal of using this practice and how to use your
prepped area. Here are the key things to share:
• Where to find materials for them to contribute.
• How the timeline begins at the left and continues until the timeframe you
designate as the end.
• Any and all feedback is welcome.
• One sticky note per feedback item.
Cycle back and monitor constantly to review the feedback and make appropriate
improvements based on it.
We find that making some simple adjustments early based on a real-time retrospective's
comments can really sow the seeds for autonomy, mastery, and purpose. Even reacting
to temperature control, improving catering, or switching people's positioning around
the workspace can have a great positive impact because the feedback came from the
people and they see their ideas being responded to quickly. This is what we mean by
empowering teams.
As always, we like to show more than tell, so let's look at a real example of where we
used a real-time retrospective.
Figure 4.10: A real-time retrospective used throughout a five-day DevOps Culture and Practice
Enablement workshop
This was used during the delivery of a five-day immersive enablement workshop –
DevOps Culture and Practice Enablement. Given the short nature of this, a real-time
retrospective is particularly useful because you don't want to learn feedback at the end
of the week (or even the end of one of the days) if there was a simple improvement or
resolution that could have been applied much earlier.
You'll see that a lot of the stickies in this example are beneath the middle line and
represent the sad feedback. Does this mean this delivery was going really badly? No,
not necessarily. When we facilitate real-time retrospectives, we actually tell people
not to worry about giving us lots of negative feedback. While it is nice to read the
happy or positive contributions (and it helps affirm what we should do more of), it's the
improvement ideas that we really want to capture. How can we make ourselves better
and how can we do that quickly?
You may not be able to read the low-level detail but a few great examples of early
feedback captured on the real-time retrospective that were addressed in the second
day included the following:
• Having more frequent breaks and reducing the long lecture times
• Ordering more vegetarian food
• Adopting a group policy to move on when discussions were starting to rabbit hole
(separately, the team came up with visual cues to call out when these things were
happening)
• Getting some more monitors to help with the collaborative nature of technical
exercises
• Going outside for some fresh air at least a couple of times each day
These are all great feedback points from attendees and very easily addressable. The
continuous feedback and continuous improvement journey starts here!
One of the things that make the real-time retrospective practice very powerful is that
it is always on display and always in sight of the people contributing to and using it.
We often use large eight-foot by four-foot portable foam boards to put the real-time
retrospective on so we can move it around the room and purposefully have it in places
that team members have to walk by regularly. This encourages them to contribute,
reorganize, or simply see what others have been feeding back.
If you're working with a distributed team using digital whiteboards and video
conferencing, we can easily start a real-time retrospective. We have a good template
that you can use as shown in Figure 4.11 and available to download from the book's
GitHub repository.
The challenge here is getting the adoption and usage to continue. Digital artifacts
such as this risk becoming information refrigerators where information goes to die
and is only found when someone opens it up, resulting in it being cold and out of date!
Perhaps information should have use-by or best-before dates! To avoid this, we need
strong facilitation. We should encourage all our team members to have these digital
artifacts visible at all times. We should, as a team, regularly do a virtual walk‑the‑walls,
and Engagement Leaders and facilitators should encourage contribution to the
real‑time retrospective where it makes sense.
You can learn more and collaborate about the real-time retrospective practice by
going to the Open Practice Library page at https://openpracticelibrary.com/practice/
realtime-retrospective/.
Social contracts, stop-the-world Andon cords, and real-time retrospectives are three
great practices to start building the foundation of culture. If well facilitated, the
conversations generated from these practices alone should start to drive a sense of
team autonomy and team identity. We're going to explore team identity further as it is a
crucial ingredient of great team culture.
Team Identity
Think about the teams you've worked with in the past. How many of them have been
really great experiences that you enjoyed waking up to every morning? How many of
them were terrible and you couldn't wait for it all to end? Have you ever worked with a
group and thought, That was amazing – I wish it could have gone on forever! If so, what
were the things that made working with that team so amazing?
Team identity, morale, and cohesion are all linked and are critical for building
high‑performing teams. They are all a state of mind, meaning the formula for creating
awesome teams is a bit more abstract and a lot less formal. You cannot force a team
to have high cohesion just by slapping people together, adding one part outgoing
individuals and two parts hard-working individuals! It must form more organically
than this.
Individual team identity can have an infectious enthusiasm and spread to other
teams in the organization. When there is a level of energy, team culture, and shared
identity evident, other teams want to be like that team. New teams should form to be
like that team. And, to avoid the risk of a team's identity becoming its own silo, we
can introduce ways to connect teams together through communities of interest and
communities of practice. We'll explore this more in Section 7 – Improve It, Sustain It.
Socializing
There are lots of little cultural nuggets and practices that we've used to try and
accelerate team forming and get us to a performing state quicker. If team members
naturally get along with each other or have shared interests, this can infect the whole
group and create bonds between everyone.
Getting to know people on a personal level and sharing interests easily spills over into
the day job. Humans by their very nature are social creatures. That's why the first thing
we always try to do with our teams is socialize together. Take that moment to get away
from the office and relax. Sometimes that can be in the pub after work or even during
the day while sharing lunches. In fact, on a previous residency, we've even had things
like a weekly curry buffet for the team to use the time to socialize together. On one
occasion we went twice in a week when one of the team was off sick on Curry Day!
If going to the pub after work does not suit, flip it on its head! We've often had teams
go for a big tasty breakfast together first thing in the morning! It's a great way to
start the first day of a new sprint with fresh eyes and fresh eggs! Socializing can take
many forms, but it's important to stop and celebrate the work that's been done. Going
out to play a game together or hit the bowling alley for an hour helps the team build
friendships and get to know one another.
It may sound super fluffy, but teams that share in activities like this will always help
each other out on the day job, even if they're unfamiliar with the tasks being tackled.
Team members will go the extra mile for one another when they've established such
relationships.
Social music playlists provide a simple way for teams to get to know one another.
Creating a shared playlist for all to be able to add songs they like, or upvote and
downvote others, can create a very positive vibe within the team. Getting to know
people's taste in music, or lack of taste in some cases, can open up new connections
with people who may be shy or less likely to engage in a social outing. We've used
Outloud.dj for social playlists, even on training events!
Network Mapping
Network mapping is a handy practice for individuals to get to know each other. The
practice is simple – line up all the people in your group so each person has a partner
they've never met before.
Each person writes their name on a sticky note. Set a timer for three minutes for the
pairs to write on a new sticky note something they have in common. With that in place,
shuffle the group and go again with a new partner, capturing something else they have
in common.
Two sticky notes are usually enough for each person to get started with this activity,
but you could go around again, especially if the group is small. Select one person to
go first and have them introduce the people they met. Pop their name sticky note on
the canvas along with the thing in common and draw a line connecting them. Pass over
to the next person and have them introduce the other person they met, continuing to
connect the people to the interests. With all in place, allow the team to draw additional
lines they have with others in the group, forming a large spaghetti map! It's a simple
practice, we know, but it helps to accelerate new groups getting to know one another.
You can take it one step further to demonstrate metaphorically what happens when
you break up a team that's already high performing. With the group standing around
in a circle, connect each person who has the thing in common with string or wool.
This should form a giant web; take the social contract and lay it on top of the web.
To demonstrate the negative effect that swapping in team members or removing
team members has on the group's inner workings, cut all the strings connecting those
individuals. You will see the social contract is no longer supported and falls away to
the floor, symbolizing the effect that moving team members in and out has on the
collective.
You can learn more about and discuss the network mapping practice by going to the
Open Practice Library page at openpracticelibrary.com/practice/network-mapping.
Figure 4.15: Team identity with some customized t-shirts and a team name
Let's look at a real-world example where a team name and a team logo kick-started
team identity:
A team with a strong identity will feel that they have purpose. All of these little
practices and many more can help a team form, storm, norm, and perform as outlined
in Tuckman's stages of group development5. It's important for teams to be allowed
the space and bandwidth to get to know each other. When given space, a team will be
happy to self-organize and share problems, and will want to succeed as a unit.
The responsibility of leadership is to just get out of the way and help facilitate the team.
If the cost of a pizza and a few beers each week makes for a better product or a happier
team, it's a negligible cost in the long run. Supporting a team to create bonds like this
helps to foster a positive culture. Creating this environment is key for a team to adopt a
mantra such as You Build It, You Run It, You Own It.
Finally, don't underestimate the value and importance of being social while having
lunch or breakfast together. This is something we build into our Open Innovation Labs
residency experience because we see the strength it generates in relationships. So,
while it may feel a little awkward, we still do it with our virtual teams. We have our team
members show us around their houses, meet their kids and pets, and create fun selfie
videos. We have virtual socials – order a pizza and chat over a glass of wine, watch a
movie online, and even go to a virtual escape room activity.
A really strong team identity and a high-performing team often result from working
in a great team space, full of energy, color, and information that the team is engaging
with. Earlier in this section, we introduced the information radiator concept (and the
information refrigerator). Let's explore this idea of information radiation further.
Radiate Everything
Have you ever walked into a room housing a high-performing team and found that it
just feels different? It's hard to capture why it feels different in text, but you just get a
sensation of cohesion and openness all at the same time. You may see sticky notes on a
wall showing things being worked on or tracking tools such as a burndown chart visible
to all who enter showing the team's progress so far. Possibly you see things such as
build monitors showing dashboards of code being built and deployed or tests executing,
or even live stats of the applications' and platform's current usage!
All of these things are information radiators, and they are probably one of the most
important things we have in our kit bags.
An information radiator serves one purpose – to show whoever passes by the latest
information. Radiate all the things, from test scores to retrospectives, architecture
diagrams, and even things that could hurt the team! Be proud of them and use them as
talking points in an open environment. In a truly agile environment, we emphasize open
and honest communication. Hiding behind a tool or burying a problem does not adhere
to the principles of transparency. An information refrigerator is exactly that – it's the
deep freeze where information goes to die. How many times have you had to request
access or sign in to something to find the information you needed? How often has that
information become stale or not been updated?
Besides the dashboards and sticky notes we like to have on display, a few other
information radiators are good to have in your foundation kit bag:
1. Visualization of business process design: The following figure shows an example
of this that utilized a practice called event storming. We'll be going into much more
detail on this in the Discover It section but, as a teaser, you can see the amount
of information being radiated – all of it was generated through collaboration and
conversations.
2. Visualizing what the team is working on now, what they've worked on previously,
and what they might be working on in the future: The following figure shows lots
of boards of sticky notes providing the team with this information.
In the following section, we'll see how teams can practice this when working in
distributed environments.
Having a second (or even third) monitor can help you have information radiators oozing
warm information to you at all times. You can set windows up so that you have different
applications giving you real-time information and also have your digital whiteboard and
video conferencing tools running. The benefit of using digital whiteboards is that you will
never run out of wall space!
In the preceding example, you'll also see a portable speaker (great if you have a team
social playlist running!), a small tablet that is being used to show a live build monitor,
wall space, and sticky notes – still useful for ideating and planning your own work.
There's even a plant helping oxygenate the space to generate creativity and clearer
thinking!
Check out many more tips (and add your own) by visiting https://openpracticelibrary.
com/blog/guidelines-for-setting-up-a-home-working-space/.
A further set of information radiators we use to visualize a team's health and mood
are supported by team sentiment practices and are a key enabler of high-performing
teams.
Team Sentiment
Team sentiment practices track the mood of a team and provide a mechanism for team
members to rapidly feed back, either publicly or anonymously, when their mood has
changed. Team sentiment practices enable problems to be identified, radiated, and
addressed early. They enable the team to quickly discuss where one or more team
members are suddenly troubled by the direction their work is going in and can also
provide an information radiator to assess the overall team's health and feeling.
One popular approach to introduce a team sentiment information radiator is mood
marbles. To use this, you will need a number of containers to hold enough marbles so
there is one for each team member participating. When we say containers, it can be
a simple transparent box (in Red Hat, we love containers and OpenShift is all about
containers, so we've even found a way to get containers to track our feelings!) or even
just a drawing of a container on large flip-chart paper. You'll also need two different
colored marbles (can be sticky notes) to start with. Other colors can be introduced.
Mood marble containers are placed in a strategic place where all team members will
see and walk past them regularly (for example, near the door or on the way to the
bathroom).
Team members are encouraged to think about their current mood. Start with positive
(often this is green) and negative (often this is red), and each team member adds a
marble to the container that reflects their mood.
Team members are regularly reminded that they can change their mood marble at any
time by removing their previous marble and replacing it with a different color. With a
container holding all team members' mood marbles, the information radiated is the
team's overall mood.
It can be interesting and very powerful to aggregate team sentiment data over time.
We can use this to identify trends and patterns and draw conclusions as to how, for
example, making some tweaks to technology practices, to other cultural practices, or
to leadership behaviors can have an impact on team mood and performance. Figure 4.22
illustrates mood distribution over time.
Let's look at another real-world story from one of our authors, who learned that
sometimes these practices achieve different outcomes than expected – but still very helpful ones!
Many people will have had to try team-forming with groups of people who have
never met each other or are not in the same location. This is particularly true during
the COVID-19 pandemic. Let's look at an example of how we used the same practice
digitally.
Figure 4.23: A digital mood marble container for distributed team sentiment capture
You can learn more and collaborate about Team Sentiment practices by going to
the Open Practice Library page at https://openpracticelibrary.com/practice/team-
sentiment/.
Radiate Failures
Having information radiators, dashboards, and mood marble containers showing green
can be satisfying and positive validation. But they are at their most powerful when they
are red. When they radiate that something is wrong or failing, they present a feedback
loop for the people around them to react and respond to.
If you're looking for some great ideas for things you can visualize in a creative way, we
cannot recommend this book enough – 96 Visualization Examples by Jimmy Janlén!6
It has some amazing ideas for displaying information that's critical for the team to
be able to respond to. And it has a super bonus section on the correct way to peel a
sticky note!!
You can learn more and collaborate about the visualization of work practice by going
to the Open Practice Library page at https://openpracticelibrary.com/practice/
visualisation-of-work/.
6 https://visualizationexamples.com/
The next item was to form a social contract. Eoin explained how it worked
and asked everyone to write at least one sticky note of something they
would like to add. The social contract started with a discussion about
the core working hours, which they agreed would be 10 AM to 4 PM – all
collaborative activities would be scheduled inside this time box. Emma
suggested a Be on time social contract item, which everyone agreed to. Dave
suggested a relaxed dress code, which everyone was very happy with. During
the next 30 minutes, the team got into some interesting conversations that
resulted in the social contract comprising the following as well:
• Have fun!
• Ensure everyone is heard
• Show empathy
• It's ok to be wrong
• Mob to learn and pair to build
• Be open
• Have a social evening out every week
• Breakfast and lunch as a team three times a week
• Be empowered
Eventually, the team agreed this was a good start but they would all continue
to reflect and think about how to improve the social contract. They all
signed it and Eoin hung it on the wall near the team.
Jen suggested the team try using a real-time retrospective – something she
had seen at a recent conference and was keen to try. Everyone had done
retrospectives before but they were not familiar with the real-time aspect. It
seemed like a good idea and they agreed to give it a go.
The next item on the backlog was an introduction to the room. Eoin had set up
a mood marble container by the door and explained how the practice would
work. The team looked a bit skeptical (Aidan made a comment that these
were not the kind of containers he thought he'd be working with today) but,
in the spirit of being open and experimental, would give it a go.
At the other end of the room was a big rope with an old bell at the top. There
was a big sign next to it that said STOP. Susan offered to explain how this
worked and how it could be used to stop the world. The team seemed very
happy with this – it made a lot of sense. In fact, most of them were thinking
of many scenarios that had happened to them in previous jobs where this
would have been very useful!
The team was introduced to the rest of the room. There were lots of
placeholder signs on different parts of the wall. One had now been filled
with the social contract. There were headings for Impact Map, Event Storm,
value slices, product backlog, and sprint 1. Eoin explained that, while the
room might feel empty now, in just a few days it would be filled with information.
The next item on the kick-off backlog was to configure the space. This
was a great activity to do as a team and was really in the spirit of being
empowered. The team wheeled in tables and discussed how they'd like to
be set up with a mobbing corner and pairing stations. They set up every
monitor Eoin had purchased. They even attached some Raspberry Pi mini-
computers to each one. Aidan added a PetBattle digital sign to each one! The
music was on, the space was configured, and it looked great. The volume of
the team over lunch started to rise. There were more laughs and banter.
The team had started to form. The cultural foundation was in place and would only
strengthen from there.
Conclusion
In this chapter, we introduced the foundation from a culture and collaboration
perspective and the importance of building this foundation before we introduce any
technology or attempt any discovery or delivery of product development. To achieve
strong business outcomes when using DevOps and OpenShift, we need to have high-
performing teams developing and operating software products.
The starting point to creating a high-performing team is enabling an open culture
where the team members feel psychologically safe and can achieve a sense of
autonomy, mastery, and purpose in everything they do.
To help us realize this and build the foundation, we explored several of our favorite
practices that have enjoyed the most success in kicking off new teams – social
contracts, a Stop the World system, gaining team identity, radiating as much information
as possible, including team sentiment, and starting a cycle of regular inspection and
adaptation.
We told a fair number of stories and shared some memories of applying these practices.
And we returned to our newly forming PetBattle team to see how they went about
starting their cultural foundation on their first day together.
Some of the practices in this chapter may work for you and your team. Some may not
and you'll want to throw the tool back into the box and choose something else. There
are many, many more you can explore at https://openpracticelibrary.com/tags/
foundation, or you can contribute your own. Remember, the precise practices and tools
you use are not as important as the fact that you are investing time in establishing a
cultural foundation.
Of course, a foundation can be improved upon and strengthened. In the next chapter,
we'll explore how leadership can help this from the outset and how open physical
environments and spaces can strengthen it. In Section 7, Improve It, Sustain It, we'll
explore further how the foundation can be strengthened to support scalability and
sustainability for product teams adopting the open way of working.
5
Open Environment and Open Leadership
In the previous chapter, we explored what it means to have an open culture and how we
can enable teams to build this into their way of working from the ground up.
When we talk about having empowered, self-organizing, self-directing teams, many
enthusiasts will say that management is supposed to just get out of the way and
disappear! Some will say if teams have true empowerment, surely they don't need to be
managed and they don't need managers.
Differentiating between leadership and management is important here. We want teams
to manage the organization themselves from the bottom up and for leaders to set
direction and intent that enables that behavior throughout the organization. This is
how open organizations such as Red Hat are led.
So, what is leadership's main role in all of this? What can leadership do that helps
deliver lasting change within a learning organization? We are going to tackle these
questions by first taking a look at the Kodak case study, the company that invented
personal photography but ultimately missed the digital revolution. What lessons can we
learn from it? In the Changing an organization section, we'll discover that leadership
can achieve far better execution by moving the decision-making to where the
information is. We'll then learn how to use Priority sliders and forced ranking to help us
prioritize where we can focus our efforts. In the final The space section, we'll take a look
at creating the right physical environment for our teams to flourish within.
1 https://www.amazon.com/Creative-Destruction-Underperform-Market-
Successfully/dp/038550134X
That's right. He invented the digital camera in 1975 and his bosses at the
time buried it – they said no one would ever want to look at their photos
on a television set. Steve also invented the first Digital Single Lens Camera
in 1989, but again, at the time, Kodak marketing thought that releasing the
product would interfere too much with their film processing business – so it
was buried as well.
Now, Kodak made billions on their digital patents; however, these expired in
2007. The company just could not pivot fast enough to change the customer
experience, which their competitors were delivering through digital, and
eventually filed for bankruptcy protection in 2012.
This provides a fine example of the need for continuous learning, continuous measures,
and a continuous appetite to pivot.
This is easier said than done. Human understanding and cognition are heavily
influenced by our cultural beliefs and upbringing. We just don't see what
others see. In some ways, this is what happened at Kodak. Their background
and company history were from a chemical engineering viewpoint; indeed,
Eastman Chemicals (which was spun off from Kodak) is still a very successful
company today. At the time, Kodak's management was incapable of seeing
the transformative change to the user experience that digital photography
represented.
• The Kodak story shows us that innovative change can come from anywhere
within an organization.
It often requires a different point of view from leadership for internal views to be
promoted successfully as a business strategy. The old command and control structures
themselves need to change so they are better able to encourage and embrace
innovative change.
Open Leadership
Shabnoor Shah, Open Leadership Global Lead and Coach for Open Innovation Labs,
explained to us how Open Leadership is a new, refreshing, game-changing way of
leading in a digitally transforming world. Its foundations are rooted in
the open-source way of thinking, working, and behaving.
A unique aspect of open leadership is that, because it is really a mindset and a way
of being, it is not restricted to top levels of management in the hierarchy. Open
Leadership can be practiced by anyone and at any level in the organization. However,
when leaders lead openly, the impact is significant and palpable in shaping an open,
positive, and progressive organizational culture. The results are reflected in employee
happiness, well-being and engagement, customer satisfaction, and overall profitability
and success of your organization.
The guiding principles of open leadership and open organizations are transparency,
inclusivity, collaboration, community and participation, adaptability, meritocracy, and
releasing early and often. The overarching belief of the open leadership mindset (at
Red Hat) is to default to open because open is a better way. This is underpinned by
four supporting beliefs: everyone has something to contribute, everyone has untapped
potential, everyone has the responsibility to lead, and everyone benefits when we (all)
put the organization first.
Changing an Organization
Organizational charts are usually drawn top-down in a hierarchy, and they tell
you nothing about the nature of the company or its interactions. Let's redraw our
organizational chart to see if we can better represent the interactions that might be
occurring. Our customers are drawn as our roots, the foundations that allow the whole
organization to survive. Next, we have the different business units drawn as petals
that interact using business processes to achieve their goals. The company president
is a rain cloud who is downward-looking, shielding the organization from the external
board and stakeholders, who are represented by the outward-looking face of the
company, the CEO.
Humans are biological organisms. Organizations should also be treated like organisms
rather than the product of their organizational chart. It makes sense – organizations
are the product of the people they employ. So, when looking at how organizations
change, we should ask the question how does change occur with humans? If the majority
of people find change unsettling, what makes a person go through change rather than
accepting the status quo?
People make changes when they are insecure, unhappy, or unsatisfied with their
current situation. Similarly, companies that are under stress are going to be forced to
implement change to survive. It is worth noting that putting the practices described
within this book into action will feel uncomfortable. Often, team members will
describe the changes as hard and difficult to make. This is actually a good thing! Change
is hard. Companies (just like humans) need to embrace creativity to be innovative
– so they can deliver awesome experiences to their paying customers. A successful
transformation isn't something that happens for only a focused period of time, with
no changes happening after that. Embracing innovative change as a sustained way of
operating is the challenge.
Does any of this sound familiar in organizations you have worked in?
• We have many unhappy silos within the organization.
• We are one team with many clans.
• There are too many hand-offs.
• There is too little accountability.
• We have misaligned incentives.
• We are punished for failures.
• We have people with egos who don't want to share.
• There are barriers to change – this is the way we have always done it.
If you identified with any of these, they are all traits and characteristics that require
executive sponsorship to change. It is the leadership's role to:
• Create a shared purpose and vision.
• Give permission to Change the Rules.
• Remove unnecessary gatekeepers and roadblocks.
• Encourage actions that Get Work Done.
• Decentralize decision-making.
• Demonstrate transparent communication.
• Break down barriers between silos and departments.
• Measure the system as a whole.
• Make Transformation everyone's job.
• Value results based on delivering organizational outcomes over process
adherence.
• Help build an engineering sanctuary where trust exists between the business and
technology.
• Demonstrate results with metrics and data.
• Encourage teams to own the full engineering stack.
• Build transparency into the full development process.
• Take the risk to start the first transformational change.
This is a long list. In the next section, we'll tackle three ways of bringing about these
traits by taking some advice from a submarine captain!
Achieving Greatness
In Captain David Marquet's book, Turn the Ship Around,2 he defines leadership as
embedding the capacity for greatness in the people and practices of an organization, and
decoupling it from the personality.
Giving Intent
In it, he talks about his experience of being a captain of a nuclear submarine and how
they learned not to follow the leader into disastrous situations. He vowed never to give
another direct order but, instead, to set intent. Rather than giving instructions, give
intent. Officers stopped requesting permission all the time, which meant psychological
ownership shifted towards them. Marquet talks about the two pillars that supported
this idea of giving control – technical competence and organizational clarity.
2 https://davidmarquet.com/turn-the-ship-around-book/
3 https://youtu.be/OqmdLcyES_Q
Figure 5.4 represents the sketch produced during the ten-minute video and has several
important messages.
If you haven't yet watched the video that produced the graphic above, please take the
time to do it now. It is inspiring, thought-provoking, and sets the leadership context for
the rest of this book.
4 The Phoenix Project, A Novel About IT and Helping Your Business Win – Gene Kim,
Kevin Behr, George Spafford
A business product owner and business SMEs are key people supporting our cross-
functional teams. One of their key tasks is to represent and communicate with all of the
product stakeholders and customers and share those conversations and outcomes with
the team. By helping the team decide on what should, and more importantly should
not, be delivered, the product owner vastly improves customer satisfaction because the
software that represents the most customer value can be worked on and delivered first.
By being part of the cross-functional team, product owners gain a shared
understanding of the technical challenges involved in writing and operating the
applications that deliver business services to their customers.
Ultimately, it is through this collaboration and shared understanding between technical
and non-technical team members that we get business leaders on board with IT. By
successfully delivering customer and organizational outcomes, the product owner can
show success to their management and in so doing, own the IT challenges with their
technology counterparts.
Now that leadership is enabling and supporting our product team's efforts, we are
going to switch gears next and take a look at the practice of using priority sliders,
which can help our team prioritize what product areas or team building topics are most
important.
Priority Sliders
Priority sliders are a great, simple tool to have in your kit bag! Like most of the practices
we're exploring, they're really just a tool to help us facilitate a conversation and drive
some shared understanding. We use them to drive team consensus on the direction we
should go in for a given period of time.
Running the practice is easy. Just get a small area of a whiteboard and do some
brainstorming around key topics of focus for your engagement. They could be things
like:
• Functional Completeness: How important is being 100% complete on some
piece of app functionality, or are we looking for some sort of thin thread through
all the functional areas?
• Security: We know security is important but how much time do we want to
invest now in hardening our software?
• Skills Acquisition: Happy, motivated individuals make great teams. Making sure
the team has all the expertise to build, operate, and own their software could be
important.
• User Experience: Are we building what we want or what we think our customers
want?
• Test Completeness: There will always be some testing. How important is it for
us to automate this right now or should we focus on test automation from the
beginning?
It is important to call out that these topics are not product features. They are product
areas or team-building topics, without which we would not get a great product. For
example, perhaps you want to drive more sales through your application so you decide
to implement some form of push notifications to drive your customers directly to
your app. Push notifications are not something you would add to your priority sliders,
but perhaps market penetration could be. This topic could include a raft of additional
features or experiments you could run. It's good to have a few examples of these ready
and the team should always ensure they know what is meant by each item on the slider
before you begin.
With your list in place, get yourself some Post-Its and Sharpies and write them up in
one column. For example, let's say we use the five headings from above. With five items,
we now need a scale numbered from one to five for each one. The more items your
team is prioritizing against, the higher your scale will go.
With the scale in place, the team needs to decide what they think is the most important
item relative to the other items. The key with priority sliders is that no two items
can take up the same priority – so if 5 is your top, then User Experience and Security
cannot be on the same level. Each item must be traded off against another – hence
some teams refer to this practice as trade-off sliders.
In the picture above, we can see a team discussing a priority sliders session in progress.
Trying to figure out as a team how to force rank the options leads to a lot of great
discussions. Below we can see two complete sets of sliders. It's important that everyone
agrees on whether 1 or 10 is the highest importance!
There are many ways for a group to gain consensus – they could just have a good
old-fashioned chat about each item and agree one by one as they go down the list. This
can work well for some groups but for others, a more formal approach can ensure all
people participate. Here are a few ideas:
With the completed canvas, we can get a sense of our upcoming priorities. These could
be set for a few weeks or for longer but like all practices, this is not a one-time thing! It
should be returned to once some work has been done in any given area on the canvas,
checking whether it is enough and whether our priority has changed.
Great product owners and scrum masters could take the priority sliders with them
to sprint planning events. When deciding what to bring into sprint or what to omit,
being able to shortcut those decisions by referring to the canvas can save time. Like
all these practices, it's very important to keep them visual at all times and, of course,
this is much easier when everyone can be in the same room. Of course, there are still
excellent ways to run a priority sliders session with a distributed team, which we'll look
at next.
Figure 5.8: Digital priority slider canvas for use by distributed people
When there are sliders on the wall and everyone's on their feet and having to add their
dot, the environment is so well set up for conversation and collaboration, people can't
help themselves! On a video call, it's too easy for some folks to disappear and not get
involved. So, really strong facilitation is needed. Adopting a liberating structure such
as 1-2-4-all will help get engagement from the outset. 1-2-4-all is a simple facilitation
technique whereby individuals are first asked to independently and privately provide
their input or opinion. They then pair up with one other person to discuss each of their
ideas and merge them. Then two pairs group together to converge each of their ideas
before the full group gathers together to bring all the input together.
You can learn more and collaborate on the priority sliders practice by going to the
Open Practice Library page at openpracticelibrary.com/practice/priority-sliders.
The Space
One of the big focus areas that can really help leaders establish a strong foundation for
their team is in finding a suitable space for team members to work.
In the previous chapter, we explored the importance of motivation and autonomy,
mastery, and purpose. The physical working space is a great enabler. Great teams work
in great spaces.
When we run our Open Innovation Labs residencies, we do them in one of our own
physical labs, which was purpose-built for this type of working. It is often the number
one concern our customers' residents have – they say their own buildings are not set
up for this type of work and there will be many blockers put up by Facilities, Security,
Health and Safety, and other departments. We aim to use the time in the labs to show
leaders and stakeholders how much engagement, energy, and value is delivered from
this way of working and how much the physical space enables it.
We have some space recommendations based on a collection of best practices and
lessons learned from building out global, permanent Open Innovation Labs facilities,
as well as from building temporary, pop-up team spaces. Val Yonchev,
Open Innovation Labs EMEA Leader, and Mike Walker, Global Senior Director of Open
Innovation Labs, provided much of this thought leadership in their own contribution to
the Open Practice Library, which you can read at openpracticelibrary.com/practice/
team-spaces/.
Let's look at a few examples. First, a lab space such as that in the next picture is very
open. There are no barriers or wall dividers. It is highly configurable with all the tables,
chairs, monitors, and even plants on wheels!
Open workspaces are well lit, ideally with some natural sunlight, with tables and chairs
well distributed around the room. There is a lot of wall space for all the information
radiators.
Information radiators should be easily accessible and in the line of sight of team
members. There's little value in having all this information if the team is not going to be
seeing it regularly.
Most importantly, the space should enable conversation and collaboration to happen as
much as possible.
Figure 5.13: Configurable workspace set up for pairing and mobbing collaboration
Just how much do we need to invest in the space, and what is most important?
Consider this feedback from one engagement:
After the four-week engagement, I got some feedback from the now CIO. He
said that Security watched the whole engagement with interest and moved
from a mindset of saying No, you can't do anything to we see what you want
to do and why and we'll help you get there.
You may be looking at the images and examples above and thinking it is impossible to
have so many people in the same space at the same time. We often continue to probe
such statements as we believe so strongly in the value of co-location and energetic
workspaces full of visualization and information radiation. Of course, with our recent
experience of the COVID-19 pandemic, it really did become physically impossible for
most of us and we've all pivoted to using virtual spaces.
Virtual Spaces
In this chapter and the previous chapter, we've looked at individual practices and given
some thoughts and guidance on how to run them with people distributed. With virtual
engagement and a distributed team, there usually is no physical space to consider.
We are immediately challenged, and we risk failing to create the autonomy and
psychologically safe space needed to facilitate great collaboration.
There are two spaces we still need to consider and invest in:
• Firstly, each individual's space. In the previous chapter, we explained how we
should radiate everything and, even in virtual settings, individuals need good
spaces, multiple monitors, tablets they can use as build monitors, their own wall
space, and so on. Investing time and money upfront to suitably equip each team
member will set the overall team up for success.
• Secondly, there is the digital space. That tends to mean investment in solid
tooling, for which there is a growing set of vendors offering support. Our
guidance is to think about all the practices and ways of working that work so well
in the physical space and find a tool or set of tools that will allow the emulation
of that exact same practice. So, if you use the burndown chart practice, find a
way to do that digitally (such as JIRA5). If you have a product backlog of index
cards that everyone works from, find a tool that can manage that (such as Trello).
If you regularly collaborate with people moving sticky notes around walls and
annotating them, find a tool that enables that (such as Miro6 or Mural7). Don't
settle on one tool. Don't decide on one upfront. Take an agile approach and
encourage the experimentation and empowerment of the people actually using
them.
5 https://www.atlassian.com/software/jira
6 https://miro.com/
7 https://www.mural.co/
You can see the differences between physical space and digital space and, if you have
worked in both environments, you'll know the difference in feeling and energy. They
are very different, and we've learned there are benefits to both of them. Whilst the
virtual workspace does not provide the same culture, energy, and multi-dimensional
visualization you can have in the 360 degrees around you, it does provide you with
infinite space. You will not run out of wall space. There are environmental benefits to
not using thousands of sticky notes. Inviting subject matter experts from the other side
of the world to review and comment on your team's collaboration activity is suddenly
a lot easier and cheaper. Your security and audit folks may also feel more assured by
using these digital tools. It will be interesting to see, in coming years, whether the
virtual and physical space co-exist, if one will become the norm, or if they will merge
together into some kind of hybrid.
Conclusion
In this chapter, we explored further what it means to be open with a particular focus on
introducing open leadership and open spaces. The open culture practices introduced
in Chapter 4, Open Culture, help teams become autonomous and self-managing.
Leadership has a role to play in creating an environment that facilitates and enables this
way of working.
We emphasized the importance of leadership setting intent when establishing a
solid foundation for a team to base its product discovery and continuous delivery
on. This is crucial for a DevOps culture to be successful. As we get further into this
book to explore technical practices and how platforms such as OpenShift can deliver
continuous business value, we need our teams to have a strong cultural foundation
enabled and supported by an open leadership culture driving open organization
behaviors from the top down.
We looked at some examples of strong, open leadership – just-in-time decision-making,
collaborative priority sliders, and their role in helping find the right workspace for the
team, either physical or virtual.
We explored some of our top recommendations to be considered in designing team
spaces and also the minimal needs to start a team off with working with information
radiation and in an open, collaborative space.
With the mindset and practices adopted in this chapter and the previous chapter, we
have a strong foundation of culture. In the next chapter, we will look at the other aspect
of the foundation – technical practices and technical landscapes and what we do to
establish a strong technical foundation prior to starting product development.
6
Open Technical Practices – Beginnings, Starting Right
So far, we have talked about the foundational culture and collaboration practices that
support our discovery, options, and delivery Mobius loop. The next two chapters
establish the technical practices that teams should implement to make the foundation
even stronger.
Think of the Mobius loop as an engine turning from discovery through options
generation and into delivery. This cycle continues by doing more delivery until we need
to revisit the outcomes we've targeted. Delivery is where we take the concept and make
it real. As we deliver, we will learn a lot and garner feedback from our stakeholders and
our team. At some point in time, we will need to revisit the discovery side of the loop,
either to adjust what we know or to realign what we deliver next.
Imagine we've done one iteration and built some new features for our product, such
as a leaderboard for our PetBattle app – it's very likely we'd be OK to demo and release
that functionality without investing time or effort in building out a solid technical
foundation. But as our iterations continue and the application's complexity grows, we
may find ourselves beginning to drown in manual testing or technical debt. As we hit
iteration five or six, things that were working will most likely start to break and our
ability to predict how much we can do will start to fall apart. This could demotivate the
team and have people lose confidence in our product or ability to execute. Breaking
trust at this stage is a difficult thing to come back from.
To successfully build software incrementally, we need to ensure we can operate
smoothly and sustainably. Constant rewrites and fixes will hinder this.
For these reasons, it's important we support our delivery capability with a set
of foundational technical practices, much like we did with the collaboration and
culture. Practices such as configuration-as-code, infrastructure-as-code, and even
everything‑as-code can help ensure a team's work is repeatable. Identifying how a
team will do testing and automating that testing can lead to higher-quality output
and simplify defect management. Picking the right development workflow and tools
will accelerate the team's ability to deliver the software and not spend all their time
managing their tools instead.
Adding practices to the foundation is not a one-time activity. As the applications grow
in number and complexity, it's important to bolster the foundation with new and more
comprehensive use of practices.
In the next two chapters, we will share the technical practices implemented on the
foundational level that have enabled us to achieve the best success. They are part of
the foundation because they are not time-boxed practices; rather, they are continually
carried out as part of our daily work. Later in the book, we'll explore how bolstering
these practices with great use of the platform can enable sustainability and scalability
of continuous delivery.
It's also important to not have too much in the kit bag – it is our experience that when
bringing new teams on board using an accelerator, the technical burden it carries can
cause some team members to not engage with it. In order for teams to take ownership
and maintain something that accelerates their delivery, they often need to feel like
they helped to build it. If something is too complex to engage with, it becomes "that
thing that Brian knows all about, so I don't need to know it." This kind of behavior is
dangerous in a team as it can lead to silos and gaps in the team's collective ownership
and responsibility. Often, when the build system then fails or breaks in some way, that
person becomes the single point of failure when trying to recover it.
In the next section, let's look at two practices that help build collective ownership and
understanding.
Pairs and mobs of engineers allow for more knowledge transfer and create a shared
understanding of the problem and the journey to get to the solution.
• No pull requests: Pairing means you have to share your thought process with
your pair. This synchronization means code is being reviewed as it's written.
Often, people reviewing pull requests are too busy writing their own code and
they can only give very superficial reviews. When you pair, you review as you go
and therefore write leaner, better-understood code. You won't cut corners when
pairing as someone is watching.
• Team bonding: Humans are social creatures; we share and interact all the time.
Pairing and mobbing facilitate this interaction. Instead of sitting in a room with
headphones in, ignoring the world around you, developers in pairs look happier.
A room with mobbing and pairing going on is louder. Happy coders lead to better
code.
• The knowledge stays in the team: With more than one mind solving the same
problem, the understanding and logic stays with them. As pairs naturally shuffle
from task to task, the depth of knowledge stays with the team and not with an
individual. This means when holidays or even flu season take over, the team
can still continue to work at pace knowing the Unicorn is not leaving with key
information.
When tackling new problems (whether that's a new framework, a new language, or even
a particularly hard problem), we will all group together. Huddled around one computer
with a very large screen, we can explore the problem we're trying to solve. We mob
around the problem until the cross-functional team is satisfied that they have enough
knowledge or a rough scaffold of how to complete their tasks. The team then breaks
away into groups of two to pull items from the backlog and begin implementation.
Mobbing and pairing allows teams to cross-skill. Sharing experience and expertise leads
to better teams. Well-oiled teams working like this can continue to build their product
sustainably and at pace, driving toward outcomes, not outputs.
You can learn more and collaborate about these practices by going to the Open Practice
Library site at https://openpracticelibrary.com/practice/pair-programming/ and
https://openpracticelibrary.com/practice/mob-programming/.
Container History
If you are over a certain age (over 30!), it is very likely your first computer program
involved compiling source code and statically linking it with libraries from the operating
system. Computer scientists then invented dynamic linking – which is great: you
could patch one library and all of the programs you had written would pick up that
change once restarted. This of course created a different problem – managing all of
the dependencies. Packaging technologies such as RPM and YUM were created to
help solve the dependency problem when distributing and managing Linux operating
systems. Operating system distributions are one mechanism for collaboratively sharing
and managing lots of different software packages at scale, and ultimately it is the
software communities that evolve around these different software packages that solve
real-world problems.
Of course, running your application on one physical machine is fine, but running lots
of applications across lots of machines becomes a standard requirement as you scale.
Virtualization solved how to run many operating systems on one machine in an isolated
fashion. Indeed, the prominent form of cloud computing has been running virtual
machines on someone else's hardware.
Virtualized infrastructure solved many problems of running applications at scale.
However, configuring all of the pieces required to manage a fleet of virtual machines
(VMs) led to an explosion of tooling and technology related to configuration
management. There was also the problem of "VM sprawl" – lots of VMs everywhere
using too many resources that were difficult to patch and manage as a whole. Every
application was managed independently, possibly by different teams. It was seen
as desirable to reduce the interdependency between each application and so each
application was also deployed in its own VM. To help control this spread of VMs, they
were managed centrally by an infrastructure and operations team. Silos between teams
were built! Many tools were created to help configure VMs. Each VM has overhead for
running system processes and daemons, and so a lot of effort has been spent building
tools that help avoid over-allocating machine resources to help save money.
For developers, the experience of getting a VM provisioned within an organization was not
particularly self-service. Requesting a VM to be provisioned takes time. Workflow,
ticketing, and provisioning systems were automated to try and help speed up this
service request process. This was made radically better by public cloud services with
an API-driven infrastructure, where provisioning a VM takes minutes and there is real
self‑service for developers. The control and VM sprawl issues still exist, though.
The application stack that developers used is still dependent on the operating system
and libraries packaged into the VM that came with the kernel (for example, libc, libssl).
And developers were usually not allowed to change the VM configuration, either
because of perceived security or stability concerns. This was an infrastructure or
operations team responsibility. Often, VMs were not easy to update, patch, and manage.
It was not clear to the infrastructure or operations team what the effect of updating
and rebooting a machine would have on the applications they supported.
2 https://www.section.io/engineering-education/history-of-container-technology/
Technology companies that run a large number of workloads are always looking at
ways to save resources and ultimately be more efficient. Roll forward to 2006 and a
number of technology enhancements were made within the core Linux kernel that were
related to the isolation of Linux processes. Google introduced the technology initially
called process containers that was later renamed cgroups. It was designed for limiting,
accounting, and isolating resource usage (CPU, memory, disk I/O, and network) of a
collection of processes.
A novel solution to the dependency problem for containers was introduced by Docker
in 2013. Packaging applications and their dependencies into container images led to an
explosion in popularity for containers. Container images were made freely available and
distributed online via container registries such as dockerhub.io and quay.io.
A running container is really just Linux processes with extra protections and data
structures supporting the process in the kernel. Running containers on a single
machine was easy; running thousands of containers across a compute farm was a
much harder problem to solve. Enter into the scene container orchestration engines
of which the Kubernetes project is by far the most widely used today. The OpenShift
Container Platform is a product that brings together Linux, Kubernetes, and container
technologies to allow enterprises to run containers safely and securely at scale in the
enterprise.
Of course, to get to real business value, it is not enough to package your applications
as containers and deploy a Kubernetes platform such as OpenShift. Just because you
build it does not mean that all users will immediately flock to the platform! Modern
application delivery using trusted supply chains forces new tools and ways of working
onto your teams. New behaviors are required.
With containers, the developer's user experience has been radically changed.
Developers can now self-service their applications without having to go through
the provisioning of a VM. Of course, someone still had to provision the platform!
Provisioning and starting of containers took seconds and minutes, and today with
serverless-focused technology stacks, milliseconds.
Developers can control the packaging, running, and upgrading of their applications
easily using container images. The application is no longer tied to the version of
libraries packaged in the kernel. It is possible to pull out all of an application's code
and dependencies into a container image. You can run multiple versions of the same
application together without being dependent on the same version of libraries in the
kernel.
The immutable nature of a container image also improved the overall service quality
of applications. Teams could ensure that exactly the same container image would be
run in different environments, such as development and production. To be able to
run this immutable container image in different environments, developers started
to learn that by externalizing their application configuration they could easily run
the same container anywhere. The application configuration management was now
built in as part of the container deployment process and the platform. This led to
clearer boundaries between what the developers controlled (their applications and
configuration) and what ITOps controlled (the platform itself).
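To make this concrete, here is a minimal sketch of what externalized configuration can look like on Kubernetes or OpenShift: the same immutable image is deployed in every environment, while environment-specific values live in a ConfigMap that is injected at runtime. The resource names, image, and keys below are illustrative assumptions, not taken from PetBattle:

# Illustrative only – names, image, and keys are assumptions
apiVersion: v1
kind: ConfigMap
metadata:
  name: pet-app-config               # environment-specific values live here
data:
  DATABASE_URL: "jdbc:postgresql://db.test.svc:5432/pets"
  LOG_LEVEL: "debug"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pet-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pet-app
  template:
    metadata:
      labels:
        app: pet-app
    spec:
      containers:
        - name: pet-app
          # the same immutable image is promoted through every environment
          image: quay.io/example/pet-app:1.2.3
          envFrom:
            - configMapRef:
                name: pet-app-config   # configuration injected at runtime

Promoting to production then means deploying the identical image alongside a different ConfigMap, rather than rebuilding the application for each environment.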
Concerns such as service levels under change, security, storage, high availability, and
load balancing/networking usually require everyone's involvement. It is the coming together of these
teams and the DevOps conversations between them that form the backbone of modern
DevOps practices today.
You can learn more and collaborate about the containers practice by going to the Open
Practice Library page at openpracticelibrary.com/practice/containers.
Pipelines — CI or CD or CD²?
"The job of a pipeline is to prove your code is not releasable." – Jez Humble
OK – let's set the scene and get some basics out of the way. How do we take our code
from individual lines of text on a laptop to being an application running in a container
in production? Well, there are lots of ways with lots of kooky-sounding names! Teams
call the journey our software goes through a pipeline, but there are numerous ways to
implement one.
Let's pause for a minute and think about what a software pipeline really is with the help
of our friend Derek, the DevOps Dinosaur!
So, how do we know Derek is as big and scary as I need him to be? Let's start
with his parts. First, we might analyze each of the individual parts of our
dinosaur. Give them a quick check-over and ensure they meet the standard
we set. For example, do I have two arms and two legs for my dinosaur? Has
he got enough teeth? If it all looks good, we can then pop the parts in the
Dino-Constructor 5000™.
We'll start him on a climbing wall, much like the one you'd see recruits on
in an army movie. Then if he's lucky enough to get over that hurdle, he's on
to the next obstacle where he must jump over some spikes, Indiana Jones
style! Next, we check how fit our dinosaur is; if he's able to run fast on the
treadmill, he gets to pass on to the next step. Here he must try swimming
past some fish that are trying to nibble on him. Once through that, perhaps
he has to jump through a ring of fire. If Derek is capable and makes it
through the obstacles, he can then run toward his pen – however, if Derek is
not careful, he may be stopped by the swinging blade that was menacingly
looming over him the whole time, like something from a Mario level. At any
time, the blade could drop and stop Derek dead in his tracks. Let's for a
moment assume Derek was careful and has made it into the pen where the
other dinosaurs are.
Derek can now live out his days with the other dinosaurs in the Dino Petting
Zoo, Danny and Debbie. However, unfortunately for Derek, Debbie the
dinosaur is quite mean to him. She keeps stealing all of the precious grass
and water that Derek likes to eat (Derek is vegetarian in this metaphor!). So,
in order to give Derek the isolation and correct amount of things he needs to
be strong and healthy, the zookeeper comes along and moves him to a new
pen.
The obstacle course we built for Derek represents the steps we should carry
out to further assess our code quality. The initial hurdle Derek must get over
could represent some unit testing. It is important that these obstacles present
enough of a challenge and are not so easy that they provide no value. For
example, if Derek can make it over the climbing wall with ease,
then it's probably not testing all the parts of him. Imagine for a moment that
we decided to add another arm to Derek. We now have a terrifying three-
armed dinosaur! If we were to ask him to climb the wall again, he would find
it much simpler than before. In this regard it is important to increase the
difficulty of the climb, perhaps widening the gaps or making it steeper so
it presents more of a challenge. Thinking back to code, the logic still holds.
When we introduce new features to our applications, we need to improve
the testing coverage to include this. Writing tests is not a one-time thing; it
must continue to evolve alongside our application development.
The other obstacles represent additional testing types. The small piranha
pool Derek must swim through in order to get to safety could represent
some early integration tests. The treadmill he must run on may be a kind of
performance testing. The final obstacle Derek must pass unscathed is the
giant blade hanging above him. Constantly looming, this testing type is, in
my eyes, often the one that gets forgotten about. Derek may think he is free
and run toward the pen only for the blade to drop on him and mean he can
go no further – this is an example of security testing. Often forgotten about
until the last minute, it can be a showstopper for final deployment in a lot of
cases.
Figure 6.11: Failing to make the cut when moving to a new pen
Once Derek arrives at the dinosaur pen, he has to share the space with
some other dinosaurs. Perhaps, at this point, the code has been deployed
to a public cloud or a VM with competition for shared resources. Hopefully,
by now, the operations team has noticed the application is running out of
memory or there is a lack of compute. To combat this problem, the team
might automate the containerization of the application. Once the code is
in a container, it becomes shippable. We can move the container between
cloud providers or even just between environments. At this stage, the code
is packaged up with all of the dependencies it requires to run. This ability
to move code without the need to rebuild and test can be safely achieved
by building immutable container images. Versioning the application
configuration separately from the built software means we can also
horizontally scale our software easily by running more instances based on
user demand.
Thanks to Derek, we now know that a pipeline is a series of steps we use to build,
package, test, and deploy our software. Now, let's look at some of the terminology
people use to describe a software delivery pipeline.
Continuous Integration
Continuous Integration (CI) is a software development practice that was popularized
by the authors of Extreme Programming. There have been countless books written
about it but the shortest definitions are sometimes the simplest! The three-word
definition of CI is to "integrate code continuously." That is to say, developers and
teams should regularly commit and push their code into the repository and have some
automated process to compile, package, and test that code. This process should happen
frequently – many times throughout the day for maximum effect.
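As a minimal sketch of what that automated process can look like on OpenShift, here is a hypothetical Tekton pipeline that fetches the latest commit and then builds and tests it on every push. The task names, parameters, and workspaces are assumptions and rely on suitable Tekton Tasks (such as the community git-clone Task) being installed in the cluster:

# A minimal CI sketch – task names and workspaces are assumptions
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: app-ci
spec:
  params:
    - name: git-url
      type: string
  workspaces:
    - name: source
  tasks:
    - name: clone
      taskRef:
        name: git-clone              # fetch the latest commit
      params:
        - name: url
          value: $(params.git-url)
      workspaces:
        - name: output
          workspace: source
    - name: build-and-test
      runAfter:
        - clone
      taskRef:
        name: build-and-test         # placeholder Task: compile, package, run unit tests
      workspaces:
        - name: source
          workspace: source

Wired up to a webhook so it runs on every push, a pipeline like this provides the fast, frequent feedback that makes integration genuinely continuous.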
More teams fall down on this CI hurdle than you may think. Often, teams think they are
practicing CI when in fact they are not.
Integrate Continuously
I worked on a Labs residency for a security
company a few years ago. The team was fairly
junior with several team members who'd just
graduated. The team decided to create feature
branches when writing their code so as to not
break the workflows of others. Unfortunately,
this led to us having branches that lived
for the duration of the sprint. We had lots of
automation that was triggered when code was
merged but we weren't merging frequently
enough.
For two sprints, we had the same mad dash to merge all our features at
the end of the sprint just before the weekly demo – and it was chaotic, to
say the least! This resulted in bi-weekly "mini-integrations." We had lots of
automation set up to validate our code but we were not using it frequently
enough. As you can imagine, there is nothing continuous about this process
– we were not integrating continuously!
To remedy this, we talked about it over our retrospective. If the tool you're
using, in our case Jenkins, can give you data about the frequency of builds
or its usage stats, these can be great things to print out or bring to a sprint
retrospective. A brilliant Scrum Master I once worked with always did
this and it helped the team focus during the retro on actionable things
that we could do to make things faster. In our case on this residency, we
were operating in one-week iterations. This meant only four days of actual
development time! Through the retrospective, we identified a few actions
from looking at the data being supplied to the team:
1. Integrate continuously – This was a big change for us, to try as often as
possible to merge features together and get that validation we needed to
avoid the merge hell we were encountering during demos.
2. Smaller features – The team realized that work was being broken down
into too-large chunks. Each chunk was taking most of the sprint to
complete. A smaller task size for each feature meant we could validate
faster in smaller chunks whether things would work or not.
You can learn more and collaborate about the CI practice by going to the Open Practice
Library page at openpracticelibrary.com/practice/continuous-integration.
Continuous Delivery
Continuous Delivery (CD) is a development process where on every code change,
teams build, test, and package their code such that it can go all the way to production.
It is delivered to the doorway of production in an automated way but not let in. Lots of
teams get to this state, and it is a great place to get to, but are held back from releasing
all the way to production usually due to organizational release cadences or additional
approvals being required. The important thing here is that they could release to
production if needs be.
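One common way to implement "delivered to the doorway of production but not let in" is to automate every step up to a manual gate. As an illustration only (not the toolchain used elsewhere in this book), here is a minimal GitLab CI sketch in which build, test, and deployment to a test environment run on every change, while the production job waits for a human to press the button; the scripts and environment names are placeholders:

# .gitlab-ci.yml – illustrative only; scripts and environments are placeholders
stages:
  - build
  - test
  - deploy

build:
  stage: build
  script:
    - ./scripts/build.sh            # compile and package the application

test:
  stage: test
  script:
    - ./scripts/run-tests.sh        # automated test suite on every change

deploy_test:
  stage: deploy
  script:
    - ./scripts/deploy.sh test      # always delivered to the test environment
  environment: test

deploy_production:
  stage: deploy
  script:
    - ./scripts/deploy.sh production
  environment: production
  when: manual                      # releasable on demand: the doorway to production

Removing that final when: manual line is, in effect, the difference between continuous delivery and continuous deployment.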
It was a new experience for them, as they were only used to seeing the
whole thing at the end.
Fast forward to the end of the project and the business had started to see
the value of the tests we'd written. On every change, we had automated the
building of the mobile app, deploying it to the app store, and we ran a huge
suite of integration tests. They continued to do their manual testing phase
at the end, which did throw up a few bugs (which we then wrote automated
tests for and fixed). However, when they compared the number of issues
found during this phase against other similar projects, there were far fewer.
On the day of go live, the team was all set to push the app to the app
stores and do the final deployment of the APIs. The retailer had marketing
campaigns and other events aligned with this go live date, so the pressure
was on! The teams were making minor app fixes right up to this point. Every
change required the business to sign off the release, which meant involving
the manual test team. Due to the pressure of the release window, the
business decided to only do a quick smoke test of the app to see if the issue
being fixed was resolved on a specific release candidate. This smoke test
passed, so they were ready to roll – however, our automated tests threw up
two failures in a service delivering product reviews within the application.
There had been a minor change to the data format in the system of record
further down the architecture that meant some data transformation
functions were not working. This was not caught by the manual test team
as they were not smoke testing this functionality. We flagged it up that our
tests had spotted a regression, and the release was paused while this issue
was resolved.
It may seem like a trivial example, but this marked a big turning point for the
retailer. They'd witnessed first-hand the speed, reliability, and effectiveness
of our automated test suite as well as the speed at which we could build,
validate, and deliver a production-ready application. The act of writing and
running automated tests built huge trust within the wider organization,
prompting them to change their ways radically in favor of more automation
and more test automation.
You can learn more and collaborate about the CD practice by going to the Open
Practice Library page at openpracticelibrary.com/practice/continuous-delivery.
A number of sprints into the engagement but still early in the development,
the team was creating the authentication flow for users. I was pair
programming with one of the engineers and we'd written the logout
functionality. We had written tests for the feature and demonstrated it to the
product owner, running in our test environment. The Definition of Done
the team had agreed on meant we had to show the feature to someone from the
product team so they could accept it. So, as far as the engineering effort was
concerned, we were done. The product owner did a quick test and it looked
good in the test environment, so at the end of the sprint when we promoted
all our changes up to production, our feature was released.
The UX folks were doing some usability testing with the latest increment
of the application when they noticed some buggy behavior with logout not
working from one of the screens. This was reported to the engineer and me,
the pair who had initially worked on it, and we could spot the issue immediately. This was
a small fix, so we wrote another test and made the change.
We demonstrated the process to the product owner – writing a test that
failed, writing code that would make the test pass, iterating, and then
delivering that fixed logout code all the way to production. The ability to
deliver small incremental improvements into the hands of our end users
when they were ready to be shipped paved the way to deliver continuously.
The lesson here is that Scrum may start to impede a team's ability to
continuously deliver small incremental changes safely to production, because
Scrum, by default, delivers at the end of a sprint. The mindset we moved towards
instead was simple: when the work is done, just ship it to production.
We have learned about the role of software pipelines, which codify the steps required
to build, package, test, and deploy our application code into various environments
up to but not necessarily including production – the practice of CD. We then looked
at an approach to continuously deploying small incremental changes all the way to
production.
You can learn more about CD² by going to the Open Practice Library page at
openpracticelibrary.com/practice/continuous-deployment.
Everything-as-Code
You may have heard about this one before: [insert software term here]-as-code.
Examples include infrastructure-as-code, config-as-code, tests-as-code, and
now everything-as-code. This practice has been around for a long time but some
organizations have been slow to adopt it.
Here's the problem – historically, organizations have had to get expensive specialists
to deploy complex environments. They would spend hours going through pages of
instructions, line by line, eventually getting the deployment to work. A number of weeks
would pass and the organization would want to create another environment, exactly like
this one, for further testing. What do they do now? Call the specialist and ask them to
come back, at great cost! This is fine if you enjoy paying for expensive specialists over and over again.
So, what's the solution? The everything-as-code practice is simple: you treat every
part of a system as you would any other line of code. You write it down and store it in
a version control system, such as Git. Do we really mean to automate every part of the
system? Yes.
We start by automating the infrastructure layer, the lowest level, from the bare metal
servers to the operating systems, networks, application configuration, and on up
through to application deployments.
This automation effort sounds like an awful lot of work, and could be expensive in terms
of people's time – why should you invest in doing it? Here's why:
• Traceability: Having your environment descriptions and structure stored in a
version control system allows you to audit changes made to the system and trace
them back to the individual who made them.
• Repeatable: Moving from one cloud provider to another should be a simple task.
Picking a deployment target should be like shopping around for the best price
that week. By storing all things as code, systems can be re-created in moments in
various providers.
• GitOps: A single source of the truth means no more tribal knowledge or experts
needed to set up cables or attach hard drives.
• Phoenix server: No more fears of configuration drift. If a server needs to be
patched or randomly dies, that's OK. Just create it again from scratch using the
stored configuration.
Teams who treat the whole system as code are stronger, faster, and better for it. We
should no longer think about just infrastructure-as-code but automating the whole
system – everything from application properties to networks and security policies.
Then we codify it!
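As a small, illustrative sketch of what this looks like in practice, the definition of an environment can live in Git as plain Kubernetes YAML and be applied from there whenever it is needed. The namespace and quota values below are assumed examples, not PetBattle's actual configuration:
apiVersion: v1
kind: Namespace
metadata:
  name: labs-dev
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: labs-dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
With that file stored in version control, recreating the environment, or spinning up a second identical one, is a single command rather than a call to a specialist:
oc apply -f environments/labs-dev.yaml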
Some years later on another engagement for a public sector client, I saw
similar behavior. I had thought that creating servers in this way was an isolated
case, but on the government contract, there were teams spinning
up servers for the developers to use that were not using any scripting
or automation. If you wanted a server, you raised a ticket and waited a
week. If you wanted an exact copy of that one, you raised another ticket
and sometimes received one that was identical. In this case, the team was
manually executing shell commands inside each VM and more often than
not forgot to run a command or two!
These examples may feel a bit old now – but the reality is that I still see
organizations with a traditional approach to infrastructure, automation, and
repeatability. Not being able to test changes on representative hardware can
be a challenge for teams trying to go fast. Teams need to have the power to
spin up and spin down application stacks on demand. Modern approaches
to how we package applications, such as containers, can really help to bring
down this wall. No longer does a developer need to stub out test cases with
database calls, because they can just spin up a real database in a container
and test against it.
You can learn more and collaborate about the everything-as-code practice by going to
the Open Practice Library page at openpracticelibrary.com/practice/everything-as-
code.
So, what approach did the PetBattle team take while practicing everything-as-code?
Now that we have established PetBattle's technical foundation, let's explore Jenkins a
little more and the role it can play in strengthening foundations.
Helm Overview
This next section is going to get a bit more detailed on the technical side of things.
Prepare for some code snippets and whatnot! If this is not your thing, feel free to skip
over it to the next section all about Git and developer workflows. We'll mark any section
that's going to have code snippets and be a bit lower level with this handy sign!
Jenkins comes with OpenShift, and there are several ways
for the team to install and configure it. Any member of the
cross-functional team could go to the OpenShift console
and install it from the catalog. It's as simple as clicking a
few buttons in the UI and choosing to add a persistent
hard disk or not. This is a great way for the team to get
moving fast, but it would not honor our technical
foundation practice of everything-as-code!
PetBattle now has two choices for how to create
an instance of Jenkins while honoring our everything-
as-code practice: OpenShift templates or Helm
charts containing all of the Kubernetes and OpenShift objects that would be
required to deploy a working Jenkins. For the purposes of this book, we will focus
exclusively on Helm as our Kubernetes package manager.
Helm is an application package manager for Kubernetes that allows both developers
and operators to easily package the resources and configuration that make up an
application into a release. Helm is used for application life cycle management for
installing, upgrading, and rolling back application deployments, thus simplifying
the installation of an application on an OpenShift cluster. In Helm, applications are
packaged up and distributed as Helm charts. A Helm chart is made up of several YAML
files and templates. These Helm templates should output Kubernetes YAML once
processed. Let's take a look at an example Helm chart.
From our experience using Jenkins on OpenShift with customers, we have written a
chart to deploy the Red Hat instance of Jenkins and give it a few superpowers. We'll
look at those afterward. Let's first explore the anatomy of a chart:
jenkins
├── Chart.yaml
├── README.md
├── templates
│   ├── PersistentVolumeClaim.yaml
│   ├── buildconfigs.yaml
│   ├── deploymentconfig.yaml
│   ├── imagestreams.yaml
│   ├── rolebinding.yaml
│   ├── route.yaml
│   ├── secret.yaml
│   ├── serviceaccount.yaml
│   └── services.yaml
└── values.yaml
The Jenkins chart, like all Helm charts, is made up of a number of YAML files:
• Chart.yaml: This is the manifest of our Jenkins chart. It contains metadata
such as the name, description, and maintainer information. The manifest also
contains the application version and the version of the chart. If the chart has any
dependencies on another chart or charts, they would also be listed here.
• README.md: Instructions for the chart, how to install it, and how to customize it.
• templates/*: This folder contains all the resources that need to be deployed to
install and configure a running Jenkins instance, such as deployments, services,
routes, and pvc.
• values.yaml: These are the sensible (default) values that the chart can be
run with so a user can just install the chart and get up and running quickly.
Customizations to these values can be supplied on the command line or by
supplying your own values.yaml file when installing a chart.
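As an illustration, a minimal Chart.yaml might look something like the following. The description and maintainer shown here are placeholders rather than the chart's actual metadata; only the chart name and the 0.0.23 version referenced later are taken from the text:
apiVersion: v2
name: jenkins
description: Jenkins with sensible OpenShift defaults
version: 0.0.23
maintainers:
  - name: Red Hat Community of Practice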
Figure 6.15: Searching the helm repository for the Jenkins chart
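If you have not already added the Red Hat Community of Practice chart repository to your local Helm client, you can add it and search it with commands along these lines (the repository URL is the same one referenced later for the tooling charts):
helm repo add redhat-cop https://redhat-cop.github.io/helm-charts
helm search repo jenkins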
Other OpenShift clusters you have access to may also work, as long as you have
sufficient resources and privileges. The CodeReady Containers (CRC) install gives you cluster
administrator privilege (the highest level of privilege) and is limited by how much RAM,
CPU, and disk space your laptop has. We recommend at least 8 GB of RAM, 4 vCPUs, and 31 GB of
disk space; in practice we give CRC a little extra memory headroom, starting it on Linux with:
crc start -c 4 -m 12288
oc login <cluster_api> -u <name> -p <password>
oc new-project example
helm install my-jenkins redhat-cop/jenkins
The string after install, my-jenkins, is the release name that is used by the Helm
template engine. It allows us to create multiple releases in the one namespace, which is useful for testing
purposes.
Helm charts can be installed in a number of ways. You can also run helm template
against a local copy of the chart. If you are interested in doing this, fetch the chart and
run this:
helm fetch redhat-cop/jenkins --version 0.0.23
helm template test jenkins-0.0.23.tgz
This can be useful if you want to play around and see what the output is before applying
it to an OpenShift cluster or if you want to validate things while debugging or testing
a chart's configuration. You can also supply --dry-run to the helm install command to
verify the chart before installing it.
Let me just pause a minute and say that this is not a book about Helm! There are great
books out there written specifically for it, such as Learn Helm (https://www.packtpub.
com/product/learn-helm/9781839214295) by Andy Block and Austin Dewey. Our aim
is just to scratch the surface to show how easy it is to get going in a reusable and
repeatable way with Helm and OpenShift.
Using helm install as demonstrated previously is great as it will create a life cycle
managed by the Helm CLI to run upgrades and roll back releases if needed. These
revisions are integrated into OpenShift and can be viewed in the UI or on the command
line. Every time a new revision is deployed to the cluster, a new secret will be created,
making rollback very simple:
oc get secrets -n example | grep helm
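For example, assuming the release and namespace names used above, you could inspect the revision history and roll back with commands along these lines:
helm history my-jenkins -n example
helm rollback my-jenkins 1 -n example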
To see all the pods being spun up by the Jenkins chart, you can run this:
oc get pods --watch -o wide -n example
You should see a large volume of pods being created – this is because this Helm chart
contains lots of additional configuration-as-code for Jenkins. Write once and deploy
many times!
You may notice a bunch of agent build pods in the output. Jenkins by itself is a bit
useless. One of Jenkins' superpowers is its ability to be extended using what are called
plugins – small bits of code that provide new functions and features. To install these
plugins, we could wait until Jenkins is deployed and configure the plugins manually
through the UI – but this is the everything-as-code world, so we don't want to do that!
The Jenkins Helm chart is configured to pre-install a bunch of useful Jenkins agent
plugins. These agents know how to build container images using various language-
specific stacks. The configuration for the agent plugins is defined in the Helm chart's
values.yaml file, which you can see by using this:
helm show values redhat-cop/jenkins
buildconfigs:
  # Jenkins agents for running builds etc
  - name: "jenkins-agent-ansible"
    source_context_dir: "jenkins-agents/jenkins-agent-ansible"
    source_repo: *jarepo
    source_repo_ref: "master"
  ...
The Helm chart is defining a list of build configurations to build each agent image.
The Jenkins agent images use an OpenShift project called Source-to-Image (S2I) to
do their language-specific build of your applications. S2I is a toolkit and workflow for
building reproducible container images from source code; you can read about it here:
https://github.com/openshift/source-to-image. You basically feed S2I your source
code via a Git repository URL and it takes care of the rest.
Using language-specific agents makes Jenkins easier to extend. We do not have to
install tools into the base Jenkins image; rather, we define an agent and add it to the
Helm chart agent plugins values list. OpenShift makes it very easy to create agents for
Jenkins. We can extend the base image with any binary we want to use in our pipelines
and apply the label role=jenkins-slave to make it discoverable in Jenkins. This gives
us a near "serverless" ability for Jenkins to dynamically provision an agent when it's
required. In this case, a pod gets launched and Jenkins will connect to it, execute its
tasks, and destroy it when it's done. This means no agents lying idle waiting for
work and a clean slate every time we run a build.
There are a bunch of Jenkins agents available in the CoP; you can use them or create
your own: https://github.com/redhat-cop/containers-quickstarts/tree/master/
jenkins-agents.
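As a hypothetical example, if you had built a custom agent image into an ImageStream called my-custom-agent in the project where Jenkins is running, applying the label is a one-liner (the image name and namespace here are made up):
oc label imagestream/my-custom-agent role=jenkins-slave -n example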
Apart from the agent plugins, the Jenkins image is extensible from the base image in a
number of different ways. You can specify a list of plugins to install when you build the
Jenkins image. We use S2I to build our Jenkins image and add our list of plugins.txt
from this Git repository: https://github.com/rht-labs/s2i-config-jenkins.
Once the Jenkins build has completed, a Jenkins deployment and running container
instance will be available.
All of the S2I plugins and agents are configured. You can log in to Jenkins using its
route, which is available in the OpenShift web console, or by running this:
oc get route jenkins
By running this single helm install command, we get a sensible starting point to
be able to do lots of things with our build server, Jenkins. By codifying the Jenkins
configuration, we can repeatedly deploy Jenkins into many environments without ever
having to touch the Jenkins UI.
Now that we have our build server, before starting development we should familiarize
ourselves with the types of code workflows developers use. If you are an experienced
developer, you will already be pretty familiar with the next section's content.
Developer Workflows
Git is a version control system (VCS) created by Linus Torvalds (author of the Linux
kernel) to track changes in source code and easily manage these changes across many
file types and developers. Git differs from other VCS in that it is decentralized. This
means that unlike, for example, Subversion (svn), each developer retains a complete
copy of the source code locally when they check it out. Locally, each developer has
a copy of all the history and can rewind or fast forward to different versions as they
need to. An engineer makes their changes and applies those changes as a delta on top
of another's work. This is known as a commit. Git can be conceptualized as a tree, with
a trunk of these changes or commits stacked on top of each other. Branches can spring out
from the trunk to hold independent pieces of functionality or work that is not yet ready,
and can be merged back into the trunk later. Once something is committed to Git, it is forever in the
history and can always be found – so be careful not to add something secret, such as a
password, by mistake!
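For readers newer to Git, the concepts above map onto a handful of everyday commands; the branch name below is just an example:
git clone https://github.com/petbattle/ubiquitous-journey.git   # full local copy, history included
git checkout -b my-feature                                      # branch off the trunk
git add . && git commit -m "Describe the change"                # record a delta as a commit
git log --oneline --graph                                       # walk backward and forward through the history
git push origin my-feature                                      # share the branch with others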
Git is the underlying technology behind products such as GitHub and
GitLab, which take Git and add social features and issue-tracking
capabilities on top to help manage a code base.
There are many workflows for Git that development teams can use when writing code,
and choosing the correct one can seem like a daunting task. Some are designed to give
teams a sense of safety and security, especially in large complex projects, while others
promote speed and trust within the teams. The most popular source code management
workflows for Git are Trunk, GitFlow, and GitHub Flow. Let's explore each in detail and
see how we could use them to help us promote CD.
GitFlow
GitFlow was first published about 10 years ago by Vincent Driessen. The workflow was
built from his experience using Git, a relatively new tool at the time. As teams moved
to Git from a non-branching-based code repository, some new concepts and core
practices had to be defined. GitFlow tried to answer this by adding a well-thought-out
structure to branch names and their conventions.
A well-defined branching strategy is at the heart of GitFlow. Changes are committed
to differently named branches depending on the type of change. New features are
developed on feature-* branches, hotfixes-* branches are created for patching
bugs in production, and release branches are used to prepare releases. GitFlow also
describes two reserved and long-living branches:
• Master: This is the branch that contains our releases or our production-ready
code. Sometimes this branch is referred to as the main branch.
• Develop: This branch is our integration branch. It is usually the most turbulent
and very likely to contain bugs or other issues as it is the place where teams first
bring their code together.
The naming and usage conventions defined in GitFlow make it easier for a new
developer to discover what each branch is for. A developer can bring changes
made by other team members into their feature branch whenever they choose,
simply by merging them in. Branching in this way avoids breaking things for other
engineers, because the feature functionality is complete before a developer asks to merge
their code from the feature branch into the develop branch. When a set of features is ready to
be promoted to the mainline master branch, the developers merge their code to
master via the release branch.
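A typical GitFlow sequence, with illustrative branch names, might look like this:
git checkout develop
git checkout -b feature-login            # develop the feature in isolation
git checkout develop
git merge feature-login                  # integrate the finished feature
git checkout -b release-1.2 develop      # cut a release branch for stabilization
git checkout master
git merge release-1.2                    # promote the release to master
git checkout -b hotfixes-1.2.1 master    # patch production directly if needed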
You may be reading this and thinking, this sounds complex! And in some ways, it is. But
in a large project with a single code base, this can be exactly what is required to ensure
developers are free to work on their features without constantly stepping on each other's changes.
GitHub Flow
GitHub Flow is similar to GitFlow in that it shares some of the same words in its name.
Branching is a core pillar of Git, and GitHub Flow uses this by keeping one long-lived
branch, that is, the main or master branch. Developers then work in branches off main,
where they can commit changes and experiment without affecting the main branch.
These could be feature branches like in GitFlow, but there is no naming convention to
be followed. It is important to name the branch sensibly using a descriptive name, such
as sign-up-form or refactor-auth-service. No branches called another-new-feature-
branch, please!
At any point in time, the developer can raise a pull request, where other engineers can
discuss the code, its approach, and its design by providing feedback on the changes
while they are still in progress. The original author can then incorporate this discussion into
the software. When the team is happy and the code is reviewed, the changes can be
approved and merged to the main branch.
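In command form, and reusing the sign-up-form branch name from above, GitHub Flow is little more than:
git checkout -b sign-up-form main
git commit -am "Add sign-up form"
git push -u origin sign-up-form
# open a pull request, discuss and iterate, then merge to main once approved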
GitHub Flow is great at promoting peer review of work and transparency
in how a decision was made. Git by its nature is searchable, and the discussion on a
merge request provides valuable insight and traceability into how architectural and
coding decisions were made.
Trunk-Based Development
Both GitHub Flow and GitFlow use branching strategies and merging patterns to bring
independent development activities together. Branching in Git is made extremely easy.
However, when merging all of the branches together, conflicts can still occur that
require human intervention.
Depending on the frequency of this branching, some teams end up in merge hell, where
everyone tries to bring their changes in at once, leading to the complex and often
frustrating event of trying to unpick all the changes that were made while maintaining a
working code base!
Trunk-based development (https://trunkbaseddevelopment.com/) takes a somewhat
different approach to this particular problem by saying no to branches!
In a large-scale enterprise, this constant merging to master sounds like it could create
headaches. How, for example, do you do a peer review of the code? For large-scale
application development with many engineers and many teams, it is suggested that
very short-lived feature branches can be a great help. They provide decision logs during
the review process, but the key here is short. Short-lived feature branches should only
be alive for a day or two at most (definitely no longer than a sprint) and are deleted
once the code is merged to prevent them from becoming feature release branches.
Short-lived feature branches are a great way to help with some of these concerns.
Developers can work away on small chunks in an isolated way and still merge
frequently. Short feedback loops are key to improving software delivery metrics. If
branches add time to this loop, how can we tighten it further? Peer reviews can often
be a burden to teams by creating a dependency on one individual or breaking the focus
of another engineer in order to complete a feature. By pairing engineers, you gain
implicit peer review. Pushing changes as a pair straight to the trunk is a great way to
achieve speed. In a container ecosystem, you only want to build once and verify your
application is working before deploying it to many places. Trunk-based development
underpins this by encouraging frequent small changes pushed straight to the head,
where CI and CD can then take over.
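In its purest form, trunk-based development reduces to a very short loop; the commit message below is, of course, illustrative:
git checkout main
git pull
# pair on a small change and run the tests locally
git commit -am "Small, tested change"
git push origin main        # CI builds, verifies, and deploys from here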
From our experience in kickstarting product teams with varying skill sets, choosing the
right workflow should be seen as more of a pathway, a sliding scale from immature to mature
teams. Teams that are new to Git may find the use of feature branches a comforting
way to not step on the toes of other developers. The book Accelerate3 measured the
software delivery performance of many teams and concluded that high-performing
teams use trunk-based development.
No matter what you choose as your approach to managing code, the key here is the
frequency of delivery. How long will it take you to get software into the hands of your
end users? Does having feature branches slow you down? Or do those branches provide
you with a safe place for your team to start? As the team matures and becomes more
familiar with each other and the tools, your software output can increase.
The big call to action here is to let the teams choose the way that works best for them
and build the automation around the workflow and tools. This allows the developers to
focus on the hard stuff – writing code, not managing it. Initially, this will just be
a guess. Teams should use retrospectives to assess whether things are working or not
and evolve accordingly. It's important to not set out one dogma to fit all development
activities across all teams because every team is going to be different. One shoe size is
not going to fit everyone!
3 https://itrevolution.com/book/accelerate/
Conclusion
In this chapter, we learned that we could get off to a great start by being green from
go! By automating the deployment of our application build and packaging tools, Jenkins
and Helm can establish a technical foundation that will allow our teams to integrate
continuously (CI) and continuously deploy (CD) our code to production.
We learned that we can align our developer code workflow across our team and begin
to iterate on our CI/CD pipelines to help us deliver applications faster. We can increase
code quality and understanding by pairing developers together to help shorten the
code review feedback loop.
As a team, we learned all of these new skills and techniques together by trying mob
programming and, in the process, said goodbye to our love for Unicorn developers.
In the second half of Open Technical Practices, we will learn about the bigger picture,
discover what GitOps is all about, vastly improve our code quality through testing, and
finish off with some lessons about our emerging architecture.
7
Open Technical Practices — The Midpoint
In this chapter, we are going to build on the foundational technical practices that we
started in the previous chapter. We will acquire a shared understanding of our software
delivery pipeline using the Big Picture practice. Even the less technical team members
will be able to follow what happens to our software as it is being written and delivered.
We will then explain a technique that allows DevOps teams to deliver software changes
using Git as the driving tool. The practice of GitOps leads to greater visibility of changes
within our system, allowing the team to debug and resolve issues faster. We will explore
how to improve our code quality through test automation and conclude this chapter by
asking the question How do we know if our architecture is good?
This chapter will cover the following topics:
• The Big Picture
• GitOps
• Testing
• Emerging architecture
The Big Picture
A Big Picture can easily be created with just some stickies and a clear board or space.
Of course, if you're feeling more artistic, it can also be doodled!
You may be reading this and thinking Sounds fluffy to me – why should I bother to make
one? Here's why:
• Shared understanding: When the whole team collaborates around making the
Big Picture, they get a shared sense of how their pipelines connect code to users.
• Prototype quickly: It's cheaper to write and draw before implementing a single
line of code! Rapidly prototype with some markers and Post-its, moving the stages of
your pen-and-paper pipeline around as you go.
• Complexity simplified: The Big Picture helps bring non-techies into the mix by
showing them the components required to manage the software life cycle. Build
it up one step at a time to demonstrate the complexity in a simple visual flow.
• Information radiator: Like all these practices, the Big Picture is an evolving
artifact. As the complexity of a software delivery pipeline grows, the Big Picture
should be updated to reflect this. It is a graphic that can be displayed to all and
should not be hidden.
Big Pictures can also be drawn using online collaboration tools. We used Miro to draw
the following digital Big Picture online.
The material required for this practice is fairly simple: some stickies, marker pens,
painters' tape, and a big blank wall or canvas are all that's required, and these are fairly
common things to have in our kit bags! There are a number of simple steps to follow in
creating your Big Picture, but let's use our PetBattle example to show how a team might
use this in practice.
They draw a box to the left side to represent their local development
environment. This is their laptop for now, but it could also be a cloud-hosted
IDE that the development team could write their code in that is deployed
inside the cluster. One such product, CodeReadyWorkspaces, is a cloud-
hosted IDE that runs in the OpenShift cluster that could be of great use to
the team. Using an IDE like this allows us to further our everything-as-code
practice by providing developers with their coding environment as a code
artifact.
Next, they slice up the OpenShift cluster into smaller boxes. Each of these
represents the OpenShift projects (or Kubernetes namespaces). We can
think of these projects as rooms that separate one collection of applications
from another. To keep things simple, the team decides on four namespaces
initially:
• Dev: A sandbox project for the dev team to validate their app or get fast
feedback from.
• Test: A project to deploy all our applications to and run our system tests
against.
• Production: The project that PetBattle's customers will use to access the
applications once they've cleared our tests.
• CI-CD: The project that houses all the tooling that supports Continuous
Integration (CI) and Continuous Delivery (CD).
With the OpenShift cluster logically sliced up into the projects the teams will
use, the team draws the tools they will use in each project.
PetBattle will use GitHub to store its source code. When building images,
it is likely the team will need to store the built image internally on the
OpenShift cluster using the internal registry. The team also wants to make
their images available externally and so have decided to also make use of
Quay.io, an external registry hosted in the public cloud.
The team then starts to add the tooling they will use to create their pipelines
in their CI/CD namespace. They use more stickies to draw the tools and add
a one-liner definition of what each tool is or how they will use it.
For example, the team is going to use Jenkins for their build and test
automation. To store and cache application build dependencies and
artifacts, the team opted to use the open-source artifact repository called
Nexus. For Nexus, they add a simple one-liner to highlight the fact that it
is used to house their software artifacts as well as their Helm repository.
Shared understanding is key here, so it's important for the team to make
sure everyone is aware of what the purpose of each item is – this includes the
product owner, designers, and all other interested parties. They don't need
to be experts, but having an understanding of what the tools are used for
can help them establish better empathy with the development team and see
for themselves all the things needed to be able to ship code so quickly to
users.
With some of the tools in place on the Big Picture, the PetBattle team can
now start to implement the design they've put in place.
The Big Picture can be created in a physical room with lots of colorful sticky notes
or with everyone distributed using a tool such as Mural, Miro, PowerPoint or Google
Slides. We have provided a useful template with all the icons we use which should help
you get started. You can download this from the book's GitHub repository.
The Big Picture allows us to get a shared understanding and team alignment around
the use of technical tools at a high level. Like all the practices we put in place in this
book, the Big Picture is not a one-time thing. The Big Picture is a tool we will revisit and
enhance as we add more complexity to our architecture and begin implementing our
pipelines. We will continue to explore the Big Picture in Section 6, Build It, Run It, Own It.
You can learn more about, and collaborate on, Big Picture practices by going to the
Open Practice Library page at https://openpracticelibrary.com/practice/the-big-
picture/.
GitOps
Up to this point, we've talked about Git and the developer workflows available to our
teams. We've spoken about everything-as-code, from infrastructure to tooling to
applications all along the stack. Now, let's tie this all together with GitOps.
GitOps sounds a bit like a buzzword, as DevOps was when it was first coined. In fact,
we heard someone describe it to us as DevOps for the year 2020. GitOps is a simple
process of managing all of your systems, environments, and applications via Git. Git
represents the single source of truth for all your applications, your tooling, and even
your clusters. Changes to any of these things can be pull requested and discussed
before an automated process applies them.
The difference between infrastructure-as-code (IaC) and GitOps is the approach
to managing the configuration. IaC is agnostic to where you store the configuration;
it could be on a flash drive in your drawer or it could be a shared drive in the cloud.
GitOps, as the name suggests, means storing the full system specifications in Git.
The same principles hold true for IaC and GitOps – ideally, every action should be
idempotent. Every action or operation can be applied multiple times, producing the
exact same result. This is a very useful property in many situations, as it means that an
operation can be repeated or retried as often as necessary without causing unintended
effects. Configuration should be created declaratively. That is to say, you write the
configuration to describe the desired state of an application or set of apps.
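As a small, hypothetical example, the desired state of an application is written declaratively, and applying the same file twice leaves the cluster unchanged (an idempotent operation). The deployment name and image below are illustrative, not PetBattle's real manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pet-battle
  namespace: labs-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pet-battle
  template:
    metadata:
      labels:
        app: pet-battle
    spec:
      containers:
        - name: pet-battle
          image: quay.io/example/pet-battle:latest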
GitOps can be seen as a developer-centric approach to Ops. It teaches developers
good practices around taking ownership of code once it leaves their machines and the
approach to deploying and monitoring this code once it's running.
As developers, we hate repeating ourselves, so much so that we even have an acronym
for it – DRY = don't repeat yourself! When encountering something that needs to be
done more than once, our first instinct should be to try to automate it. Once something
is automated or repeatable, the next step is simple. Check it into Git so that it can be
audited, shared, and managed.
ArgoCD
On its website, ArgoCD is described as a tool that:
automates the deployment of the desired application states in the specified target
environments. Application deployments can track updates to branches, tags, or be pinned
to a specific version of manifests at a Git commit.1
When something is seen as not matching the required state in Git, an application becomes
out of sync. Depending on how you have implemented your GitOps, ArgoCD can then
resync the changes to apply whatever is in Git immediately or fire a warning to initiate
some other workflow. In the world of Continuous Delivery as implemented by ArgoCD, Git
is the single source of truth, so we should always apply the changes as seen there.
What types of things can ArgoCD apply? ArgoCD recognizes traditional Kubernetes
YAML, Kustomize,2 Helm, and all sorts of other things. Unlike Helm, which relies
heavily on templating, Kustomize takes plain YAML files and emits customized YAML in a
template-free, declarative way. You can patch Kubernetes resources and use folder-
based structures to apply what is termed an overlay, or YAML override,
leaving the original YAML untouched.
a little bit of Kustomize where appropriate.
ArgoCD is another tool (and there are others like it, such as Flux) in the long list of tools
that we need to be able to implement CI and CD. Unlike Jenkins, which we could also
use to manage our application deployments, ArgoCD is specialized and very good at
managing and maintaining just our deployments.
1 https://argo-cd.readthedocs.io/en/stable/
2 https://github.com/kubernetes-sigs/kustomize
Jenkins could apply our Helm charts in a done once and finish sort of way. It doesn't
have the capability to keep watching our Kubernetes resources to ensure the desired
state in Git stays that way in our clusters. If someone decides to change something
in the cluster, for example, add a new environment variable to a running application,
ArgoCD will detect that change and overwrite it. This means no more one-of-a-kind
deployments or manual tweaks once they're deployed.
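For instance, assuming an ArgoCD-managed application called my-app deployed in labs-test (both names made up for illustration), a manual tweak shows up as drift that can be inspected and synced away; with automated sync and self-heal enabled, ArgoCD reverts it on its own:
oc set env deployment/my-app GREETING=hello -n labs-test   # a manual, out-of-band change
argocd app diff my-app                                     # the application is now OutOfSync
argocd app sync my-app                                     # re-apply the state stored in Git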
ArgoCD enables teams to enforce this golden rule – if it's not in Git, it's not real. This
is perfect for audit tasks – all you have to do is check the Git log to see who committed
and pushed the code.
to sort the issue, with everyone jumping on a call and mobbing around
the problem. Rewinding what could have changed in the past hour, the
only thing that had executed was the cleanup job we had written. We
immediately thought we'd written something incorrectly with our job and so
went to debug it, but it was fine. The next step was to look more closely at
the build and the failure message.
It gave rise to a mantra we added to our social contract from Chapter 4, Open Culture.
Implementing GitOps
Let's build the big picture with some real working code! In
this section, we are going to take a technical detour!
Prepare for some code snippets. If this is not your thing,
feel free to skip over it to the next section all about testing!
We'll mark any section that's going to have code snippets
with this handy sign.
Let's explore ArgoCD and create the components of our Big
Picture from code. To do this, we will first explore a sample
project that can be used as a starting point for development.
At Red Hat Open Innovation Labs, we have automated the bootstrap of Labs Residency
CI-CD tooling to accelerate setup and onboarding. The code repository is called
Ubiquitous Journey, so it makes sense for us to start here. We will explore this
repository and set up our technical foundation using it. In later sections of the book,
we will extend it with new technology and tools. This repo is available on the PetBattle
GitHub organization – https://github.com/petbattle/ubiquitous-journey.
The first task we would normally perform on our OpenShift cluster when deploying
Jenkins is to create a new project using the command line. We could follow this manual
approach again, adding in role bindings and quotas for our project, and repeat these
steps for each bit of our Big Picture. But let's do it in a way that honors our everything-
as-code practice.
From your laptop, fork the sample project and open it up in your favorite code editor.
We are going to make changes to our project so maintaining your own copy of it is
necessary for GitOps. From here on out, when we encounter a new repo, you'll probably
find it easier to fork it so you can make changes to it. For the purposes of the book
going forward, we will continue using the PetBattle organization, so feel free to substitute
your own organization or user.
git clone https://github.com/petbattle/ubiquitous-journey.git
The Ubiquitous Journey project is broken down into two main components (some of the
files are removed from the breakdown below for simplicity), Bootstrap and Ubiquitous
Journey. If you're wondering why we named the project Ubiquitous Journey… well, we
didn't! We hit the generate random name button on GitHub and this is what it chose
for us. As is the case with most things in software, naming things is hard! We did plan
on renaming the repo at some stage, but now the name has kind of stuck and we like it!
$ tree ubiquitous-journey
ubiquitous-journey
├── argo-app-of-apps.yaml
├── bootstrap
│   ├── charts
│   ├── Chart.yaml
│   └── values-bootstrap.yaml
├── docs
├── ...
├── README.md
└── ubiquitous-journey
    ├── Chart.yaml
    ├── templates
    │   ├── argoapplicationdeploy.yaml
    │   └── _helpers.tpl
    ├── values-day2ops.yaml
    ├── values-extratooling.yaml
    └── values-tooling.yaml
The bootstrap folder contains a Helm chart definition with a values-bootstrap.yaml file
and Chart.yaml manifest. There are no templates for this chart because it's actually just
a wrapper for other Helm charts. If we look at the Chart.yaml manifest, we can see that
it has a dependency on the ArgoCD chart, another called bootstrap, and a helper chart
called sealed-secrets. The bootstrap folder Helm chart is acting as a wrapper chart,
allowing us to control the variables we pass to these dependencies. In this case, our
variables are stored in the values-bootstrap.yaml file:
bootstrap-project:
  enabled: true
  ci_cd_namespace: &ci_cd "labs-ci-cd"
  pm_namespace: &pm "labs-pm"
  ops_namespace: &ops "labs-cluster-ops"
  dev_namespace: &dev "labs-dev"
  test_namespace: &test "labs-test"
  staging_namespace: &stage "labs-staging"
  bindings: &binds
    # this labs-devs is the GROUP NAME in IDM
    - name: labs-devs
      kind: Group
      role: edit
    # this labs-admins is the GROUP NAME in IDM
    - name: labs-admins
      kind: Group
      role: admin
    - name: jenkins
      kind: ServiceAccount
      role: admin
      namespace: *ci_cd
  namespaces:
    - name: *ci_cd
      bindings: *binds
    - name: *pm
      bindings: *binds
    - name: *ops
      bindings: *binds
    - name: *dev
      bindings: *binds
    - name: *test
      bindings: *binds
    - name: *stage
      bindings: *binds
The bootstrap chart is responsible for creating the listed projects in our OpenShift
cluster. In the example, these are labs-ci-cd, labs-dev, labs-test, labs-staging, labs-pm,
and labs-cluster-ops. Dev, Test, Staging, and CI/CD will hopefully be self-explanatory;
if not, take a look at the previous chapter, where we discussed CI/CD in depth. The
labs-pm namespace is for deploying other project management tools (for example,
collaboration tools such as etherpad). The labs-cluster-ops namespace is used for
operational jobs and tasks.
Resources in OpenShift have role-based access control (RBAC) applied.3 RBAC
determines whether a user is allowed to perform a given action within a project. We
bind the listed user groups to the service accounts within these projects. Don't worry
if your cluster does not have the labs-devs and labs-admins groups set up right now. It is
enough if you are logged in to your cluster with a user who has cluster admin privilege.
3 https://docs.openshift.com/container-platform/4.6/authentication/using-rbac.html

argocd-operator:
  enabled: true
  name: argocd
  namespace: *ci_cd
  argocd_cr:
    applicationInstanceLabelKey: petbattle.app/uj
  operator:
    version: argocd-operator.v0.0.14
    channel: alpha
    name: argocd-operator
The second part of this file overwrites some variables in the ArgoCD chart. This Helm
chart installs the ArgoCD operator and configures it with sensible defaults. For a list
of all the possible variables that could be passed to this chart, you can check out the
Operator Docs for ArgoCD – https://argocd-operator.readthedocs.io/en/latest/.
There is no point in recreating those docs in this book, but it's useful to have them
saved if you want to do some exploring.
It is worth calling out the applicationInstanceLabelKey variable. This needs to be
unique for your cluster. If you deploy more than one instance of ArgoCD to a cluster
with the same instance label, the two ArgoCD instances will try to manage the same
resources and then they'll fight over who actually owns them and get you into a world
of pain, so make sure the applicationInstanceLabelKey is unique!
Let's deploy this setup and see what it gives us. If you wish to change the names of the
projects that are created, you can edit the values file, but for now we'll use the defaults.
In a terminal on your laptop, try the following command:
$ helm template bootstrap --dependency-update -f \
bootstrap/values-bootstrap.yaml bootstrap
Running a Helm template like this should bring down our chart dependencies and
process our templates. This can be a handy way to validate that the YAML file looks as
we expect. Let's install the bootstrap Helm chart into its own namespace.
$ helm upgrade --install bootstrap-journey \
-f bootstrap/values-bootstrap.yaml \
bootstrap --create-namespace --namespace labs-bootstrap
The output of this command should be a successful installation of the bootstrap Helm
chart. If you have a look in the OpenShift web console, you should see the topology with all the components of
ArgoCD.
Let's take a look at ArgoCD by clicking its link in the console, or you can get the URL from
the command line using the following command:
oc get routes argocd-server -n labs-ci-cd
Log in with your OpenShift credentials. We should see an empty ArgoCD instance.
At this point, we should ask ourselves the question: what happens when someone
changes the bootstrap configuration values for our cluster, for example, to add more
projects or change the roles or groups? Can we do this in an automated and tracked
way, in other words, using GitOps? Fear not, ArgoCD to the rescue! We can now point
ArgoCD to the Git repository we've been working on.
We can create an ArgoCD application from the ArgoCD web interface by selecting
+New App -> Edit as YAML and copying and pasting the following definition:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap-journey
  namespace: labs-ci-cd
spec:
  destination:
    namespace: labs-bootstrap
    server: https://kubernetes.default.svc
  project: default
  source:
    helm:
      parameters:
        - name: argocd-operator.ignoreHelmHooks
          value: "true"
      valueFiles:
        - values-bootstrap.yaml
    path: bootstrap
    repoURL: https://github.com/[YOUR FORK]/ubiquitous-journey.git
    targetRevision: main
  syncPolicy:
    automated: {}
Hit Save, followed by Create. You should see the bootstrap-journey application synced.
We can also create the same application on the terminal. You can log in using single
sign-on to OpenShift from the terminal using this one liner. It requires a terminal that
is not headless, in other words, it can connect to your screen and browser:
$ argocd login $(oc get route argocd-server --template='{{ .spec.host }}' \
-n labs-ci-cd):443 \
--sso --insecure
Create the new app and sync the changes. With this in place, ArgoCD will now actively
track changes to our Git repository and roll them out for us! Simple!
argocd app create bootstrap-journey \
--dest-namespace labs-bootstrap \
--dest-server https://kubernetes.default.svc \
--repo https://github.com/[YOUR FORK]/ubiquitous-journey.git \
--revision main \
--sync-policy automated \
--path "bootstrap" \
--helm-set argocd-operator.ignoreHelmHooks=true \
--values "values-bootstrap.yaml"
You can select the application in the web interface to drill down into it.
Excellent – we are on our way to completing our Big Picture as code and laying down
our technical foundation! We've created the projects and added the first tool, ArgoCD,
to our kit bag. Now, let's take it a step further and fill our cluster with some of the
applications we think would be initially useful for building out CI/CD pipelines. At the
beginning of any project, this will usually be a best guess. As we start to build out the
product, we must continuously evolve the toolset we use. This is not a one-time process;
it's a set of tools that need to be extended when required or trashed if no longer useful.
The important thing here is to ensure that things are deployed in a repeatable way.
Let's add some tooling. Open your editor on the ubiquitous-journey project. Inside
ubiquitous-journey/values-tooling.yaml, we have some useful variables referencing
Helm charts ready for us to pick from, including Jenkins, which we manually deployed
previously!
##############
# 🛎 Argo App of Apps declaration
#############
# enabled true on an app is to tell helm to create an argo app cr for this item
# Custom values override the default values in Helm Charts
applications:
  # Nexus
  - name: nexus
    enabled: true
    source: https://redhat-cop.github.io/helm-charts
    chart_name: sonatype-nexus
    source_path: ""
    source_ref: "0.0.11"
    sync_policy: *sync_policy_true
    destination: *ci_cd_ns
    ignore_differences:
      - group: route.openshift.io
        kind: Route
        jsonPointers:
          - /status/ingress
  # Jenkins
  - name: jenkins
    ...
  # Sonarqube
  - name: sonarqube
    ...
The layout of this file is simple. For each item in the applications array, it expects to
find a Helm chart or a reference to a Git repository with some Kubernetes YAML (or
Kustomize) at a particular version.
When using Helm, any overrides to the defaults supplied by the chart can be added
here, but for the Nexus chart shown, we are using the default values, so there is no
need for value overrides for Nexus. There are other fields for each application, and
these are mostly related to the operation of ArgoCD. For example, you can configure
the application synchronization policy – sync-policy – which tells ArgoCD to always
keep your application synced when set to automatic. The destination namespace may
be specified. With some Kubernetes and OpenShift API objects, ArgoCD needs to
be asked to ignore differences it finds; this is particularly true when controllers and
operators write back the status and other fields into the objects themselves. We have
found over time that each release of ArgoCD lessens the need to specify these ignores
as the generated differences are taken care of automatically.
The other important field for each application entry is the enabled: true | false flag –
it's easy to run down the list and enable the tools we know we need straight away. For
now, we are going to start with just four tools: Jenkins, Nexus, Tekton, and CodeReady
Workspaces. These are the bare bones for scaffolding our application and pipelines.
At this point, it is worth mentioning the other two values files, extratooling and
day2ops:
└── ubiquitous-journey
    ├── Chart.yaml
    ├── ...
    ├── values-day2ops.yaml
    ├── values-extratooling.yaml
    └── values-tooling.yaml
If you check the ArgoCD web page, you should now see these applications begin
to deploy and synchronize into your cluster. It will take some time for them all to
synchronize completely. Jenkins, for example, builds all of the default agent images that
we may need for running pipeline jobs.
Figure 7.18: The complete picture with all our tools installed
We have now successfully bootstrapped our CI/CD tooling! We will revisit these
configurations as we find we need to add and update the tools we need to develop, test,
and deliver PetBattle. By practicing everything-as-code, we can easily redeploy these
tools into any Kubernetes cluster, track changes we may make, and manage the life
cycle of the tools (upgrade them as their versions and features change).
Testing Testing Testing!
In the standard three-tiered testing triangle, the things at the bottom (the unit tests
tier) are the things we should do more of. Unit tests are the smallest
amount of code we can test within an application. These units should have little to no
dependency on other items and so when we run them, they give us immediate and
precise feedback. Unit tests should point us exactly to where the problem is in our
code. Moreover, the thinking here is that unit tests are cheap to write, easy to maintain,
and fast to execute. Therefore, we want more of them. This is why they sit at the base of
the testing triangle.
Service tests are sometimes seen as integration tests and are the next level up the
testing triangle. These are API tests that are validating the services within your
application behave as expected. This may include single service calls, as well as chains
of service calls, when one service calls another service. The width of each testing tier in
the triangle relates to how many tests of that particular type there should be in your code
base. According to the pyramid, we should have fewer of these service tests than unit
tests as they can be costly to execute.
The top tier of the testing triangle is reserved for User Interface (UI) tests, or end-to-
end system tests. These are responsible for validating that the system, as the sum of
its components and parts, is behaving as expected. Often, UI tests can be brittle in the
face of change, break more often, and require maintenance to keep them relevant, so
the rationale from the testing pyramid is that we should do fewer of these as they are
difficult to perform and provide less feedback for us.
Testing in Practice
The Test Automation Pyramid is a great place to start when thinking about your
own testing. As with all models and patterns, people have over-simplified some of its
original meaning. In fact, if you do an image search for the testing pyramid, you'll see
that most of the results are missing the most important keyword – automation! Often,
organizations lose sight of this, and they think doing manual testing for these tiers is
good enough.
Testing is important; in fact, it's critical to being able to deliver at speed! If you imagine
not investing the time into writing automated tests, it may be possible to complete a
sprint without breaking things. It's probable that we'd be able to do two sprints and
not break things. However, once we hit that third or fourth sprint, that's when your
software system starts to misbehave. Applications that were written in sprint one
now have bugs appearing in them because their functional behavior does not work as
intended. Functions and APIs that were thought to be working were, in fact, completely
broken! Being able to release software at speed is one thing, but being able to release
quality software at speed is the differentiator.
What is important when thinking about testing is to apply context. You don't have to
blindly follow a model such as the testing pyramid. In fact, it's a good place to start
from, but it's not a golden hammer to apply in all environments. For example, you
might be building a web app with static content or third-party services, so UI testing is
probably the most important thing.
What is important is to be sensible about the types of testing you're aiming to perform
and the value they provide. You may find that covering the services layer
is a better option for your product. If you don't have access to the code, then
writing black-box tests that assess the services with well-defined inputs and outputs is
more appropriate for your quality control. Likewise, measuring the number of tests, as
suggested by the pyramid, tells us nothing about the quality of the tests. Good quality
tests catch errors before your user does. When there is a failure in production, or a bug
raised by a user, it is very likely that you need to write some more automated tests.
The other way to look at this would be to calculate the risk of not testing a piece of
functionality. Perhaps the application you're writing is a one-time throwaway or just
a simple technical spike that does not require rigorous testing. However, if a piece of
functionality within your product is used all the time and it has no automated tests
written for it at all, this could be a good place to focus your automated testing efforts.
Create a culture within your product development team where testing is a continual
practice. Testing should not be an afterthought of the development process. All too
often, we see testing beginning once the development team throws a package over
a wall to the test team for some testing to begin. For us, every item in a sprint will
always have some level of testing. This isn't done by some third-party team, but by
the engineers themselves. Developers will often favor unit and integration testing,
but quality assurance (QA) teams will often favor automated UI testing to validate
the application from a user's point of view. Sometimes, if the culture is not right and a
team is being squeezed to turn out new features, testing quality can drop, leading to
an almost inverted testing pyramid: a few unit tests on the bottom, followed by a few
more service tests, and then a load of brittle UI tests sitting on top! This has an effect
on the quality of the software delivery pipelines. The feedback loop from Dev to QA can
be very long, with little to no value from unit tests and expensive UI tests that are not
providing feedback quickly enough.
Decreasing the quality by inverting the testing pyramid during delivery can be very
damaging to a team. If the volume of defects increases markedly, trust in the team will
falter. If there is no trust in the team, then autonomy could be the next thing to break,
leading to a heavy command-and-control-driven culture. Teams operating in this way
will very quickly fail and top talent will leave.
You can learn more and collaborate on test automation practices by going to the Open
Practice Library page at https://openpracticelibrary.com/practice/test-automation/.
TDD is a simple process, yet it is somewhat misunderstood by some teams. Start off by
writing some tests for the functionality you're building. At this point, they should fail
(RED). If they don't fail, then either your tests are not very well written or the
functionality already exists! A developer then writes the code to make the tests pass
(GREEN). With the tests now green, refactoring can take place or, as Kent Beck, an
American software engineer and the creator of extreme programming, puts it, refactor
to remove duplication: remove duplicate code, make the code leaner, and tidy it up while
maintaining the green state of the tests. The process is simple: Red > Green > Refactor.
Writing tests first is a hard practice to adopt and takes time and perseverance to get the
skills right, but it can lead to less spaghetti code. Because the tests are
written first, they lead the design and implementation of the code.
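As a minimal sketch of the Red > Green > Refactor loop (the file and function names are our own illustration, not from any real project), a first test written with pytest might look like this:

# test_basket.py -- written before any implementation exists, so it fails (RED)
from basket import total_price

def test_total_price_sums_item_prices():
    assert total_price([2.00, 3.50]) == 5.50

# basket.py -- the simplest code that makes the test pass (GREEN)
def total_price(prices):
    return sum(prices)

With the test green, we are free to refactor basket.py – renaming, simplifying, removing duplication – and rerun the test after every change to confirm it stays green.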
A great exercise to do with teams wanting to try TDD without code is to do the Lego
TDD simulation on Gargoyle Software's website: http://gargoylesoftware.com/articles/
lego_tdd.
You can learn more and collaborate on TDD by going to the Open Practice Library page
at openpracticelibrary.com/practice/test-driven-development.
DDT is easy and probably the place to start if you're not writing any tests. The
important point here is that some tests are being written! DDT focuses on the
developers writing code as well as writing the tests. Simply put, the developer codes
for a bit, writes some automated tests, and then goes back to coding and testing. This
might sound a bit like TDD, but the key difference is the order. Code first and then test,
resulting in the code influencing the tests as opposed to the tests leading the software
design. The objective of DDT is that developers need to own their code and that
everyone should be responsible for testing.
BDD is a great tool to have in your toolbox as it brings people together in a shared
understanding of the scope of a story or feature under development. It's less of an
engineering tool and more of a method that focuses on the conversation to be had
between business and developers when writing features. BDD is about using a shared
language to write concrete examples of how an application should behave.
How the tests are implemented is then decided by the developers. But, more
importantly, a common language can be used between developers and product owners
to scope out a story without leading the design of the software. BDD can be a useful
way to write acceptance criteria for a story together. There is a common syntax
or approach to writing BDD tests based on work by Dan North, an agile coach and
originator of BDD4:
Scenario 1: Title
Given [context]
And [some more context]...
When [event]
Then [outcome]
And [another outcome]...
4 https://dannorth.net/introducing-bdd/
For example:
Scenario 1: Buying an Ice Cream to cool off on a hot day
Given I have ten pounds in my pocket
When I purchase a Choc Ice for two pounds
Then I have only eight pounds left
And have a Choc Ice
For any feature being developed, there are probably a number of scenarios that could
be tested. These scenarios are defined using the common syntax of Given, When, Then.
Codifying the acceptance criteria using a common syntax can simplify the writing
of tests and help build a shared understanding of the scope of an activity. Dan North
suggested this story-driven approach to BDD some years back and, since then, the
syntax has been adopted by many testing frameworks, such as Cucumber.
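To make this concrete, here is a hedged sketch of how the ice cream scenario above could be automated with Behave, a Python BDD framework that understands the Given, When, Then syntax (the file names and step wording are ours, for illustration only):

# features/ice_cream.feature
Feature: Buying an Ice Cream
  Scenario: Buying an Ice Cream to cool off on a hot day
    Given I have ten pounds in my pocket
    When I purchase a Choc Ice for two pounds
    Then I have only eight pounds left

# features/steps/ice_cream_steps.py
from behave import given, when, then

@given("I have ten pounds in my pocket")
def step_given_money(context):
    context.pounds = 10

@when("I purchase a Choc Ice for two pounds")
def step_purchase_choc_ice(context):
    context.pounds -= 2

@then("I have only eight pounds left")
def step_check_money_left(context):
    assert context.pounds == 8

Running behave from the project root executes the scenario, so the business-readable feature file doubles as both the acceptance criteria and the automated test.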
The architect on the team was a strong believer in the BDD approach to
writing stories. All of our acceptance criteria were written in this way, and
it was a great way for us to understand the scope of what we were doing.
When I was pairing with another engineer, we would use the acceptance
criteria written in the BDD syntax as our starting point. We imported the
syntax straight from Jira to scaffold out the test cases using Python Behave.
For us as engineers, this made coding the features a breeze. We had been
given the specifications, so we could easily implement our code to pass the
tests.
BDD can help engineers understand the context of features better. It also helps bridge
the gap of alignment with business experts and product owners:
BDD brings down the wall between technology and business people.
Example Mapping
Example Mapping, by Matt Wynne, CEO of Cucumber,5 is another great tool to have in
the toolbox. Once again, with a lot of these practices, it's just another really useful way
to articulate and drive a conversation. In this case, Example Mapping is primarily used
to drive shared understanding when writing stories and creating acceptance criteria.
We believe it's great for helping teams write behavioral-driven tests. The process is
simple and only involves four colored Post-Its:
• Yellow: For the story itself (as a header for the example map)
• Blue: For specific rules associated with the story
• Green: For examples of rules
• Red: For questions or unknowns that arise during the discussion
5 https://cucumber.io/blog/bdd/example-mapping-introduction/
Begin by selecting a story and writing it on a yellow sticky note. Place it at the top of your
example map as a header. In a horizontal row underneath that, begin writing business
rules on blue sticky notes. Beneath the blue business rules, create columns of green
sticky notes with individual examples of those business rules. These could be relatively
unstructured, Friends-notation ("The one where...") examples, or full-blown Given, When,
Then criteria.
As misunderstandings arise surrounding individual examples or entire business rules,
add red stickies with questions written on them.
When there are enough examples that everyone is comfortable with, they can be
rewritten as both automated tests and acceptance criteria.
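As a purely illustrative sketch (this map is our own invention, not taken from a real session), a fragment of an example map for a pet photo upload story might look like this:

Story (yellow):  Upload a pet photo to enter the competition
Rule (blue):     Only family-friendly images are allowed
  Example (green): The one where a cat photo is uploaded and appears in the competition
  Example (green): The one where an inappropriate image is rejected
Rule (blue):     An uploaded photo only appears after the uploader has voted
  Example (green): Given I have uploaded my cat, when I vote for another pet, then my cat appears in the competition
Question (red):  Who is responsible for removing inappropriate images, and how quickly must it happen?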
We were using Event Storming (more on this later) to model the onboarding
process for a new user to their application. We had a command that read
Submit relevant topics of interest, which was added to our backlog. We chose
this command so we could learn more about things our users would be
interested in, in order to better serve them recommendations.
We used Example Mapping to break this story down by first writing some
rules. We were not super strict on following a ubiquitous language at this
point as we knew the team would convert them into BDD-style syntax
afterward.
You can learn more about, and collaborate on, the Example Mapping practice by going
to the Open Practice Library page at openpracticelibrary.com/practice/example-
mapping.
Non-functional Testing
While the importance of testing cannot be overstated, it's critical to keep an eye on
other metrics that may give further insight into the quality of our code. For example,
how do you know your tests have enough breadth to check all the code? What if your
tests are passing, but the application response time is awful? Traditional unit and
integration testing might not catch these things. There are tools we can use to identify
causes and issues with our code base and, more importantly, fix them sooner rather
than later.
Code coverage reporters are simple to implement and usually come bundled up with
a lot of modern test frameworks. The idea is simple. While running our test cases, the
code base is being watched. Once test execution is completed, a report is generated
showing what lines of code have been hit and where there are gaps. These are useful
reports to help the team identify where there is room for improvement but they should
not be treated as the absolute truth. As with all these things, there are ways to trick the
coverage reports, but good developers and peer review processes should catch these
things. Often, teams will strive to increase the testing coverage if they have not started
from a very good state. Bringing these reports to a retrospective can be good for teams
to analyze and set higher targets. More aggressive teams may even fail their pipeline as
unstable if the coverage is below a certain threshold!
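As a hedged example (the package path and threshold here are illustrative, not prescribed), a team using pytest with the pytest-cov plugin could make the pipeline fail whenever coverage drops below the agreed level:

# Run the suite, measure coverage of the src package, and fail if it falls below 80%
pytest --cov=src --cov-fail-under=80

The same idea applies in most stacks; what matters is that the threshold is agreed by the team and enforced automatically by the pipeline rather than policed by hand.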
Static code analysis is another tool that can provide insights into a code base that unit
testing will not detect, by creating rules for how the code should look and execute.
Consistency in how you write code is particularly important for non-compiled languages
such as JavaScript. JavaScript also behaves differently in different browsers, so writing
a set of rules, such as using single quotes instead of double quotes for all strings, can
help ward off any unexpected behavior. If we have the rules codified, we may as well
ensure that everyone adheres to them, so add them to our pipeline! Coding standards are
very important in multi-team setups too. If the code base conforms to a standard
structure and design, maintenance and updates become much simpler.
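For instance (a sketch of the idea rather than a prescribed setup), a team working in Python could add a lint stage so that any breach of the agreed rules fails the build:

# Lint stage in the pipeline: flake8 exits with a non-zero code when rules are broken,
# which most CI tools treat as a failed step
flake8 src/ tests/

JavaScript teams can do the same with tools such as ESLint; the point is that the agreed coding standards are enforced by the pipeline rather than by memory or goodwill.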
Performance Testing
Sam's Code
Around 2014, we worked
for a retail organization
building mobile backend
services and some
automation around it.
This layer of services
was responsible for
aggregating data from
different backend systems such as product listers and categories, and
reviews. The services also performed some very basic data manipulation to
make the payloads more mobile consumable. It was critical that the adapters
responded in a very timely manner, as mobile latency was high compared
to modern mobile networks and a fast API response time made all the
difference.
Our team was always cognizant that we should keep track of the time taken
for the adapters to respond. We knew the organization would perform a
traditional load-testing initiative at the end of the program; however, we
didn't want to wait until then to reveal any surprises. We figured there had
to be a way to continuously validate changes we made to the adapter tier in
order to highlight any performance degradation.
One day, we came into the office and our nightly job had turned red! Perfect,
we thought, let's Stop the World and stop all the things we're doing while
we inspect what's changed in the system since last night. A quick check of
the changes that were made in the system revealed that Sam, one of the
team members, had tried to check in some new logic for one of the data
translation functions. Sam had introduced a big loop inside a loop inside
another loop, which had caused the code execution time to spike. It was
something that was not caught by our traditional unit testing, as the logic
was working fine. It was just taking longer to compute.
Have no fear though – Sam, who's a good friend of ours, no longer writes
code for a living as he's moved into a technical sales role. We don't have a
blame culture within our workspaces and I'm sure if Sam was telling you this
story, he'd say it was one of us that checked in that silly piece of code. I'll let
you decide who it was.
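One lightweight way to guard against this kind of regression (a sketch only; the function name, sample data, and threshold are our own illustration, not Sam's actual code) is to add a timing assertion to the automated suite that runs in the nightly job:

# test_translation_performance.py -- goes red if the data translation slows down
import time

from adapter import translate_payload  # hypothetical function under test

def sample_payload():
    # A representative payload; a real suite would mirror production-sized data
    return {"products": [{"id": i, "reviews": list(range(10))} for i in range(1000)]}

def test_translate_payload_stays_fast():
    start = time.perf_counter()
    translate_payload(sample_payload())
    elapsed = time.perf_counter() - start
    assert elapsed < 0.5, f"translation took {elapsed:.2f}s, expected under 0.5s"

Timing tests can be flaky on busy build agents, so the threshold needs some headroom, but even a generous limit would have caught a loop inside a loop inside another loop.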
There are lots of other types of testing and we won't list them all; we'd have to write
another book to fit them all in. We go into more detail about the non-functional nature
of our software in the next section, Discover It.
Emerging Architecture
Hope is not a design method.6
How do we know our architecture is good? What does good mean? Is good architecture
measurable? Have you ever had to operate, support, or fix a system that is poorly
architected?
It may be easier to identify some characteristics of what a poor architecture looks like:
• An unstable and unreliable system that fails regularly in unknown and
unexpected ways.
• The system is slow from a user's point of view.
• It does not scale well with increased users or loads.
• It is hard to upgrade because one small change requires everything to be
re-deployed, which is slow and costly.
• It is dependent on clients or other systems and cannot be easily modified or
changed without changing the other systems as well.
• It has a lot of complex business functions buried in the database, possibly involving
triggers, that cannot be easily changed because the database schema is complex and
modifying it has unknown side effects.
• The system is hard to manage and operate.
Note
Generally speaking, it is better to make architectural and technical decisions as late
as is reasonably responsible, so that the most information is available to
those making the decisions.
There have been literally dozens of excellent books and articles written on
what is considered good architecture and patterns over the years. Our personal
choice is anything written by Martin Fowler (https://martinfowler.com/books/),
Chris Richardson (https://microservices.io/), and Sam Newman (https://samnewman.
io/books/), but there are many others.
The Hammer
One pattern or behavior that we've seen a lot through the years is the Golden Hammer
pattern, in other words, if all you have is a hammer, everything looks like a nail. This is
more formally known as the Law of the Instrument.
All developers have their favorite tools and architectural approaches. For example, the
authors are fans of reactive, streaming architectures (Mike), asynchronous event-driven
messaging (Noel), and anything with Node.js or Emojis (Donal). The risk here is that
your own bias may lead you down an architectural path that is ultimately the wrong
approach.
If you find yourself listening to the first 10-20 seconds of a discussion around a business
problem and feel compelled to jump in saying Oh well, product, architecture, or tool X
can help with this, let's face it, you may be reaching for your golden hammer.
Resumé-Driven Development
Many times, we've been involved in discussions around a technology being used in a
customer solution that is either out of place or just doesn't quite fit the problem they're
trying to address. We often discover that someone had introduced this technology as
they were keen to learn it and somehow it went from a technical experiment or spike
to a core component technology. There's absolutely nothing wrong with technology
experimentation; it should be encouraged, but care should be applied to ensure that a
chosen technology doesn't lead to a dead end or become a technology solution looking
for a problem. Examples of technologies where we've seen this include Service Mesh
and others as depicted in Figure 7.29:
We often forget to think about how much data is lost – the lost data can be a massive
source of lost opportunity for our business. This happens in cloud, IoT, industrial, and
even mobile web use cases with processing data on our mobile phones.
Conclusion
In this chapter, we continued our exploration of technical practices to create a solid
foundation for us to be able to deliver at speed as one single cohesive unit. By using
techniques such as the Big Picture to gain a shared understanding of our delivery
pipelines, we further identified methods for testing and how we can connect the
business to the acceptance tests in a way that's more developer- and business-friendly.
To close off this section, we have now built a solid foundation of culture, leadership,
and technical excellence. We have put in place principles and practices, including:
• Autonomy, mastery, and purpose
• Psychological safety
• Social contracts, stop-the-world events, real-time retrospectives, team identity,
and information radiation
• Leadership intent and team empowerment
• Priority sliders
• Team spaces
• Everything as code
• Containers
• Continuous integration, continuous delivery, and continuous deployment
• Test automation
• Emerging architecture
In Section 2, Establishing the Foundation, we built up the initial foundation for our
first teams to work on. It consisted of a culture that enabled psychological safety and
promoted autonomy, mastery, and purpose for team members. It also consisted of
the technical foundation that we bootstrap to enable teams to confidently start their
application development, giving them the best chance of maintaining steady continuous
delivery and not getting trapped by early technical debt.
We’re now going to work our way around the Mobius Loop, starting with Discovery:
In Chapter 8, Discovering the Why and Who, we’re going to explore how teams and
stakeholders can collaboratively discover outcomes and understand the common
purpose underpinning application development. To help navigate through this part of
the process, we will be using the Discovery Loop of Mobius:
When we are on the Discovery Loop, we identify and use practices that help us answer
the question of Why – why are we embarking on this journey? What problems are we
trying to solve? Who are we trying to solve them for and what do we know about them?
What is our great idea? We also use practices on the Discovery Loop to help us identify
and set measurable target outcomes for the business and its customers. Chapter 8,
Discovering the Why and Who, focuses on the upper portion of the loop, helping to
answer the Why and Who questions.
In Chapter 9, Discovering the How, we will shift the focus to practices that can help
us start building a solution. Our approach to solving problems is incremental and
experiment-driven. In particular, we focus on a group of practices that bring about a
shared understanding through visualization and modeling techniques. No sticky note
will be spared in this chapter!
Chapter 10, Setting Outcomes, will focus heavily on those outcomes, what they are, how
we derive them, and how we use them. By the end of this chapter, you will be able to
clearly articulate and define the team goals and outcomes being delivered.
8
Discovering the Why and Who
Too often, software delivery teams dive straight into delivering features and focus
purely on output and deliverables. They don't take sufficient time to understand
what business problems they're trying to solve and the people they're trying to solve
them for.
Knowledge of business processes and the domain is often unevenly distributed, sitting
inside individual people's heads. The misalignment in knowledge across
the team and between different stakeholder groups causes misguided decisions,
misunderstandings, and incorrect assumptions to be made, with teams delivering the
wrong solutions solving the wrong problems.
This is why we start with a selection of discovery practices in order to align all
team members and stakeholders with a shared understanding of the questions: Why are
we doing this? Why are we investing time? Why are we embarking on this and not
something else?
Furthermore, we explore what problems we are trying to solve and who we are trying
to solve them for. Understanding these people, their pain points, and opportunities for
improvement is crucial before we can start defining target outcomes for them.
The Discovery practices we are going to explore in this chapter really benefit from a
whole-team approach and being inclusive of all interested stakeholders. Gathering a
wide range of views and encouraging engagement and collaboration across a broad
group of people from different functional areas across the business helps to achieve
alignment and a committed group of people who share the same understanding of the
problems we're trying to solve.
Let's take a look at our first discovery practice that helps align everyone to move in a
single direction, the North Star.
1 https://growthhackers.com/articles/north-star-metric
2 https://amplitude.com/
3. It's a leading indicator of success. It predicts future results, rather than reflecting
past results.
4. It's actionable. We can take action to influence it.
5. It's understandable. It's framed in plain language that non-technical partners can
understand.
6. It's measurable. We can instrument our products to track it.
7. It's not a vanity metric. When it changes, we can be confident that the change is
meaningful and valuable, rather than being something that doesn't actually predict
long-term success, even if it makes the team feel good about itself.
The North Star Framework complements Mobius very well. In the discovery loop
(this section), we focus on defining the North Star and getting an aligned and shared
understanding across stakeholder groups and team members as to why the North Star
Metric has been chosen. In the next section, we'll be exploring the options pivot, which
is all about the actionable research, experiments, and implementations we can perform
to influence a North Star Metric. In Section 5, Deliver it, we'll explore the delivery loop,
including the practices to measure and learn – our instruments for tracking a product's
success against a North Star:
Using the North Star practice during a discovery loop can help set teams off in the
right direction and achieve alignment between all its members and its stakeholders. It
can also help teams achieve course correction or re-alignment if they've been running
without a North Star. Having the North Star information radiated on the wall and clearly
in view of team members and interested stakeholders can help maintain focus on it and
steer the team toward it.
Many of the practices we explained in the previous section to create a foundation of
culture and collaboration will benefit the process of defining and using the North Star.
For example, having a social contract and retrospective practices that build a team's
autonomy and psychological safety around contributing to strategically important
information will help to promote wider group ownership of this metric.
Having practices to promote visualization of work will help share, inspect, and focus on
reaching the North Star.
So, the North Star is the first discovery practice that highlights the importance of
needing a foundation of culture in place. Its effectiveness is much higher if an open
culture exists, with people feeling psychologically safe to contribute and collaborate
with each other. We will see this pattern continue.
To help find a North Star, we make use of a simple canvas, such as the one provided in
Amplitude's playbook:
A team using the North Star Framework identifies a single, meaningful metric and a
handful of contributing inputs. Product teams work to influence those inputs, which, in
turn, drive the metric. The North Star is a leading indicator of sustainable growth and
acts as connective tissue between the product and the broader business.
Let's look at an example of a North Star Metric and how it was derived from inputs.
OpenTable is a popular platform for booking tables at restaurants. A recommended
approach to discovering the different leading inputs is to consider breadth, depth, and
frequency; for example:
• Breadth: How many users are being engaged? An example leading input indicator
for this would be how many Monthly Active Users (MAUs) there are – users who
at least performed a search on the site.
• Depth: What level of engagements are users having? An example input indicator
for this would be the booking conversion rate per session with a search.
• Frequency: How often do users engage? An example input indicator for this
would be the average time interval between user searches.
These are all inputs whose improvement we want to drive through our product design
decisions, but we require one crucial metric that represents the
value of the product. OpenTable's North Star Metric is the number of seated covers
per month – in other words, the number of times restaurant tables are used by paying
customers each month. If this is high, we have a successful product driving good
business. It is the one metric that does not lie and is the one we should be relentlessly
focused on as being the indicator of success:
This canvas also captures what the mid/long-term impacts to the business of the North
Star Metric will be; in this case, a flat-rate fee taken for each seated cover booked
generates revenue for OpenTable.
Let's go back to our PetBattle team and see how they went about finding their North
Star Metric.
PetBattle as a Business
PetBattle was initially conceived as a fun idea to experiment with new
technologies. However, due to cats being the internet's favorite time-waster,
the site quickly went viral. As it was experimental, it wasn't built in the
most robust or professional manner. This caused the site to crash due to
the overwhelming number of visitors – a cataclysm of purrrfect disasters!
Malicious users on the internet have been uploading inappropriate content
to the site, which has to be removed manually as the site is intended to be
family-friendly.
So, the founders decided to try and monetize the site while still keeping to
the simplistic approach of the initial creation.
PetBattle's founders decided to embody the spirit of Dan Pink's book, Drive,
when building up their cross-functional product team, so started off by
building a cultural foundation. Each new team member adds their unique
view, personality, and motives to the culture of PetBattle. In order to retain
talent and enhance this great culture, PetBattle must make them feel like
they can achieve autonomy, mastery, and purpose.
PetBattle's employees need to feel connected to the vision of the product.
They held a North Star workshop to connect all employees to the vision
of their product and determine their North Star so that they can regularly
inspect and measure against it.
The workshop lasted for two hours and included the following:
• An opening discussion on the reasons for North Star
• Identifying the game PetBattle users would be playing
• A discussion on what makes a good (and bad) North Star
• A discussion on the structure of the North Star and its inputs
• Looking at North Star Metrics for other products
• Drafting PetBattle's North Star
on how efficiently and effectively a team member can get their work done.
Following some discussion on whether PetBattle was more of an Attention
game or a Transaction game, the group concluded on it being an Attention
game. Ultimately, customers want to spend lots of time enjoying the
PetBattle product and the aim is to keep them engaged.
In collecting candidates for the PetBattle North Star, employees used a silent
brainstorming technique first, and then paired up with one other person
to discuss their ideas. Two pairs then formed a group of four people to
merge their collective ideas. Finally, the full group converged all their ideas
on a single wall. This facilitation approach is called 1-2-4-all and is a great
example of a Liberating Structure.
Candidate metrics for the North Star included the following:
• The number of site visitors
• The number of site visitors who have engaged (for example, at least
placed one vote for a pet)
• The length of time spent on the site
• Increased follow-through to (eventual) advertisers
• The number of PetBattle competitions
• The number of pets uploaded to the site
• The site's drop-off rate
• The site's stickiness rate (the percentage of users returning three or
more times in a month)
• The percentage using the mobile app
During convergence of all employees' ideas, some great conversation started
to materialize about what they really saw as success for PetBattle. Yes, they
wanted more people to visit and yes, they eventually wanted advertising
revenue, but the consensus seemed to be that the one crucial metric that
was going to take PetBattle to the next level was the number of pet uploads.
Most of the other metrics were inputs that could all help drive up the
number of uploads. If there are more users, eventually more of them will
engage. As more engage, eventually more will want to engage more and
participate. The mid/long-term impact of participation would drive the
impact of increasing the likelihood of advertising.
We can use a North Star Metric in both product and services organizations, as
demonstrated by our own use in Red Hat's Open Innovation Labs.
Knowing, understanding, and aligning on your North Star provides an excellent basis
to move to the next discovery practice, which explores the linkage of strategic goals
to people, measurable impacts, and deliverables. This practice is known as impact
mapping.
Impact Mapping
Impact mapping is a technique that helps build shared understanding and alignment
between leaders of organization strategy and product delivery teams. Gojko Adzic first
documented the technique in his 2011 brochure,3 which provides an excellent guide for
people who want to facilitate the workshop.
Adzic defines impact mapping as a lightweight, collaborative planning technique
for teams that want to make a big impact with software products. It is based on
user interaction design, outcome driven planning and mind mapping. Impact maps
help delivery teams and stakeholders visualize roadmaps, explain how deliverables
connect to user needs, and communicate how user outcomes relate to higher level
organizational goals.
At the end of the impact mapping exercise, you should have:
• A shared understanding of the goal or problem statement.
• An inventory of human behavior changes in users and stakeholders that must
occur (or not occur) in order for your project to be successful. These are the
impacts from which the technique gets its name.
• Optionally, an inventory of project deliverables that could be delivered to achieve
the aforementioned impacts.
• Some sort of prioritization of scope – one or more deliverables – and/or impacts.
• A graphic presenting the preceding information.
3 https://www.impactmapping.org/
We'll then bring the Impact Map together and show how we use it to form hypothesis
statements that can be used to drive experiments.
Goal setting can be a difficult thing to agree upon. Often, there are competing views on
what direction we are headed as an organization with the product or business area we
are impact mapping. However, clarity of the goal for an organization can trickle down to
all levels of a company, thereby creating a shared purpose for the individuals.
Let's look at the goal set by the PetBattle team.
Setting a goal amongst a small group can be straightforward. What about larger groups?
Let's look at some anti-patterns and tips for gaining consensus and inclusivity.
Note
When undertaking any group activity, often the person who shouts loudest is the
one who's listened to the most. Sometimes, the most senior person in the room is
given the most airtime in discussions – this is known as the Highest Paid Person's
Opinion (HIPPO) effect. This leads to one-sided views being portrayed as the norm
and makes people who are quiet or shy often feel unheard. If you're beginning
to write your goal, a good method to get everyone's voice heard is Liberating
Structures' 1-2-4-all practice (http://www.liberatingstructures.com/1-1-2-4-all/). It's
a simple practice to use, but immensely powerful. Start by setting a one-to-two-
minute timer and having each individual in the group write down what they think
their business goal is. Next, pair up and give another two minutes to the groups
to agree on a shared goal between the pairs. Now repeat in groups of four, before
finally coming together as a whole, focusing on just one goal:
When all the members in the team align on a goal, it can be very helpful to write it up
on a large board or flipchart paper. Having something that everyone can nod their head
to and agree this is why we are here, this is our Why, this is our shared purpose makes a
very powerful information radiator. Let's remind ourselves of PetBattle's goal on a big
sticky note:
The next part of the impact mapping process is to consider Who can help us reach our
goal.
The Impact Map starts to take shape by visualizing and connecting these
actors to the goal. These groups of people are the stakeholders and
interested parties who will help us to achieve our goal:
We have now visualized the connection between the goal and all the actors that can
achieve or impede it.
Note
The founders of PetBattle struggled when defining the initial set of actors and
mistakenly broke the group into authenticated and non-authenticated user groups
without tying it back to the goal and the problem statement. This is a classic
example of trying to engineer a solution too early without identifying why these
classifications exist. Identifying the boundaries between groups is very useful.
Sometimes, you can have coarse groups. By splitting them out, you may find
additional impacts that are only applicable to those sub-groups.
One of the greatest benefits we see to the impact mapping exercise is all the conversation
and shared understanding it generates. When naming actor groups, we find that different
people can have different understandings or terms they use. In the preceding example, we
may get a short conversation to differentiate between a casual viewer and an uploader.
Some people may have just called these generic users. Having ubiquitous language from
the outset can really help with the software development process. This is true for teams
that focus on both the technical and business aspects of the product. Already, we are
seeing the need to differentiate between different user groups in our software design.
The next stage is arguably the most important and key to the impact mapping practice
– defining the measurable impacts we plan to have on each of the actor groups.
We have now visualized the connection between the goal, all the actors that can
achieve or impede it, and the measurable impact we want to have on those actors.
Note
Always start an impact statement with the term Increase or Decrease (or Reduce).
By doing this, we avoid accidentally slipping into the territory of capturing features
or outputs when we want to focus on behavioral change on the actors. It also
means the impact should be measurable. By thinking about how we want to
increase or decrease something, we are quantifying it. We could apply a measure
to the impact now (even an arbitrary measure) and come back in the future to see
whether that impact has moved up or down.
With the goals and impacts defined, we are coming close to the aspects we are often
most comfortable with – the deliverables. The deliverables could be features or groups
of features where a feature is a slice of the functionality of the application. What are the
tangible things we could do to create this impact? The temptation is to jump straight to
features, but by describing impacts as a measurable change in human behavior, we are
forced to ask why feature X is important and relate it back to our business goal.
Let's now look at the deliverables considered for PetBattle.
This part of the Impact Map is not really the focus of this particular practice, and
we've seen teams choose not to complete it. Impact mapping is all about capturing the
impacts on actor groups through conversation and collaboration so that we can start
to measure what matters. We have other practices that we'll explore later in this
chapter that are better at defining the outputs and deliverables. That said, it's virtually
impossible not to think about what you might build to achieve these impacts, so it's
good to capture them during the conversation.
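To make the shape of the map concrete, here is a purely illustrative slice (the wording is ours, not the actual PetBattle map) showing how one branch connects the Why, Who, How, and What:

Goal (Why):             Generate revenue while keeping PetBattle family-friendly
  Actor (Who):          Uploaders
    Impact (How):       Increase the number of pets uploaded per engaged visitor
      Deliverable (What): The scoreboard feature
      Deliverable (What): A loyalty scheme for frequent uploaders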
The resulting Impact Map was big. It was full of ideas for deliverables all
lined up to anticipated measurable impacts:
Our Impact Map is complete. It connects the deliverables (the What) to the impacts
(the How) on the actors (the Who), helping to achieve the overall goal (the Why):
At this point, we can start to think about some prioritization – at least among the
stakeholders who participated in the practice. Using the impact mapping practice,
we have come up with a set of hypothesis statements. We don't know whether those
statements will turn out to be true. They're just intelligent guesses at this point. To
prove or disprove hypotheses, we need to run experiments. Each hypothesis statement
can be turned into an experiment. But which experiment do we run first and what order
do we run them in?
Time to place your bets! An interesting way to approach this is to ask stakeholders
where they want to place their bets. To simulate this, we can give each stakeholder
some fake money. $10,000 Beattie Bucks or $10,000 Donal Dollars or $10,000 Mike
Money. (Noel, the fourth author, decided he didn't want to enter the currency market!)
This is a simple practice to simulate what it would be like if each person in the room
was the owner of the company and it was their money.
Attendees of the session now place a bet with their fake money. They bet on which
impact statement they would like to invest in an experiment first. They may choose
to invest in just one impact or spread their bet across multiple impacts. This allows
everyone to pick their top-priority deliverables, showing an aggregated view of what
the team should focus on building first.
Let's look into an example to understand this better:
They were asked which impact statements they would like to invest in and
were encouraged either to spread their money across different impacts or
place multiple Donal Dollars on one impact if that was deemed of very high
value to them.
Some stakeholders chose to invest all their Donal Dollars on just one impact,
while others spread them out. The Impact Map, with Donal Dollars attached,
showed a clear winner:
In the next section of the book, Section 4, Prioritize it, we'll look at how we prioritize all
the learning and outputs from the practices we used on the Discovery Loop. One of the
viewpoints will be the outcome of investment betting on the Impact Map.
A final point before we share a couple of real-world stories about impact mapping is
to mention the fact that the Impact Map is never complete. It is not static. It is not
a one-time thing. It will always evolve. Teams should regularly come back to their
Impact Map after they deliver some features, run some experiments, and learn from
their users. They should converse on the results of experiments and conclude whether
existing hypothesis statements were proven or disproven.
We introduced psychological safety in the previous section. A helpful test of the current
psychological culture is to inspect what happens when a hypothesis statement from an
Impact Map turns out to be wrong. A failed hypothesis could impact team morale and
result in a lack of psychological safety. If, instead, a disproven hypothesis is treated as
creating room for further learning and improvement, it can boost morale and establish an
open culture in the team.
Hypothesis Examples
Let's see how we used impact mapping at
our Open Innovation Labs residency to help
a global bank that was seeking to improve its
HR system.
The business goal was for HR to have an
integrated workflow across HR systems,
providing an improved colleague experience.
Some metrics were identified at the goal level,
including colleague satisfaction and the lead
time to change a service.
The brainstorming of actors identified colleagues, which the group agreed
would be the ubiquitous term they would use for all employees, managers,
second line managers, partners, and HR operations, as well as management
and technology stakeholders.
Measurable impact statements and possible deliverables were identified for
each actor group, leading to a number of powerful hypothesis statements,
including the following:
We hypothesize that delivering a single sign-on solution will speed up the
service, thereby helping colleagues and managers achieve the above goal.
We hypothesize that integrating with a data hub will increase the quality
of colleague data available to HR operators, thereby helping to achieve the
above goal.
We hypothesize that developing custom dashboard features will increase
insights into the management information group.
We hypothesize that designing a new governance model will help the HR
specialist speed up the remodeling process.
This is a great example for demonstrating how impact mapping can help
drive much larger enterprise systems (well beyond standalone applications)
and can also identify outputs beyond software (for example, the governance
model is an organizational feature).
I thought about the many development teams I'd worked with. It was
refreshing to see technical staff so connected with the Why. They were
engaged in ensuring that their outputs were delivering measurable
outcomes.
For more information and examples and to have further conversation with community
members about the Impact Mapping practice, please visit the Open Practice Library
page at openpracticelibrary.com/practice/impact-mapping.
Impact mapping is, without doubt, one of the best practices to start your discovery
conversations. What else could we do to enhance this? We're now going to focus on
the actors' perspective, in other words, what the product could do for them, using
Human‑centered design.
Human-Centered Design
Impact mapping identified all the different actors and groups of people that can help
achieve our goal. It also identified the measurable impacts we would like to have on
them. Wouldn't it be great if we could actually meet and engage with some of these
people? We could validate our Impact Map with them. We could check whether the
deliverables we have in mind would be supported by them. We could learn from them
before we write a line of code! We could build a trusted relationship with them whereby
we regularly check in with them, test, measure, and learn from them as we start to build
out applications.
Human-centered design is an approach to system development that aims to make
systems usable and useful by focusing on the users, their needs, and requirements.
There are several major topics that relate to User Experience (UX). Each of these is
large enough to fill a book of its own; they include:
• Design thinking
• Human-centered design
• The Double Diamond process model
• Lean UX
In the software development world, we can start with people and end up with
innovative applications that are tailor-made to suit our users' needs. By understanding
our users and looking at things from their perspective, we can ultimately produce a
product that they will love to use and engage with.
People react to what they see – visual design. However, there are many layers below the
visual that UX designers must consider in order to make an application engaging and
useful:
Let's explore some of the different components of UX design that should be considered
in product development:
• Facilitation of product discovery involves leading the process of finding a
suitable product or service, focusing on product desirability, the viability of a
business model, and the feasibility of technical possibilities.
• Qualitative user research includes running interviews with users and
performing contextual inquiries and other observation studies whereby the
designer learns by watching users carry out their current job.
• Conceptual design facilitates and contributes to the creative work at the early
stages of a product or service formulation.
• Prototyping ranges from paper prototypes to low-fidelity prototypes, to
higher-fidelity mockups that can be used to test ideas with users and other
stakeholders.
• Interaction design starts to look at what the new experience could look like by
using visual storyboards and wireframes.
• Human-centered service design takes a wider, macro-based approach to
performing customer journey mapping, service blueprinting, and increasing the
quality of all touchpoints within a service delivery.
The common denominator of all of the above practices is learning through better
information and understanding of the user stakeholder. This will validate and improve
the predictions made during, for example, finding the North Star and impact mapping.
A popular practice that we use at many different stages of UX design is empathy
mapping. The empathy map is a visual canvas that captures what a stakeholder is
seeing, thinking, doing, and feeling within a certain context. The empathy map can
be used to surface pain points and opportunities for improvement. These then
become an important source of value propositions that we can consider in our product
development.
We use empathy maps during the facilitation of product discovery, user research
(for example, to capture notes during a user interview), getting early feedback on
conceptual design, prototypes, and visual designs. It can also be used to capture
feedback regarding the user testing of developed application features:
One of the simplest human-centered design practices that employs rapid prototyping
is called Think, Make, Check. This involves creating a rapid prototype and interviewing
users about it. We use iterative steps to get to a design. During the user interview, the
team can listen in to help complete the empathy map:
She uploaded a photo of her cat she had on her phone. She then struggled to
understand why her cat was not appearing immediately in the competition
once she'd uploaded it. Ciarán and the team knew she had to vote for another
cat before hers would appear on the list, which seemed obvious to him.
The team observed a few other surprising things about how Mary interacted
with the application (and some of the things she didn't see!).
Ciarán and the team thanked Mary for her time and asked whether
she would be interested in returning in a week or so to help with some
prototyping. Mary was glad to help.
Meanwhile, the team had been building Mary's empathy map which they
would evolve and later play back to her to confirm their understanding and
learning.
A week later, Mary returned to PetBattle. The team had been busy and
produced lots of paper prototypes of some of the ideas that had come out
of the impact mapping session. This included a discussion forum, targeted
advertisements, a deal of the day, the scoreboard feature, and a loyalty
scheme.
While working for a Finnish telecoms company, our team used this practice to bridge
the gap between the Dev and Ops teams:
He would raise requests and the Ops team would be slow to respond,
impacting his ability to progress with any agility.
And of course, on the other side of the divide, we had Timo in Ops, and his
view was different from that of Kalle, our developer. Timo's empathy map
revealed that it was a challenge for him to have to keep repeating himself to
the developers and that requests coming into him were never accurate, so
he'd have to push back or do lots of reworking.
We played both empathy maps back to each team to start ideating on how we
could work together to bridge this divide.
It's a simple practice to use and the information it revealed could be viewed
as predictable, but from here I could start to see where the pain points lay
within the company structure. It became super clear that the teams were
not sharing the responsibility of delivering new functionality or fixes, but
instead blaming each other when things went wrong.
Empathy maps play a vital role in acquiring a deep understanding of end users. Through
the following case study, we discuss our take on this practice.
There are several practice pages in the Open Practice Library that relate to
human‑centered design practices. These include:
• openpracticelibrary.com/practice/aeiou-observation-framework/
• openpracticelibrary.com/practice/proto-persona
• openpracticelibrary.com/practice/stakeholders-interview
• openpracticelibrary.com/practice/empathy-mapping
• openpracticelibrary.com/practice/lean-ux-workshop
• openpracticelibrary.com/practice/ui-design-workshop
These pages provide an excellent overview and starting point to explore these practices
in more detail.
In the next chapter, we will take a look at a discovery practice that can help our team
visualize the end-to-end business process in a way that everyone can understand – the
Event Storm.
Conclusion
In this chapter, we have been on the Discovery Loop with a focus on discovering the
Why and Who of the product - Why are we building this product and who are we
building it for? We started by finding our North Star Metric - a crucial metric that
would be an indicator of product success.
We introduced the Impact Mapping practice, which helps us converge on our overall
goal and understand all the different Actors that will help or impede us in reaching it.
We also learned how to define and map the measurable impacts we want to have on
those Actor groups, along with candidate deliverables, and how to combine them into
hypothesis statements as a basis for experimentation.
We explored the different human-centered design practices that can be used with the
Actor groups to form empathy and support user experience design.
Figure 8.25: Adding practices to the Discovery Loop to discover the Why and Who
In Chapter 9, Discovering the How, we will continue on our Discovery Loop but focus
more on discovering how we might build our solution. To be specific, it focuses on how
to start the architecture and allow it to emerge over time. We'll be exploring
domain-driven design and event storming as practices that help us do this. We'll also
look at practices that help with the non-functional aspects of our architecture – the
non-functional map and metrics-based process mapping.
9
Discovering the How
In the previous chapter, we started our first iteration of the Discovery loop. We focused
on practices that helped us discover why and who we're trying to solve problems or
realize opportunities for.
In this chapter, we will shift our focus onto collaboratively learning how we might
start building a solution that fixes problems for these people or realizes some new
opportunities for them. The practices that we're going to explore to help do this
include:
• Event storming: A colorful practice that visualizes business processes into
event-driven domains with an emerging architecture
• Emerging architecture: A principle we follow whereby we capture just enough
information from the Event Storm and other artifacts to start our architecture
and allow it to emerge over time as we add more complexity
• Non-functional map: A practice that facilitates conversations to explore different
non-functional aspects of the solution
• Metrics-based process map: A practice that allows us to capture some baseline
measurements of legacy systems, identify bottlenecks in those systems, and build a
strong case for moving to a continuous delivery approach
Event Storming
Event storming is an agile and lean way of getting teams to collaborate on technical
domain projects, combine thoughts, and learn to work with a shared understanding.
While at first the process can seem complex, it has become one of our favorite practices
at Open Innovation Labs to use with our customers. The short time taken to break
down silos of knowledge and visualize a business process with this technique is what
impresses us the most. Event storming is the technique we use for gaining a shared
understanding of how a business process operates while also visualizing any potential
issues and solutions. Having facilitated many event storms, this part of the chapter is
our guide to carrying one out and our commentary on how best to apply it.
Firstly, you need a big modeling surface, ideally a long wall with much more space than
you can imagine using. You also need plenty of space for people to stand around and
see the modeling surface.
1 https://medium.com/@ziobrando
An Event Storm workshop gathers people from across the organization. It does not
focus on any specific technology, so those skills can be left outside for the time being.
The only requirement of attendees is their energy, attention, and willingness to give it a
go. During an Event Storm workshop, everyone is armed with orange sticky notes – the
events – and the knowledge about their part of the company that they bring with them.
Software creation is an exploratory task, and while exploring, more learning occurs.
Capturing this is critically important. Event storming is about visualizing all that
knowledge as an event-based mind map and identifying the gaps, unknowns, and
pain points within. With the right audience for an Event Storm, you can get harmony
between groups who traditionally might never meet and, more importantly, bring
alignment where previously there may have been misunderstanding.
In an Event Storm workshop, you may have the business analysts, who know the
business needs and demands, identifying the commands and events alongside the
developers, who will implement these as features. Couple this with the UX designers
(working with end users) doing UI design and data model validation to support this, and
all of a sudden you have alignment from end to end. You also get early
verification of what could work and what won't work before you write a single line of
code.
For a more comprehensive background of event storming, check out the Open Practice
Library, where you will find more links, articles, and examples of event storming being
used in the field with our customers. We are going to focus on an Event Storm at the
process and feature level.
Let's take a look at what you need in terms of ingredients to run a great Event Storm.
The Ingredients
The Event Storm ingredients are all the things you need for a successful Event Storm.
First up, you need sticky notes and you need lots of them. Think of a very large number
of them, then double it. Event storming uses a very specific color-coded key. It is
important to stick to the authors' colors as it gives consistency as you move from Event
Storm 1 to Event Storm 2.
Next, you need energy. Event storming is an intense workshop that requires people's
attention the whole time. Ironing out misconceptions requires the right people and for
them to be able to speak out. Bring lots of good coffee and plenty of water and fruit!
You will need a good open space. Give yourself an unlimited modeling surface. Roll
out some plotter paper, or if you're struggling for wall space, use big foam boards that
can be moved and added to. You don't want to constrain the amount of information
gathered by lack of wall space.
People are the most important ingredient. Invite the right people. Get your end users,
the business analysts, architects, business owners, product owners, and developers. Get
all the people. The more people you can get, the more fleshed out your Event Storm
will be. A great event storming coach and facilitator will seek to bring everyone who
knows the product and everyone who is collectively involved in building the product to
the Event Storm. If that's too many people, they'll look for representatives from each
department or team across the organization that makes up this cross-functional group.
We have created an Amazon shopping list2 for those looking for a shopping-list-as-
code Event Storm experience. Once you have all the ingredients, let's take a look at the
recipes to cook up a great Event Storm.
The Recipe
The event storming recipe is the list of things to do and the order to do them in!
First up, you need to preheat your oven. By that, we mean hack the space. Remove
all the chairs from the room and mark a big, empty wall space for you to roll out
the modeling surface. Chairs lead to people sitting down, which leads to people not
participating, which leads to people falling asleep! Give yourself lots of room and roll
out your plotter paper/foam boards. Do not start in a corner; start in the middle of the room if possible. Starting in a corner means only 50% of the audience will be able to gather around it, thereby removing half of the knowledge on offer.
2 http://amzn.eu/dViputa
Bring out the Event Storm Flow, which is a key or legend to the Event Storm. We
usually make posters (on large, flip chart-sized sticky notes) the night before running
the workshop, as this can be time-consuming. Hang it so it is within eyeshot of all the
attendees. People who are new to event storming will want to keep referring to it to
guide them initially.
A more digitized version of this could be added to a digital tool (such as Miro or Mural)
if you are going to try to run the Event Storm with distributed team members.
We normally walk through the flow of the Event Storm at this stage with the attendees.
This can be fairly overwhelming as there is a lot to introduce. We create and walk
through an example based on something fabricated but relatable to everyone present
to help hammer home the key points. Usually, we will go for either a to-do list app or
something more tangible such as an online Amazon purchase. Everyone can relate to
these.
With the ingredients all laid out and the oven preheated, it's time to start creating the
mixture! We start with Events.
The first thing we do is set the goal of the event storming workshop. This could be the
entry point of the flow, the endpoint, or both. Introduce the event poster shown in the
previous image. In its simplest form, an event is just something that happened in the
past that someone cares about. With the starting point set, ask the attendees to divide
into groups of two to three and identify all the events they can think of in the system.
Timebox this activity. We usually go for 15 to 20 minutes initially and hover around, making sure people are on task and conversing, clarifying things if need be. If there are SMEs in the groups, we make sure they are distributed evenly across the groups and not bunched together in one. The SMEs are often in the best position to guide this initial building out
of the event spine because they understand a lot about the domain in question.
With the team's events identified, ask for a volunteer group to play their events back to
the group by adding them to the modeling surface and begin telling the story. Enforce
the timeline of the events, moving from left to right. The first person in a group to
volunteer their spine should be rewarded with a gold medal for breaking the ice. We
often draw a medal on a sticky note and award it to this person so they can wear it with
pride! A top tip is to not start fully on the left-hand side of the modeling space. When
things start to get moved around as more knowledge is uncovered, this space can be
useful to grow into.
Ask the other groups to add in their events along the timeline. Encourage the attendees
to shuffle theirs along the surface to make room for more events to follow. If teams
come up with different words to describe the same event, try to get a consensus on
the language. Clarify events that are not fully developed. For example, if a group has a
very high-level event such as Item Ordered, break it down into lower-level details such
as Item Added to Basket or Checkout Opened. If there are any questions or assumptions
being made, mark them if they cannot be answered just yet.
It's important to park the things that cannot be answered confidently and move on or
you can end up in the weeds very quickly. Ensure enough time is given to the discussion
in case the answers can be uncovered, but if not, mark them with a pink sticky note
representing a question. Initially, the question key will be used very frequently. That
is OK; this is the point where we have the least collective knowledge of the system.
When marking an area with a question, we are stating that the conversation will not
be forgotten and can be returned to when needed. A good idea can be to hold off on
revealing this card until enough conversation has been had and it is needed in order to
move on. Be sure to capture all sides of the discussion using multiple cards if need be.
With the spine of events created, have the teams play through the story from front to
back and back to front. Add any missing events through the conversation that occurs.
At this stage, we like to mark pivot events with some tape coming vertically from a
single event. These can be useful in framing the boundaries between key things and can
be great for quickly identifying key sections of the timeline.
The next section introduces the bulk of the keys in an Event Storm puzzle. It's time to
introduce the actors, commands, read model, and systems. Add the charts for each so
they're within eyeshot. Walk through them with the group and clarify if anyone has any
misunderstandings.
The command represents a decision made by a user in response to some information
retrieved from the read model. The actor is the person who issues the command and
the system is the thing that receives the command. It is the responsibility of the system
to respond to the command and therefore trigger an event.
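If it helps to picture where these sticky notes eventually end up, here is a minimal, purely illustrative Java sketch. The names are ours and borrow from the online shopping example used earlier; they are not part of the practice itself.

// Hypothetical sketch: an actor issues a command, the system handles it,
// and an event records (in the past tense) what happened.
record AddItemToBasket(String actorId, String itemId) {}    // command, issued by an actor
record ItemAddedToBasket(String actorId, String itemId) {}  // event, something that happened

class BasketService {                                       // the "system" receiving the command
    ItemAddedToBasket handle(AddItemToBasket command) {
        // the system's responsibility: respond to the command and trigger an event
        return new ItemAddedToBasket(command.actorId(), command.itemId());
    }
}

Nothing like this is written during the workshop itself; the sketch is only to show how a command, a system, and an event relate to one another once development starts.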
There is quite a lot introduced in this section, so it's important that people do not get overwhelmed by the next pieces of the puzzle. It can be handy to go through another
simple example if the audience needs something more relatable. We tend to add a few
of these pieces to the spine so the example is now in the context of the flow being
created by the wider group.
The next part of our key provides a mechanism to capture Questions and assumptions
as well as External Systems.
One more part (for now) to add to the key is the Read Model.
The Event Storm posters should detail parts of the key with examples. Teams should now prepare to go again, adding these new parts to the flow! Break into groups but rotate
the members. If there are SMEs in given areas, again, make sure they're distributed
throughout the teams that are formed.
Ask the groups to come up with the actors, the commands, and the systems the commands are issued to. Timebox this to 15 minutes again for the first pass. If the volume of events is quite large and the group is too, it can be useful for the teams to take the events between two pivot events and flesh them out. This can speed things up, and all will
be replayed together afterward so the shared understanding can still be achieved.
We try to steer groups away from being bogged down on whether the system is
internal/external or how it should be named. It's good to keep things fuzzy until the last
point where we really need to know the low-level details. For us, the primary thing with
the system at this stage is just to identify that there is a thing to which the command is
issued, thus triggering the corresponding event. In our experience, being fuzzy at this
stage prevents people from falling down a rabbit hole, and keeps the discussions at a
high level so everyone feels included.
Have the teams replay their additions to the Event Storm to the rest of the group. By
doing this, more discussion will be generated and there will be more gaps identified.
Capture assumptions and add the missing pieces as part of this replay.
If the teams have divided around pivot events and more time is desired by the groups
to flesh out the flow, we will often rotate the groups to a new section between different
pivot events, set a new time box, and go again. This helps to validate other groups' work
and also flesh out the flow further with fresh eyes, revealing more knowledge.
During the next group replay of the full Event Storm, we introduce the next part of the
key. The Policies and Procedures are introduced along with the Subprocess.
The policies and procedures are a great way to flesh out some of the gaps between
commands and events. While the group is replaying the story, ask questions such as
Does this always happen? or say things such as Whenever a command is issued, we
always do this event. Doing so will tease out small things between the events that have
not yet been thought of. We think this is why Brandolini refers to this card as the Lie
Detector.
The subprocess can be a great way of parking content that will not be stormed during
the scope of the workshop, but will be returned to later, usually in a separate event
storming session.
A customer we worked with had a process for third-party fulfillment that was out of
scope for the process they were trying to uncover. The purple subprocess sticky note
was used to denote this unexplored area and was returned to in later workshops. A
simple piece of tape was used to connect the two processes once they had been nested
beneath each other so the visual connection between the two flows was not lost. A top
tip is not to use a marker or pen as the flow may be (re)moved at a later date!
Branches in the flow will inevitably occur. We mark these in a visual way using happy or
sad stickers to denote the positive and negative sides of the flow. We tend to model the
happy path first. Usually, for every happy flow, there is a corresponding sad flow. The
subprocess, in our eyes, is a great way to capture these without getting bogged down in
branches of the flow that we are not trying to explore in the scope of the workshop or
wish to return to in more detail later.
If an event results in lots of commands being made available, we stack them vertically
so the flow can be fleshed out one at a time. If time is tight, it can be good to just pick
one or two flows to flesh out and return to the others later. It's important when branching occurs to focus on the goal of the Event Storm and not get bogged down in the weeds with low-quality information.
With most of the puzzle pieces in place, by now the Event Storm should start to tell a
more detailed story. Continue to replay the story forward and backward, starting in
the middle and going in all directions. Challenge all the items that are there, and don't
be afraid to add or tear up misconceptions. You should start to notice the volume of
question sticky notes drop off as knowledge is added to the Event Storm.
If the system you're modeling has UI components, a great add-on is to attach some high-level UIs to the read model. Simple sketches on sticky notes can quickly
validate what components the UI might need, as well as validating that the data that's
needed for them can be retrieved. If you invited any end users of the application to the
Event Storm, it can be really powerful to replay the Event Storm with any UI designs
with them to further validate the hypothesis of the flow.
Figure 9.15: The group adding UI to the Event Storm on green sticky notes
Finally, it's time to identify and name the aggregate. The Aggregate is the state machine
of the system.
It's the thing that receives a command and decides whether or not to act upon it. We introduce only the system early in the flow because the aggregate can be the part that leads to the most confusion and the biggest time sponge – especially when trying to name it!
By holding off until the end, a large portion of the unknowns is cleared up. When the
flow is more fleshed out, we get the group to replay the Event Storm and identify where
a system is either existing or has to be built. If it has to be built, we add the yellow
aggregate card and give it a nonsense name such as combobulator. Doing so helps people avoid getting bogged down in the boundary of where the aggregate should be defined.
Continue to replay the flow and the group will naturally start to identify that some
commands are issued to the combobulator and others to something else. Capture
these as something else. When a full pass of the Event Storm has been completed, then
return to naming it. The depth of understanding should be so high that the name and
boundary of the aggregate should naturally fall out.
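For readers who like to see where this eventually lands in code, the following is a small, hypothetical Java sketch of what an aggregate might become: a state machine that receives a command and decides whether or not to emit an event. The checkout-flavored names are ours, and none of this is written during the workshop.

import java.util.Optional;

// Hypothetical sketch of an aggregate: it holds state, receives a command,
// and decides whether the command results in an event.
record SubmitOrder(String basketId) {}        // command
record OrderSubmitted(String basketId) {}     // event

class Combobulator {                          // keep the nonsense name until the real one emerges
    private boolean checkoutOpen = true;      // the aggregate's current state

    Optional<OrderSubmitted> handle(SubmitOrder command) {
        if (!checkoutOpen) {
            return Optional.empty();          // the aggregate decides not to act: no event
        }
        checkoutOpen = false;                 // state changes as part of the decision
        return Optional.of(new OrderSubmitted(command.basketId()));
    }
}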
So, I challenged the product owner and said, "Should we not have the app
autosave for us?" He was not in agreement and thought the users would
want to save their progress manually. I figured, let's use the Event Storm to
run an experiment! We captured both sides of our arguments for and against
a manual save and put them on a pink Post-it to represent an assumption
we're making. I tend to put a dot sticker on Post-its that require further
exploration or experimentation. This makes it easier to view when looking
at the full model. We put a placeholder event on the model to say progress
saved until we'd done some further research.
This assumption was then written into our backlog, where our designers
were able to run an experiment with a clickable prototype to see what
the users wanted or needed. In the clickable prototype, there was no save
button and when asked after the tests if they felt it was missing, all of the
respondents said they assumed their progress was being saved as they
updated the content.
Returning to the Event Storm, we had now validated an assumption we'd made. This meant we could proceed to model out our understanding of the
user flow at this point. It also meant we could extend the things we'd need to
build from a technical standpoint with this understanding.
I always think an Event Storm is the best guess at a point in time with the
knowledge of the people we have in the room. Setting up our environment
with a social contract to ensure psychological safety means we can freely
explore our system. We are able to make guesses and return to them or put
placeholders on the map until we do some further research. This is why
event storming, especially when paired with all the other practices, is such a
great fit for discovering and designing a system.
Let's dive back into the PetBattle story and look at how their team built up their Event
Storm.
3 https://en.wikipedia.org/wiki/Friends
They add this to their Event Storm in the top-left corner and add Time
before coming up with the spine of the events.
Figure 9.18: The One Where Mary enters the daily Tournament and wins a prize
Individually, the team members came up with all the events within the
boundaries of the example defined. Starting in the center of the map, one
team member put their events on the map, describing the story to the team.
Other team members filled in the gaps to include their events, thus forming
the overarching spine of the story. Here are the first stickies from the
PetBattle Event Storm. There may be other events that sit between these
ones, but the team is happy with the depth they have gone to for the
moment.
When two team members had similar events written differently, both were
initially put on the board and discussed further before agreeing on one
phrasing for the event.
In the example, Cat Photo Reviewed versus Cat Photo Validated was up for
discussion. Ironing out the language of the events is important in bringing the team to a consistent vernacular.
When Mary lands on the leaderboard page, there are many journeys that she
could take. She could log in to play the game, she could vote for a cat in the
currently running game, or she could simply navigate away from the site.
Our convention is to stack these commands as each could represent another
flow that needs to be mapped out. It's important for the team to not drift too
far away from the example they are mapping out, so they are captured and
added to the board but not further explored.
The read model may be more than data. It could also include a low-fidelity
UI mockup. Small UI experiments provide a great opportunity to generate
user feedback. This provides us with product direction without writing a
single line of code.
For example, once the Leader Board Displayed event occurs, we could sketch
out what it looks like. This gives our actor an overview of what options are
available to them and how they could interact with the system.
The PetBattle team's Event Storm is starting to take shape with the addition
of commands, read models, and actors.
The PetBattle team now add any external systems they're integrating with
along with policies/procedures and aggregates.
The purple Policy sticky note is a great way to add rules or habits to the
Event Storm tying commands to events that were not initiated by an actor.
Figure 9.27: Processes, policies, and services added to the Event Storm
The PetBattle team has continued along the flow and event stormed how
Mary can enter the competition. The Event Storm can now be read much
like a story, capturing key things the team could explore now or later.
Reading from left to right, Mary uses her image and ID to add her cat to the
competition, then this command is issued to a cat service that is responsible
for holding the cats' properties, such as image and owner.
Whenever the photo is uploaded, we execute the verify image policy. The
team knows they're storming a happy path of image upload so they make an
assumption that the image is a cat. To avoid getting into how image verification could work just yet, they put down a purple diamond sticky note to return to later. The Not Safe for Families (NSFF) process is about ensuring that images used on the internet are safe for children to see. The need for this
can be traced back to our impact map's deliverables. The team essentially
puts a pin in it here with the purple sticky note, knowing they can return to
it later when other bets have been verified. There may be a lot of information
required for the full implementation of the NSFF process. It may require
further discovery steps. We have now gathered just enough information to
carry on with our Event Storm.
Assuming the cat photo is valid, we immediately enter Mary's cat into the
tournament through the Tournament Entry Policy. The Tournament service
aggregates information from multiple sources into one new meaningful
object. It takes cat.id that's just been uploaded, Mary's ID, and the number of
votes Mary's cat received. The Tournament service works out where Mary's cat sits in the leaderboard along with the other competition entries. When the Animal Lovers play the game now, they will update the total count for
each cat in the current competition.
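Purely as an illustration (the class and field names below are our guesses, not the PetBattle team's actual code), the new meaningful object the Tournament service builds might look something like this in Java:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the Tournament service combines the uploaded cat, its
// owner, and the votes received into one tournament entry, and can rank them.
record TournamentEntry(String catId, String ownerId, int votes) {}

class TournamentService {
    private final List<TournamentEntry> entries = new ArrayList<>();

    void enter(String catId, String ownerId) {
        entries.add(new TournamentEntry(catId, ownerId, 0));   // new entries start with no votes
    }

    List<TournamentEntry> leaderboard() {
        return entries.stream()
                .sorted(Comparator.comparingInt(TournamentEntry::votes).reversed())
                .toList();                                      // most-voted cats first
    }
}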
In an early version of PetBattle, cats were being randomly fetched from the
internet when there were not enough players uploading their own cats.
In an Event Storm, this external system is captured on a pink Post-it. The
following example shows how we could model this interaction. We fetch
the tournament cats from the Tournament service, which results in no
cats being found, triggering the Empty Competition policy. Whenever this
occurs, we always fetch random cats from the internet and add them to
the game. Capturing third-party systems in an Event Storm can highlight
dependencies that exist within our application.
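A hedged sketch of how such a policy might eventually read in code (again, the names are illustrative only): a policy is a reaction with no human actor, listening for an event and always issuing a follow-up command.

// Hypothetical sketch: the Empty Competition policy reacts to an event by
// automatically issuing a command; no actor is involved.
record NoCatsFound(String tournamentId) {}                  // event
record FetchRandomCats(String tournamentId, int count) {}   // command issued automatically

class EmptyCompetitionPolicy {
    FetchRandomCats whenever(NoCatsFound event) {
        // "Whenever no cats are found, we always fetch random cats from the internet"
        return new FetchRandomCats(event.tournamentId(), 10);
    }
}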
In the case of PetBattle, we had some heated debates around the design of
the leaderboard. We couldn't agree on how many cats to show, that is, the top ten or just the top three. We also couldn't agree on whether we should
always show Mary's uploaded cat, even if it was not attracting any votes to
be in the top.
We decided to capture this discussion on a pink Post-it. This Post-it
represents work that needs further exploration and possibly some user
research to find out what functionality the players of the game would like to
see here. As with all work, this can be captured and added to our backlog.
The team could also build out multiple versions of the application, showing
the top three cats and the top ten. Using OpenShift, we could deploy both
versions of the app and gather some feedback on the user engagement
metrics to help determine which version is preferred by the players. In
the next section of this book, we will explore some of the options around
deployment strategy, such as A/B testing, which will enable us to learn
about the optimal solutions based on behavior and impact.
Figure 9.29: Using a pink Post-it for a feature that is yet to be finalized
At the end of the Event Storm exercise, the flow represents a best guess
made by the team knowing only what they currently know. It is not a static
document but one that is revisited as they build some of the features
identified on the map. We now have just enough information to be able to
design how the components of the system could interact.
We'll see how all of this information is used by other practices on the Options Pivot
section of the Mobius Loop in Section 4, Prioritize It.
Emerging Architecture
At this point in time, a traditional architecture approach would be to write a high-
level or detailed design document. This would normally take months and is an activity
that would be carried out by enterprise or solution architects. It usually involves a
heavy dose of UML (Unified Modelling Language) diagramming and many pages of
technical discussion and validation. Let's look at a different approach that better aligns
with our discovery practices so far.
The idea behind emerging architecture is that we get just enough knowledge and shared
understanding of a business domain with just enough understanding of the logical
architecture needed to deliver high business value features. The major architectural approaches can be decided at this stage, such as decoupled services, event-driven architecture, APIs, and streaming applications – and these will subsequently be refined in the following iterations.
In our Event Storm, we drilled down on one or two of the TOWs (The One Wheres). We
want to be confident that the aggregates, commands, and read models would deliver
just the capability described in that part of the business domain – no more and no less.
By taking an emerging architecture approach, we do not invest time in building a
comprehensive architecture that is not yet going to add business value. It may be that
the future direction of the product or re-prioritization of features means that this
architecture is actually never used and therefore time is not wasted.
The team will gather an increased level of shared understanding and confidence in the
overall solution as they incrementally deliver it. This means that any future solutioning
conversations benefit from that additional context and understanding.
After our first pass of the Event Storm, our initial draft of the architecture is depicted
here.
We haven't yet fleshed out the details of the AUTH (Authentication), NSFF, and
Notification processes. These would usually be expanded in subsequent iterations of
the Event Storm practice. However, given the linearity of books, we have added them here to give you an indication of what a future architecture could potentially look like.
4 https://www.redhat.com/en/success-stories/world-health-organization
Our Event Storm was not overly complex but it looked like this after many
weeks and informed revisions. The details of each item are not super important; suffice it to say that the long pink and yellow stickies represented unique APIs we needed to build from a technical point of view. Ultimately, we discovered through our Event Storm a large number of components, not all of which were validated with end users.
In the previous real-world example, we made use of a cache to speed up API calls. Next,
we will look at a practice that will help surface these non-functional aspects of our
solutions.
We can use the Non-Functional Map to gain a shared understanding of the non-functional aspects
of a system in a visually structured representation. On completion, you should have a
good view of the key non-functional considerations for the solution. Mapping these
visually can highlight where work is required to realize them.
It's a very simple practice to run and could easily be done in an hour with the right
people in the room:
1. Ensure that everyone participating is comfortable with the headings and what they
mean.
2. For each heading, capture important considerations or requirements for the
system or solution and add stickies for each item. Depending on the number of
participants, you could use facilitation techniques such as 1-2-4-all or splitting into
smaller groups to capture items.
3. For each heading, play back the stickies as a group.
4. During the playback, think about whether each sticky should be:
• An item added to the backlog and prioritized as it requires work to realize.
• An item that should be included in the Definition of Done for other backlog items.
• Part of the acceptance criteria for other backlog items.
The non-functional wheel is a great practice to start a conversation about all the
different non-functional aspects of the solution. One area where it will not necessarily facilitate conversation is the speed of delivery and the lead time to get code into production. This is a big focus for DevOps and therefore something we want to pay
attention to during our Discovery loop. To shift focus toward continuous delivery, it can
be helpful to capture some baseline metrics around delivery processes.
In this example, we have a software delivery flow that looks like a traditional waterfall
pattern of gathering requirements, developing, testing, and deploying code. Typically,
MBPM would be more granular; for example, instead of saying the developer
implements code, you could break that down to:
• Writing unit tests
• Writing implementation code
• Building code
• Deploying code
• The developer's integration test
• Peer review
• Fixing identified issues from peer review
• Documentation
You may see many problems, and it may be tempting to try to tackle every obvious
problem at once. A better strategy is to start with the biggest and easiest wins first
and prioritize a list of ideas from there. Remember to be agile about the improvements
themselves: focus on small chunks of work to ensure faster delivery of work. Get a fast
feedback loop and re-prioritize based on feedback.
Once you make a change, you need to adjust your map accordingly. You should pay
close attention to the metrics: are the metrics improving given the volume of work
flowing through?
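If it helps to be precise about what "the metrics" on such a map are, the following is a small, hypothetical Java sketch of the three measurements usually recorded per step in an MBPM and how they roll up across the whole flow; the field names are ours.

import java.util.List;

// Hypothetical sketch: each MBPM step records process time (hands-on work),
// lead time (total elapsed time, including waiting), and percent complete and
// accurate (how often the step's output needs no rework), here as a 0-1 fraction.
record ProcessStep(String name, double processTimeHours,
                   double leadTimeHours, double percentCompleteAccurate) {}

class MetricsBasedProcessMap {
    static double totalLeadTimeHours(List<ProcessStep> steps) {
        return steps.stream().mapToDouble(ProcessStep::leadTimeHours).sum();
    }

    static double activityRatio(List<ProcessStep> steps) {
        double work = steps.stream().mapToDouble(ProcessStep::processTimeHours).sum();
        return work / totalLeadTimeHours(steps);        // how much elapsed time is real work
    }

    static double rolledPercentCompleteAccurate(List<ProcessStep> steps) {
        return steps.stream()
                .mapToDouble(ProcessStep::percentCompleteAccurate)
                .reduce(1.0, (acc, pca) -> acc * pca);  // the product across all steps
    }
}

Whatever the exact figures, it is the before-and-after comparison of these roll-ups that shows whether a change actually helped.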
This felt like a perfect situation to bring out the MBPM practice. It wasn't
a practice I'd used previously on new greenfield product development
engagements. But, given this was a legacy brownfield application and the engagement was focused on improvement and optimization, this discovery practice would be helpful.
The team consisted of a representative from the application development
team, a representative from the application operations team, and an
infrastructure specialist. Over the course of a couple of days, these three
people combined all their knowledge of development and operations
processes to create an MBPM.
Some of the key learnings that came out from this were:
• There were long lead times for any development to start because of
the SLA for booking virtual machine time.
• There was a lot of wasted time waiting for builds and deployments to
complete because there was a dependency on a third-party team to
trigger these.
• There was low completeness and accuracy in some of the quality
assurance due to the large amount of manual testing performed.
The yellow sticky notes in the following image show all the processes
mapped into the MBPM. The pink sticky notes represented ideas for how we
could optimize the metrics or even make the process obsolete altogether.
When we returned to the MBPM three weeks later (after three one-week
delivery iterations), we captured on blue sticky notes where there had
been metrics-based improvements as a result of the continuous delivery
infrastructure that had been introduced.
Let's see how PetBattle ran the MBPM exercise to capture some baseline metrics.
PetBattle – MBPM
The current deployment of the PetBattle application is torturous, taking
many hours to manually copy and deploy artifacts to a Raspberry Pi cluster
that also happens to be a media server collection that is installed under the
table belonging to one of the developers.
This seemed concerning to the new engineering team and they suggested
visualizing everything that happens in the existing software delivery process.
This would create a shared understanding of how things worked in the hobbyist app and allow them to ideate on how they could optimize and improve as PetBattle starts to scale.
The team, led by Tim, created an MBPM on the wall. There was a good bit
of discussion about who was involved in releasing PetBattle, the actors, and
when certain events occurred. The team settled on the following flow as
representative of what happened during a release cycle.
One of the PetBattle developers (Mike) would begin the release cycle by
checking out all of the code. The next step was to try and figure out whether
all of the feature branches that needed merging had been merged correctly.
There was some delay in finding all the right code branches. Packaging the
code failed 25% of the time because updating the configuration values was
always manual and error-prone. The deployment to the Raspberry Pi cluster could then begin, which usually took an hour because it was a CPU-intensive task.
There was only one cluster, so this took down the PetBattle service while
deployment happened.
Normally, one of the other PetBattle engineers who understood the
Raspberry Pi hardware the best (Noel) took time to patch and update the
cluster operating system and tools. Often while doing this work, hardware
issues were discovered that required new parts to be ordered. It could take up to three days for the parts to arrive, which often made the whole cluster unavailable, leading to a long wait time for testing a release.
Next, the best tester among the team (Donal) would run through the
manual test plan. Because the release was already deployed, users would
unknowingly use the new version, often with a lot of failures. Donal's testing often failed about half the time, usually because the test plan needed
updating or there were just a lot of bugs! Donal diligently entered the bugs
he found into a spreadsheet to share with the other developers. Often, he
got the details mixed up, meaning that 20% of the reported defects had the
wrong information recorded.
The developers would take a couple of days to fix all of the defects reported
in Donal's spreadsheet. The code was again merged, packaged, and
deployed, often with similar failure rates to the first round of deployments.
Now, the end users could use the polished code. Often, end users were
testing features at the same time as Donal. They would report defects and
outages to the PetBattle email list.
From this practice, the team identified several initiatives that they agreed
should be put in place:
• Use of containers in cloud infrastructure
• Automation of infrastructure
• Continuous integration
• Continuous delivery
• Test automation
• Security automation
Conclusion
In this chapter, we continued our journey on the Discovery loop with a focus on how
we are going to deliver solutions – or, at least, how we are going to start.
We looked at DDD and, in particular, the event storming practice, which helped
visualize the business process flow. Our event-driven architectures started to emerge
from the Event Storm. We started to further drive user research by making bets to form
our backlog for feature development.
We also looked at a couple of practices that help form non-functional work we should
consider – the non-functional map and the MBPM. The latter allows us to take key
measurements associated with legacy software delivery processes and identify some of
the continuous delivery practices we could employ to help improve and optimize them.
This chapter and the previous chapter have generated lots of ideas and candidates for development/configuration work. They are all visualized on the artifacts produced by
the practices.
In the next chapter, Chapter 10, Setting Outcomes, we're going to explore how we
distill all this information into setting measurable outcomes. We'll then look at how we
organize and prioritize all these ideas in a way where we can start to iteratively deliver
value and do so regularly.
Setting Outcomes
10
In the previous chapter, we used practices to help extract lots of detail to confirm what
problems we're going to solve, who we're solving them for, and our approach to solving
them in an experimental and incremental way.
By using practices such as Impact Mapping and human-centered design, we should now
have a good understanding of the customer and business pain points. Before we dive
into developing software solutions, we need to translate those problems into potential
business and customer outcomes. And we need to use all the information captured
from those artifacts and the information radiated from those practices to ensure we
deliver outcomes that will really matter.
In this chapter, we're going to explore all the learning done in the Discovery section of
the Mobius Loop so far to clearly define and articulate the goals and outcomes the team
is striving to deliver to happy stakeholders and end users. This includes:
• Explaining the difference between outputs and outcomes
• Why we should have Target Outcomes
• How to capture Target Outcomes
• Some examples of Target Outcomes
• How to visualize Target Outcomes
• How to optimize and chain Target Outcomes with other practices
Setting Target Outcomes can include end user (customer-)based outcomes as well as
business- and capability-based outcomes. When using Mobius, outcomes should be
measurable. We should always be able to assign a numerical value to the current state
and the target state. By regularly inspecting these measures, we can establish whether
we are moving toward the outcome, whether we've met it, or if we need to pivot and
change our approach if we're not meeting them.
What Is an Outcome?
As described in Section 1, Practices Make Perfect, the Mobius Loop encourages an
iterative, experimental approach to achieving outcomes through a continuous flow of
innovation using continuous discovery and continuous delivery. An outcome is a result
that occurs from doing something. In our case, the outcome is a change in human
behavior that drives change in the culture, impacting long-term business results.
The Target Outcomes practice helps teams discover, write, align on, and share the
desired behavioral change they would like to see achieved by application products.
Sharing Target Outcomes helps the team deliver measurable results and align with
stakeholders and team members. Referring to Target Outcomes can help prioritize and
filter work so that we are always focused on delivering measurable impact.
The Target Outcomes practice involves creating a canvas that summarizes the anticipated and hoped-for measurable change captured by the team and stakeholders. This artifact
serves as an information radiator, reinforcing Target Outcomes used during other
practices, activities, and discussions from other parts of the Mobius Loop. The Measure
and Learn element of the Delivery Loop focuses on evaluating the outcomes. We'll
explore that in Section 5, Deliver It. This chapter is about setting the outcomes when in
the Discovery Loop.
In the previous chapter, we introduced Impact Mapping and described the difference between Impacts, which are measurable changes we want to achieve on the Actors, and Deliverables, which are implementable pieces of work. Let's dive deeper into the difference between outcomes and outputs because it is very important to understand this distinction.
1 https://www.senseandrespondpress.com/managing-outcomes
2 https://openpracticelibrary.com/practice/start-at-the-end/
3 https://openpracticelibrary.com/practice/news-headlines-aka-cover-story/
4 https://openpracticelibrary.com/practice/start-with-why/
5 https://openpracticelibrary.com/practice/affinity-mapping/
6 https://mobiusloop.com/kit/
Alternative formats exist that can help define outcomes. Experiment with alternative
formats to discover if the team creates outcomes more effectively using them. For
example, the SMART criteria (introduced in the previous picture) and summarized
again below define five characteristics for an outcome or goal: specific, measurable,
achievable, realistic, and time-based.
A stronger outcome statement is one that is measurable or can be broken down into
a number of measures. Our approach to delivery is centered around moving to an
iterative approach with regular releases of incremental value. We want to measure
the outcomes as of now and then be able to measure the outcomes as we start to run
experiments with feature ideas.
When starting our delivery, we should be able to regularly pause and measure the
impact of the experiment or initiative to ensure that we learn from it and what we are
doing is either taking us toward our target-measurable outcome or triggering a pivot or
re-think.
Let's look at some examples.
Other Discovery Loop activities reveal that operations stakeholders are nervous about
more frequent product deployments. Frequent deployments must not result in additional planned or unplanned work for operations personnel.
The team uses the Target Outcomes practice to:
• Discuss the results of the MBPM practice and other Discovery practice findings.
• List potential goals of the team's project.
As a result of the Target Outcomes practice, the team creates the information radiator
visualizing three Target Outcomes.
In this example, the service provider previously had a service level agreement to always
resolve a complex provisioning issue (including, if necessary, site visits) within 30 days.
As a result of the upcoming product development, the target is to move that to 20 days.
In doing this, the organization hopes to improve customer satisfaction survey scores
(out of 5) from 3.2 to 3.5. They want to do this without creating any additional manual work for those working in operations.
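Purely as an illustration of how little is needed to make an outcome measurable (the structure and the interim figure below are ours, not part of the practice), that first outcome boils down to a baseline, a current measurement, and a target:

// Hypothetical sketch: a Target Outcome reduced to three numbers so progress
// can be inspected at any time (assumes the target differs from the baseline).
record TargetOutcome(String statement, double baseline, double current, double target) {
    double progress() {
        // fraction of the gap between baseline and target closed so far
        return (current - baseline) / (target - baseline);
    }
}

If the resolution time had moved from 30 days to, say, 26 days, new TargetOutcome("Days to resolve a complex provisioning issue", 30, 26, 20).progress() would report 0.4, that is, 40% of the way to the target.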
As we gain more information from customers feeding back on increments of the evolving product, it can lead to changes in direction and perhaps a need for new or revised Target Outcomes. We'll explore this in much more detail when we look at
the Options Pivot coming out of Discovery (in the next chapter) and the Options Pivot
coming out of Delivery in Section 5, Deliver It.
Target Outcomes is not a practice that can be done in isolation. It needs to be fed by several other practices, and its results drive inputs to several more.
Let's get back to our PetBattle team and explore the Target Outcomes that resulted
from their trip around the Discovery Loop.
which crashed the system, they had very poor user experience and no
chance to potentially gain revenue as PetBattle was not online for users to
access! Target Outcome #3 is a team culture-based outcome – by having
an engaged team who have a real sense of product ownership, they will be
happier in their workplace and be more responsive and motivated to grow
and improve the product in the future (another assumption we can test!).
Let's take our first Target Outcome, which is that PetBattle is generating
revenue from an increased active user base. This can easily be seen as a
measurable outcome. We can quantify at any given time the number of users
registered in our system, the number of visitors to our site, the number
of engaged users, the amount of engaging activity users are doing, and, of
course, revenue secured through advertisers.
We can nicely visualize the baseline metrics we have at this point and the
target metrics we want to get to from near future increments of the product.
Figure 10.8: Target outcome – Increased active user base (in thousands)
Let's look at our second Target Outcome: that PetBattle is always online.
From the early version of the app, there have been stability issues due to its huge popularity. We can collect a metric now on what we believe represents
the uptime of the existing site and visualize how we want to improve this to
the target-measurable outcome of always being online.
In the example above, we saw three very different focuses on outcomes. They were not
just about the application or business but spread between the application, what the
platform enables, and the performance and motivation of the team. This is a pattern we
have observed with many customers.
A more recent set of outcomes with a customer in the United Arab Emirates
was as follows:
• [Customer] can seamlessly deploy workloads several times a day to
prove self‑service capabilities.
• [Application] is seamlessly deployed as an example to prove a
repeatable pattern for deploying N other applications and to decrease
the time to market for changes.
• Development is a first-class citizen. Multifunctional teams are
empowered using Site Reliability Engineering best practices as a
Product Team.
These outcomes start to introduce the idea of something measurable. The
magic word in the second outcome is decrease. There is a measurable
outcome here around decreasing time to market for changes. This
is something that could be measured before, during, and after the
engagements.
The final example comes from a non-government organization focused on
healthcare:
• Increased level of empowerment and skills such that the [application]
can be confidently owned and managed.
• Create a more Open Data model for [application] for [users] to be
increasingly attracted to our system.
• A clear development lifecycle on a scalable, secure, and stable platform
as measured by increased security, automation, and monitoring
(auditability)
Again, we see the balance of team, application, and platform in these three
outcomes. All three outcomes can also be measured.
We've now seen a few different examples of Target Outcomes set during engagements
with different customers. One thing you may have noticed is that Target Outcomes can
range from being very end user-centric to being more about the technology, platform,
and team being used to develop the application. This leads us to explore the idea of
primary outcomes (more end user-centric) and enabling outcomes (delivered by the
platform and team).
7 https://mobiusloop.com/kit/
Let's look at some common examples of outcomes that are a mix of primary and
enabling outcomes.
• Increasing Speed and Quality with Customers:
  • Improve Customer Satisfaction: Measuring customer happiness using tools such as Net Promoter Score (NPS).
  • Reduce Lead Time: Elapsed time from the initial customer request or contact to delivery triggering customer happiness.
• Increasing Quality:
  • Reduce Outages: Number of operational incidents leaving customers with no service.
  • Reduce Defects: Number of defects in the product that impact customers.
• Sustainability:
  • Improve Team Satisfaction: A measurement of the happiness of a product team.
  • Increase Employee Retention: Turnover rate of employees.
• Working:
  • Increase Throughput: Total work items completed per unit of time.
  • Improve Predictability: Percentage of time team delivers on their commitments.
  • Increase Agility: Elapsed time to pivot to new priorities.
• Efficiency:
  • Reduce Cycle Time: The elapsed time to complete delivery activities such as build, test, and deployment of new features.
  • Reduce Work in Progress (WIP): Number of items a team is working on at any moment.
  • Reduce Technical Debt: How much you're spending on reworking your codebase.
The organization's specific context will drive whether they are more focused on one or
two of these and set them as their primary Target Outcomes. If the motivation is more
business-focused, their primary outcome is likely to be around the performance of their products with their customers, and their customers' satisfaction.
Nearly all of the examples above can also be enabling outcomes, and we can form
hypotheses of how improving these can drive improvements in more business-focused
outcomes. These examples also steer our focus toward exploring software delivery
techniques (such as DevOps) and the platform (such as OpenShift).
As this book has started to demonstrate, teams that successfully adopt the practices
being introduced will see a positive effect in these software delivery metrics. The
cultural practices, along with other practices such as metrics-based process mapping, identify the bottlenecks and handovers that, when removed, can shorten lead time for change. The continuous integration and continuous delivery practices we use when building pipelines allow us to increase deployment frequency. And, with increasing levels of
automation and continuous learning from practices such as retrospectives, we can keep
Mean Time to Repair (MTTR) and Change Failure Rates to a minimum.
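As a rough, hedged illustration of what measuring these looks like (the record and method names are ours, not a prescribed schema), the four software delivery measures can be derived from little more than a log of deployments and the incidents they caused:

import java.time.Duration;
import java.util.List;

// Hypothetical sketch: the four software delivery measures derived from a
// simple log of deployments and the incidents they caused.
record Deployment(Duration leadTimeForChange, boolean causedFailure, Duration timeToRestore) {}

class SoftwareDeliveryMetrics {
    static double deploymentsPerWeek(List<Deployment> log, int weeksObserved) {
        return (double) log.size() / weeksObserved;                       // deployment frequency
    }

    static Duration averageLeadTime(List<Deployment> log) {
        long minutes = (long) log.stream()
                .mapToLong(d -> d.leadTimeForChange().toMinutes())
                .average().orElse(0);
        return Duration.ofMinutes(minutes);                               // lead time for change
    }

    static double changeFailureRate(List<Deployment> log) {
        long failures = log.stream().filter(Deployment::causedFailure).count();
        return (double) failures / log.size();                            // change failure rate
    }

    static Duration meanTimeToRepair(List<Deployment> log) {
        long minutes = (long) log.stream()
                .filter(Deployment::causedFailure)
                .mapToLong(d -> d.timeToRestore().toMinutes())
                .average().orElse(0);
        return Duration.ofMinutes(minutes);                               // MTTR
    }
}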
Platforms such as OpenShift provide an even further level of enablement. Without such platforms, we wouldn't be able to achieve market agility and reliability for the software products deployed on them.
So, if the platform is the lowest level of enablement, are there metrics we can baseline
and regularly check to ensure this enablement is being passed up to software teams
and eventually to the business? Let's look at a few platform adoption metrics.
Again, we start to see the impact these measures can have on software delivery and
operations of applications running on the platform, which, in turn, can be measured
by the four software delivery metrics. And these metrics have a lagging effect on the quality of the software, the organization's products, and the happiness of their customers and employees. So, the value of capturing, measuring, visualizing, and sharing these metrics cannot be overstated.
Let's look at our approach for growing this metric-centric mindset.
Conclusion
In this chapter, we focused on setting Target Outcomes for the business problems and
opportunities we uncovered in the previous chapters.
We explained what an outcome is and how it is different from an output, and how we capture, write, measure, and visualize outcomes.
We dived deeper into measurable outcomes and metrics and explored the difference
between primary outcomes and enabling outcomes. Enabling outcomes in a DevOps world are heavily focused on software delivery metrics and platform metrics.
You can read more and chat about the practice of setting Target Outcomes in the Open
Practice Library at openpracticelibrary.com/practice/target-outcomes/.
Figure 10.13: Adding a further practice to the Delivery Loop to set Target Outcomes
In the next chapter, we'll move to the Options Pivot and look at practices to prioritize
and plan what we can deliver incrementally. We'll look at the trade-offs between feature
delivery and non-functional work to improve software delivery and platform. And, we'll
explore some of the advanced deployment capabilities offered by OpenShift and other
platforms that will help us further decide on how to deliver the best value and products
to customers.
Section 4: Prioritize It
In Section 3, Discover It, we worked our way around the Discovery Loop. We started
with Why—why are we embarking on this initiative? What is our great idea? We used the
North Star to help us frame this. We defined the problem and understood the context
further by using the Impact Mapping practice to align on our strategic goal. Impact
Mapping helped us converge on all the different actors involved that could help us
achieve or impede our goal. Impact Mapping captures the measurable impacts we want
to effect and the behavioral changes we would like to generate for those actors. From
this, we form hypothesis statements about how the different ideas for deliverables may
help achieve these impacts.
We refined this understanding further by using the human-centered design techniques
and Design Thinking practices such as Empathy Mapping and Contextual Inquiry to
observe and connect with our actors. We explored business processes and domain
models using the Event Storming practice by generating a shared understanding of
the event-driven process. Using the Event Storming notation, a microservices-based
architecture started to emerge. We also discovered non-functional aspects of the
design by using Non-Functional Maps and running Metrics-Based Process Mapping.
The Discovery Loop presented lots of ideas for things we can do in our delivery cycles—
features we can implement; architectures that emerge as we refine and develop the
solution by repeated playthroughs of the Event Storm; research that can be performed
using user interface prototypes or technical spikes that test our ideas further;
experiments that can be run with our users to help get an even better understanding of
their motivations, pain points, and what value means to them; and processes we can put
in place to gather data and optimize metrics.
From just the first iteration of the Discovery Loop, it would be very easy to come
up with hundreds of different tasks we could do from all the conversations and
engagement that those practices generate. It can be a minefield visualizing all these ideas, and it can take a small team weeks, if not months, to generate tasks from just a short iteration of the Discovery Loop! So, we need to be careful to ensure we remain
focused on delivering value, outcomes that matter, and that we don't get bogged down
in analysis-paralysis in a world filled purely with busyness!
Before we left the Discovery Loop, we took time to translate all of this learning
into measurable Target Outcomes. This started with the primary target outcomes
associated with the business product, but we also took time to recognize some of
the secondary targets and enabling outcomes that can help support development—
especially those that can be enabled by software delivery processes and underlying
platforms such as OpenShift.
With these outcomes visualized and presented using big visible Information Radiators,
supporting metrics can also be baselined and radiated. We can now think about all
those tasks and ideas that resulted from the Discovery Loop. But we can only do so by
keeping an eye on those outcomes at all times and ensuring everything we do is directly
or indirectly going to take us toward achieving them. This is where the real fun begins,
because we're going to explore how we're going to achieve those measurable outcomes.
Mobius uses the word options instead of solutions, or the dreaded term requirements.
Until we validate our ideas, they are simply wild guesses, so calling them solutions
or saying they are required is not logical and there is no evidence to support them.
Instead, we call them potential solutions, or options, and we get to test them out in the Delivery Loop to prove or disprove the hypotheses we have formed around those options. This drives us toward a more data-driven approach rather than simply guessing.
When we are on the Options Pivot, we decide which of the outcomes we are going
to target next. We choose which ideas or hypotheses we need to build, test, validate,
and learn from, as well as exploring how we might deliver the options. We also need
to get a sense of priority. We never have the luxury of infinite time and resources, so
prioritization is always going to be the key to achieving business value and fast learning.
Learning fast is an important aspect here. We want to generate options that can
validate, or invalidate, our ideas from the Discovery Loop so we can ultimately revisit
and enhance them. Fast feedback is the key to connecting the Discovery artifacts with a
validated prototype.
Chapter 11, The Options Pivot, will focus on the practices we use before we begin a
Delivery Loop. We will return to the Options Pivot again after the Delivery Loop in
Section 7, Improve It, Sustain It, when we take the learnings and measurements that
have resulted from the latest Delivery Loop iteration and decide what to do next given
these findings.
The Options Pivot
11
During the Discovery Loop, we started to come up with lots of ideas for
implementation. The Impact Map gave us deliverables that formed hypothesis
statements. The human-centered design and Empathy Mapping practices gave us
ideas directly from the user. The Event Storm gave us standalone features (triggered
by commands) that can be implemented using standalone microservices (codifying the
aggregate). The Metrics-Based Process Map and Non-Functional Map gave us ideas
on how we can speed up the development cycle and improve security, maintainability,
operability, scalability, auditability, traceability, reusability, and just about anything else
that ends with ability!
The next step after the Discovery Loop is the Options Pivot, where all the information
from these practices that we've used gets boiled down to a list of options for actions to
take and decisions to make on what to deliver next.
The Options Pivot is the heart of the Mobius Loop. On the left-hand side of it is where
we absorb all the learning and Target Outcomes we aligned on in the Discovery Loop.
We generate further ideas. We refine ideas on what to deliver next and then choose
the options to work on. Later in the book, in Chapter 17, Improve It, we'll look at the
right-hand side of the Options Pivot. This is where we adapt our approach based on
the measurements and learnings from a completed iteration of the Delivery Loop. We
decide whether to do more Discovery, more Delivery, or Pivot completely. We refine
what to discover or deliver next.
Value Slicing
We are approaching the part of the Mobius mental model where we will start delivering
increments of our solution. They will vary from running short prototypes and technical
experiments or spikes, to conducting defined user research, to implementing features
that have resulted from Event Storming and other Discovery practices.
An iteration of the Delivery Loop is not prescribed in length. If you are using a popular
iterative agile delivery framework such as Scrum, an iteration of the Delivery Loop
translates well to one sprint (a fixed time-box between one and four weeks). If you
are using a more continuous delivery approach such as Kanban to enable an ongoing
flow of value, each Delivery Loop may simply represent the processing of one Product
Backlog item and delivering it into the product. You may even be using a non-agile
delivery methodology such as Waterfall whereby the Delivery Loop is more singular
and slower to move around. The Mobius Loop is agnostic to the delivery approach. But
what is consistent regardless of the delivery approach is the idea that we seek to deliver
high‑value work sooner, establish important learning more quickly, and work in small
batch sizes of delivery effort so we can measure and learn the impact to inform our
next set of decisions.
To help us break down all our work items and ensure they are grouped to a level
that will form small increments of value, we use popular visualization and planning
practices.
Simple path mapping techniques break the work down by mapping back from the Target Outcomes to the smallest number of steps needed to deliver them. There are many other practices, such as journey mapping, story mapping, future state mapping, service blueprints, and more. Mobius is less concerned with the how, as long as you focus on finding the simplest way to deliver the outcomes. The technique we have found to work very effectively is called Value Slicing.
Let's look at how we approach Value Slicing.
First, we note all of the standalone work ideas that have been generated by the
Discovery practices. Our focus here is now on Outputs (and not Outcomes) as we want
to group all of our deliverables together and form an incremental release strategy that
delivers the outcomes. A starting point is to copy each of the following from existing
artifacts:
• Deliverables captured on the Impact Map
• Commands captured on the Event Storm
• Ideas and feedback captured on Empathy Maps
• Non-functional work needed to support decisions made on the Non-Functional
Map
• Ideas and non-functional features captured during discussion of the
Metrics‑Based Process Map (MBPM)
• All the other features and ideas that have come up during any other Discovery
Loop practices you may have used and the many conversations that occurred
Here are a couple of tips we've picked up from our experience. First, don't simply move
sticky notes from one artifact to this new space. You should keep the Impact Map,
Event Storms, Empathy Maps, MBPMs, and other artifacts as standalone artifacts, fully
intact in the original form. They will be very useful when we return to them after doing
some Delivery Loops.
Second, copy word-for-word the items you're picking up from those practices. As
we'll see in the coming chapters, we will really benefit when we can trace work items
through the Discovery Loop, Options Pivot, and Delivery Loop, so keeping language
consistent will help with this. Some teams even invest in a key or coding system to
show this traceability from the outset.
Figure 11.1: Collecting information and ideas from Discovery Loop practices
To start with, simply spread all the items across a large work surface. There's something
very satisfying about standing back and seeing all the possible work we know of in
front of us. It can be amazing to see just how much has been ideated from those
few practices. It can also be a bit chaotic and daunting. This is why we need to start
organizing the work.
If you're working virtually with a distributed team, a canvas such as the one available
for download from the book's GitHub repository may be helpful.
Next, remove any duplicates. For example, you may have identified a deliverable on
your Impact Map and the same feature has ended up in your Event Storm. Your user
interviews may also have found similar feature ideas captured on Empathy Maps. Where
there are identical features, remove the duplicate. If an idea can be broken down into
smaller standalone ideas, refactor and re-write your sticky notes so that each idea
stands alone. The more the better in this practice!
The next step is to categorize each of the items into some kind of common theme and
give that theme a title. We're looking for something that brings all of the items together.
If you were to put each item into a bucket, what would the label on the bucket be? A
top tip is to start with the Target Outcomes that were derived from the Discovery Loop
and set them as the headings to categorize each item under. The reason we do this
is that we want to work with an outcome-driven mindset. We have agreed on some
Target Outcomes so, really, every work item we are considering should be taking us to
one or more of those outcomes. If we pick any one of the items and can't easily see an
outcome it will help achieve, we should be questioning the value of doing that thing at
all. (There are cases where such items that don't map to outcomes are still important,
so if this does happen, just give them their own pile.)
We should end up with all items in a neat, straight column directly beneath the Target
Outcome they are categorized under.
If we have a good, well-thought-out set of Primary Outcomes and Enabling Outcomes,
it should be a very positive exercise mapping all of the features, experiments, research
ideas, and so on to an outcome. This exercise should be collaborative and include all
members of the cross-functional team. Developers, operators, designers, Product
Owners, business SMEs, and so on will all have been involved and provided input to the
preceding Discovery Loop practices. They should remain included during the Options
Pivot to ensure their ideas and initiatives are understood and included on the map.
The resulting visualization of work should include functional features and
non-functional initiatives. All of the work that can take place on the platform to enable
faster and safer development and quicker release of product features should be shown.
If we stand back at the end of the exercise, we should see our delivery loops
starting to emerge.
The next step is to prioritize all the tasks and items on the board. This is never easy but
nearly always needed. If you have worked on a project where time has not been an
issue and it's been obvious that the team will have all the time they need to confidently
deliver everything asked of them, you are in a unique position! That has never happened
to us; there has always been a need to prioritize work and choose what not to do!
This can start with the Product Owner forming his or her own perspective on priority.
However, as we progress through this chapter, we'll look at a few practices and tools
that you can bring out to help with prioritization in a collaborative environment and
drive consensus. Executing those practices can then be reflected on this value map
we're creating.
We like to attempt to prioritize each column. So, take each Target Outcome with all of
the features and other items that we believe will achieve them and prioritize them. The
most important and compelling items should be at the top. These are the items that
need to be prioritized above anything else if you are to achieve the outcome. The lesser
understood or "nice to have" items should be further down the column.
The final stage is to slice value out of the value map. Using some sticky tape (ideally
colored, such as painters' tape), we ask the person who holds overall responsibility
for prioritizing work and articulating value (usually this is the Product Owner for
a team using Scrum) to slice horizontally what they see as a slice of value for the
whole product. This means looking at the most important items for each theme and
combining them with some of the other highly important items from other themes.
At this point, our Product Owner has a huge amount of power. They can prioritize
within a given outcome. They can prioritize a whole outcome and move everything
down or up. They can combine items together from different outcomes to form
proposed releases. They can slice one, two, three, or fifty slices of value – each
one containing one, two, or more items. Most importantly, they can facilitate
conversations with all stakeholders and team members to arrive at a consensus on this
two-dimensional Value Slice Map.
During many years of using these practices, we've picked up a few facilitation tips to
help explain them correctly. The first involves how you might explain the prioritization
of two equally valuable activities.
I started by saying, Obviously, we'd like to do all this work, after which one
of the senior stakeholders interrupted and said, YES! We need to do all this
work. I could sense there was some discomfort among stakeholders, as if
I was making a typical consultant's move of locking down scope when the
stakeholders wanted everything built. Perhaps my leading choice of words
could have been better.
But I wasn't trying to decide what was in and out of scope. My whole agile
mindset is based on flexible scope, the ability to adapt and change scope as
we learn more, and always ensuring we're delivering the next most valuable
and important work.
To explain my mindset, my thoughts fast-forwarded to a team social we had
planned for later that day. It had been a long week and we had planned to go
for a few drinks and a curry – again boosting our cultural foundation further
by allowing the team to relax and get to know each other a bit better.
I was looking forward to having a beer and having a curry after that beer.
In fact, I was really looking forward to that beer. I felt we'd really earned it
that week and it was going to be great to raise a glass and say cheers with
my new team! But that didn't mean that the curry wasn't important. Nor did
it mean that the curry was not going to happen. We were going to have a
beer first followed by a curry. That was how we'd prioritized the evening. We
hadn't de-scoped anything nor were we planning to. The beer was in my top
slice of value. The curry was in my second slice of value.
The team felt more relaxed understanding we were not de-scoping any work
at all using this practice but simply organizing by value. The team also felt
very relaxed and enjoyed both a beer and a curry!
We've also learned a few simple tricks that can help set up the Value Slicing practice to
work effectively.
As I observed the team working through this process, I realized that the
single line of tape had given a misleading impression of this practice. There
was a reluctance to put anything beneath the line because there was a
perception that this meant out of scope. I explained that this was not the
case and that what I was trying to do was slice out the Minimum Viable
Product, or MVP. The MVP is the minimum set of features that could form a
product that can be released to users to learn from and build upon. In
reality, many stakeholders see defining the MVP as something negative, as
it's where they lose all the great innovative features that they may want but
that are not collectively deemed important. I actually try to avoid using the
term MVP, as it is often greeted with some negative emotion.
I learned from this facilitation never to use just one slice, as we are not
defining things as in or out of scope and we are not defining just the MVP.
Working with another customer in Finland, I took this learning and adapted
my facilitation approach. With all the items that had been captured from
the Discovery Loop on the map, I produced three slices of tape. Hopefully
now the Product Owner and stakeholders would not fall into the in-scope/
out-of-scope trap. However, now there was a new misunderstanding! For
this particular engagement, which was an immersive four-week Open
Innovation Labs residency focused on improved operations, we had planned
three one-week sprints. By coincidence, I had produced three slices of tape
for Value Slicing. So, the stakeholders and Product Owner assumed that
whatever we put in the first slice would form the scope for Sprint 1, the
second slice would be Sprint 2, and the third slice would be Sprint 3.
Figure 11.7: Value Slicing of the items captured from the Discovery Loop
I explained that this was not the case. We do not yet know how long it will
take the team to deliver each item in each slice. We will use other practices
in the Delivery Loop to help us understand that. We could end up delivering
more than one slice in one sprint. Or, it may take more than one sprint to
deliver one slice. We just don't know yet.
Since then, I have tweaked my facilitation further. When making the slices
available, I now produce lots of them – at least 10, sometimes more than 20.
I also make the roll of tape accessible and tell the Product Owner to use as
many slices as they would like – the more the better, in fact! I've found Value
Slice Maps now often have many more slices.
A Product Owner from a UK defense company once remarked to me that
you could argue that each item on the Value Slice board could be its own
slice of value. I celebrated with a massive smile when I heard this. Yes! When
we reach that mindset and approach, we truly are reaching the goal of
continuous delivery.
Visualizing and slicing increments of value has evolved from the amazing thinking
and work Jeff Patton first shared in 2008 and later described in his book User Story Mapping1.
User Story Mapping is an effective practice for creating lightweight release plans that
can drive iterative and incremental delivery practices. We highly recommend reading
Patton's book and trying out the exercise he describes in his fifth chapter about
visualizing and slicing out the value of something very simple, like everything you do
in the morning to get up, get ready, and travel to work. We use this exercise in our
enablement workshops and find it really brings the practice to life well.
Let's look at how the PetBattle team approached Value Slicing.
1 https://www.jpattonassociates.com/user-story-mapping/
As the team explored the outcomes, they thought it might help to break
them down a bit further to build a shared understanding with stakeholders.
The Impact Map had driven focus on four outcomes:
• Increased participation rate of the casual viewer
• Increased uploads
• Increased site engagement of the uploaders
• Increased number of sponsored competitions
Collectively, these would all help with the first primary outcome where
PetBattle would be generating revenue from an increased active user base.
So, these were added to the Value Slice Map:
So, they had a huge collection of outputs spread over one wall and an
organized set of outcomes as headings on another wall:
The outputs moved under the second and fourth outcomes were sourced
from the MBPM and Non-Functional Map. This was also true for the third
outcome, which also included some of the ideas captured during the early
social contract and real-time retrospective that was started when building
the cultural foundation.
The team ended up with a User Story Map that showed the initial journey
through PetBattle as well as the journey the team would go on to deliver and
support it:
Looking at the top slice of value brought a feeling of excitement. The team
could see the first items they were going to do to make the PetBattle vision
a reality!
Design of Experiments
All our ideas for new products, services, features, and indeed any changes we
can introduce to make things better (more growth, increased revenue, enhanced
experience, and so on) start off as a hypothesis or an assumption. In a traditional
approach to planning, a team may place bets on which experiment to run based on
some form of return on investment-style analysis, while making further assumptions in
the process.
Design of Experiments is an alternative to this approach, in which we try to validate
as many of the important ideas/hypotheses/assumptions we are making as early as
possible. Some of these hypotheses and assumptions we may want to keep open until
we get some real-world proof, which can be gathered through some of the advanced
deployment capabilities (such as A/B Testing) that we'll explore later in this chapter.
Design of Experiments is a practice we use to turn ideas, hypotheses, or assumptions
into concrete, well-defined sets of experiments that can be carried out in order to
achieve validation or invalidation – that is, provide us with valuable learning.
Design of Experiments is a fail-safe way to advance a solution and learn fast. It can
provide a quick way to evolve a product, helps drive innovation in existing as well as
new products, and enables autonomous teams to deliver on leadership intent by placing
small bets.
You may need more than one experiment for each item (idea, hypothesis, assumption).
An experiment usually only changes a small part of the product or service in order
to understand how this change could influence our Target Outcomes. The number
of experiments is really defined based on what you want to learn and how many
distinctive changes you will be introducing.
Once described, the experiments can be implemented, tracked, and measured in order
to analyze the outcomes. In an ideal world, an experiment will have binary success/
failure criteria, but most often we need to analyze data using statistical methods to
find out if there is a significant correlation between the change introduced with the
experiment and the change in the Target Outcome.
NOTE
Successful experiments are not experiments that have proven our assumption is
correct. Successful experiments are those that provide valid and reliable data that
shows a statistically significant conclusion.
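To make "statistically significant" a little more concrete, here is a minimal Python sketch of how a team might analyze a simple two-variant experiment using a two-proportion z-test. The counts, the 1.96 threshold (roughly p < 0.05), and the variant names are illustrative choices of ours, not something the practice prescribes.

# Minimal sketch: is the difference between two experiment variants significant?
# All counts are invented purely for illustration.
import math

control = {"exposed": 1000, "converted": 110}   # existing version
variant = {"exposed": 1000, "converted": 140}   # version with the change

p1 = control["converted"] / control["exposed"]
p2 = variant["converted"] / variant["exposed"]
pooled = (control["converted"] + variant["converted"]) / (control["exposed"] + variant["exposed"])
std_err = math.sqrt(pooled * (1 - pooled) * (1 / control["exposed"] + 1 / variant["exposed"]))
z = (p2 - p1) / std_err

# |z| > 1.96 corresponds to roughly p < 0.05 for a two-sided test.
if abs(z) > 1.96:
    print(f"Significant difference (z={z:.2f}): the change appears to move the outcome")
else:
    print(f"No significant difference (z={z:.2f}): gather more data or try a bigger change")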
The sources of ideas and hypotheses are all the Discovery practices, such as
Event Storming, Impact Mapping, and Empathy Mapping. While we perform these
practices, ideas for possible improvements often emerge and new hypotheses form.
Before adding any of them as items to the Product Backlog, these ideas and hypotheses
would typically need some research, analysis, and further elaboration.
Once prioritized, these ideas and hypotheses may lead to:
• New features being added through the User Story Map and Value Slice board
• Complete new features being broken down into smaller features or User Stories
and refactored on the Value Slice board
• User Research
• Design of Experiments
• Technical Spikes and UI Prototypes
The Impact and Effort Prioritization matrix has its own Open Practice Library page at
openpracticelibrary.com/practice/impact-effort-prioritization-matrix/ – a great place
to continue the learning and discussion about this prioritization practice.
A slightly different perspective on prioritization is achieved using the How-Now-Wow
Prioritization practice. Whereas the previous practice is used to filter out and prioritize
the very high-impacting features, this practice is used to identify and prioritize the
quick wins and base features needed for a product.
How-Now-Wow Prioritization
How-Now-Wow is an idea selection tool that is often combined with Brainstorming,
How-Might-We2 (HMW), and Design of Experiments. It compares and plots ideas on a
2x2 matrix by comparing the idea's difficulty to implement with its novelty/originality.
Similar to the Impact and Effort Prioritization Matrix, How-Now-Wow Prioritization
is simple, easy to understand, and very visual, and can include the whole team in the
process of transparent selection of ideas/hypotheses to work on first.
Again, the sources of ideas and hypotheses are all the Discovery practices, such as
Event Storming, Impact Mapping, HMW, and Empathy Mapping. When we perform
those aforementioned practices, often ideas will emerge for possible improvements or
new hypotheses may form.
We can plot each of these on the How-Now-Wow matrix by assessing each item and
considering how easy or difficult it is to implement (using team members to collaborate
and align on this) and how new and innovative the feature is.
2 https://openpracticelibrary.com/practice/hmw/
Four separate groups of ideas emerge from this practice. There are three we're
particularly interested in:
1. Now Ideas: Easy to implement and considered normal ideas for the product. These
should be considered in the higher slices of the Value Slice Map and are ideas we
would expect to deliver.
2. Wow Ideas: Easy to implement and considered highly innovative or new for the
product. These should be considered as potential ideas for the higher slices of
the Value Slice Map and would be particularly valuable if innovation and market
differentiation were deemed high-priority focus areas. If the Priority Sliders
practice has already been used, it may provide some direction here.
3. How Ideas: Hard to implement and considered highly innovative or new for
the product. These would benefit from further research to understand the
implementation difficulty and potential impact further. Design of Experiments,
prototyping, and further User Research will help validate whether this innovation
is something that would be well received. Technical Spikes and research will help
establish confidence and potentially easier solutions to implement.
4. Other ideas: Hard to implement and not particularly innovative or new for the
product. We're not interested in these ideas at all.
Once placed on the How-Now-Wow matrix, these ideas and hypotheses may lead to:
• New features being added through the User Story Map and Value Slice board
• Complete new features being broken down into smaller features or User Stories
and refactored on the Value Slice board
• User Research
• Design of Experiments
• Technical Spikes and UI Prototypes
Figure 11.20: Different stakeholders collaborating to gain a better understanding of the product
Both practices highlighted some features that need more research. Ideas categorized
in the How quadrant of the How-Now-Wow matrix, and those in the High Effort/High
Impact quadrant of the Impact and Effort Prioritization matrix, will benefit from
additional research.
Many of the human-centered design practices outlined in Chapter 8, Discovering the
Why and Who, will help with this research. This includes Empathy Mapping, qualitative
user research, conceptual design, prototyping, and interaction design. If the feature
area is of very high importance, it may be valuable to invest in a specific practice that
will really further the understanding of the feature – the Design Sprint.
The Design Sprint
The process phases include Understand, Define, Sketch, Decide, Prototype, and
Validate.
The aim is to fast-forward into the future to see your finished product and customer
reactions, before making any expensive commitments. It is a simple and cheap way to
validate major assumptions and the big question(s) and point to the different options
to explore further through delivery. This set of practices reduces risks when bringing
a new product, service, or feature to the market. Design Sprints are the fastest way to
find out if a product or project is worth undertaking, if a feature is worth the effort, or
if your value proposition is really valid. For the latter, you should also consider running
a Research Sprint.3 A Design Sprint compresses this work into one week and, most
importantly, tests the design idea and provides real user feedback in a rapid fashion.
By now, there are many different variations of the Design Sprint format. You may come
across the Google Ventures variation – the Design Sprint 2.0 – which is the agenda
shown below. The best thing to do is to try different variations and judge which one
works for what context.
3 https://library.gv.com/the-gv-research-sprint-a-4-day-process-for-answering-important-startup-questions-97279b532b25
Table 11.1: A five-day Design Sprint, covering activities such as Long-Term Goals, Sprint Questions, Recruiting, Lightning Demos, Drawing Ideas, Crazy 8's, Develop Concepts, Walk-through, and User Test
Effectively, we are using the same Mobius Loop mental model as used throughout this
book but micro-focused on a particular option in order to refine understanding and
conduct further research about its value. That improved understanding of relative value
then filters back into the overall Mobius Loop that classified this as an option worthy of
a Design Sprint.
Figure 11.21: The Design Sprint – quick trip round the Mobius Loop
A Design Sprint will really help refine the shared understanding of the value a feature
or group of features may offer in a product. They may be functional user features. They
may also be non-functional features that are more focused on improving the platform
and improving the development experience. The same agenda as above can apply where
"users" are developers or operators and the Design Sprint is focused on researching
some potential work that will improve their development or operations experience.
This practice will help elaborate and refine the information that is pushed through
the User Story Map and Value Slicing Map. We see it as a practice on the Options
Pivot because it will help decide whether or not to proceed with the delivery of the
associated features.
Read more about the practice, add your own experiences, or raise any questions you
might have at openpracticelibrary.com/practice/design-sprint/.
The User Story Mapping practice helps us visualize our work into a story with a
clear backbone. Value Slicing allows us to form incremental release plans that can
be delivered in iterations. The Impact and Effort Prioritization and How-Now-Wow
Prioritization practices help provide alternate perspectives to help with the Value
Slicing. The Design Sprint allows us to dive deeper into a specific feature area to
research it further, so we can prioritize with increased confidence.
Forming the Initial Product Backlog
All of these practices (and many others you'll find in the Open Practice Library)
help us home in on producing an initial Product Backlog – a single,
one-dimensional list of stuff we're going to take into the Delivery Loop.
Let's now look at how we translate the information from our Value Slices into a Product
Backlog and how we can continue to prioritize it.
This has been a great blend of practices and, if there was strong collaboration and a
sense of alignment throughout (which is facilitated by having a strong foundation of
open culture), the Value Slice Map should represent a combined, shared view of work
and how it can be incrementally released.
To create the Product Backlog, we simply copy each sticky note in the top slice from
left to right and place them in a single column.
The sticky note on the left of the top slice will be copied and placed as the item at the
top of the Product Backlog. The sticky note to the right of it on the top slice will be the
second item on the Product Backlog. Once we've copied all items in the top slice, we
move to the second slice of value and, again, copy each of the items from left to right
onto the Product Backlog.
Figure 11.25: Creating a Product Backlog from the Value Slicing Canvas
We end up with a single column of Product Backlog items that have been sourced
and prioritized through a collection of robust practices. The traceability this gives us
is important: we can trace each item back to the Discovery Loop practice that generated
the idea and to the value it is intended to deliver.
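The mechanical part of this step is simple enough to sketch in a few lines of Python. The slices and their contents below are only an illustration (we've borrowed a couple of PetBattle feature names, but the slice assignments are invented); the point is that the backlog is just the Value Slice Map flattened top slice first, left to right, keeping a note of where each item came from.

# Flatten a Value Slice Map into a single, ordered Product Backlog.
# Slice contents are illustrative; the assignments are not PetBattle's real map.
value_slice_map = [
    ["Open PetBattle", "Display Leaders", "Add my Cat"],   # top slice, left to right
    ["Vote for Cat", "Verify Image"],                      # second slice
]

product_backlog = []
for slice_number, slice_items in enumerate(value_slice_map, start=1):
    for item in slice_items:
        product_backlog.append({
            "title": item,               # copied word for word for traceability
            "source_slice": slice_number,
        })

for rank, entry in enumerate(product_backlog, start=1):
    print(f"{rank}. {entry['title']} (from slice {entry['source_slice']})")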
Let's look at that traceability in action with our PetBattle organization.
This is the beginning of the life of the PetBattle Product Backlog. It will
remain living, breathing, and always ready for updates as long as the
PetBattle product is in operation.
In fact, the team immediately saw that some early prioritization was needed
and recommended moving the CI/CD workshop and team lunch/breakfast
items to the top. They all agreed there was no point writing any code or
building any features until they had CI/CD in place and a fed and watered team!
The Product Backlog is a living, breathing artifact. It should never be static. It should
never be done. It is a tool that is always available for teams and stakeholders to
reference and, in collaboration with Product Owners, a place to add ideas, elaborate on
existing ideas, and continue to prioritize work items relative to each other.
From this moment onward, we will start and continue the practice of Product Backlog
Refinement.
There is also no defined agenda for a Product Backlog Refinement session and it can
be attended by a variety of different people, such as development and operational
team members, business stakeholders, and leadership. The activities that take place in
Product Backlog Refinement include:
• Talking through and refining the collective shared understanding of an item on
the backlog, its value to users, and the implementation requirements that need to
be satisfied
• Re-writing and refining the title of a Product Backlog item to better reflect the
collective understanding
• Writing acceptance criteria for a specific item on the Product Backlog
• Doing some relative estimation of the effort required to deliver a feature from
the backlog to satisfy the acceptance criteria
• Splitting an item on the backlog into two (or more) smaller items
• Grouping items together into a more standalone item that will deliver a stronger
unit of value
• Capturing new ideas and feedback on the Product Backlog
• Prioritizing and re-ordering items on the Product Backlog
All of the artifacts we've generated on the Discovery Loop and Options Pivot are
useful to look at, collaborate on, and refine further when performing Product Backlog
Refinement. They too are all living, breathing artifacts and, often, conversations during
Product Backlog Refinement trigger further updates to these. So, for example, we
may add a new deliverable to our Impact Map and connect it to an impact and actor
to test with. We may elaborate on some details on the Event Storm as we start to
consider the implementation details of an associated backlog item. As new items are
captured from Product Backlog Refinement, the Impact and Effort Prioritization Matrix,
How-Now-Wow Prioritization Matrix, and Value Slice board artifacts are all available to
relatively plot the new item against existing items. In Chapter 17, Improve It, we'll return
to the Options Pivot following an iteration of the Delivery Loop and look at how the
measurements and learning captured from delivery can drive further Product Backlog
Refinement.
Arguably one of the most important aspects of Product Backlog Refinement is
prioritization and, in particular, prioritizing what is toward the top of the Product
Backlog. This is what the team will pull from when planning their next iteration of the
Delivery Loop. So, it's important that the items at the very top of the backlog truly
reflect what is most valuable and help generate the outcomes that matter.
For more details on Product Backlog Refinement and to converse with the community,
take a look at the Open Practice Library page at openpracticelibrary.com/practice/
backlog-refinement.
We've already seen a few tools that help with initial Product Backlog generation and
giving the first set of priorities. Let's look at a few more that will help with ongoing
Product Backlog prioritization.
Prioritization
Throughout this chapter, we've used the terms features and Product Backlog items
to explain the different units of work that we capture through Discovery and prioritize
and decide which to work on first in the Options Pivot. An important clarification
that's needed is that this does not just mean functional features. We are not just
deciding which shiny new feature the end users are going to get next. We need
to balance customer value against risk mitigation; we need to balance functional
against non-functional work. We do that by balancing research, experimentation, and
implementation.
Running Technical Spikes and proving some of the non-functional aspects of the
platform early can provide knowledge and confidence value, which can be equally as
important as, if not more important than, the customer value achieved from delivering
functional features.
In fact, this non-functional work helps us achieve the Enabling Outcomes outlined
in Chapter 10, Setting Outcomes, whereas the functional implementations are more
focused on achieving the primary outcomes.
Let's look at an economic prioritization model that can help us quantify risk, knowledge
value, and customer value. It can be used by a Product Owner in collaboration
with wider groups of team members and stakeholders and presented to the wider
organization.
So, what is WSJF? It is based on Don Reinertsen's research on the Cost of Delay and
the subject of his book The Principles of Product Development Flow – Second Generation
Lean Product Development. Reinertsen famously said, If you quantify one thing, quantify
the cost of delay. Josh Arnold explains how the Cost of Delay is calculated by assessing
the impact of not having something when you need it. As a typical example, this might
be the cost incurred while waiting to deliver a solution that improves efficiency. It is the
opportunity cost between having the same thing now or getting it later.4
The core thinking behind the Cost of Delay is value foregone over time. For every day
we don't have an item in the market, what is it costing the organization? If the value of
the item is a cost-saving initiative, how much money is the organization not saving by
not implementing this feature? If the value of the item is revenue-related, what is the
additional revenue they're missing out on by not implementing it?
The Cost of Delay can be sensitive to time. There are seasonal influences – for example,
shipping in retail can be very time-sensitive around, say, the holiday season. Changes
may be needed for legislation and compliance. The cost can be very high if something is
not delivered by a certain date when new legislation kicks in. The Cost of Delay will be
nothing in advance of this date and very high after this date.
There are three primary components that contribute to the Cost of Delay:
• Direct business value to the customer and/or the organization. This
reflects preferences users might have that will drive up their customer
satisfaction. It will also reflect relative financial reward or cost reduction that the
item is expected to drive.
• Time criticality to implementing the solution now or at a later date. This
incorporates any seasonal or regulation factors that might drive time criticality,
as well as whether customers are likely to wait for solutions or if there is a
compelling need for it now.
• Risk reduction and opportunity enablement is the indirect business value this
might bring to the organization. It considers the hidden benefits this might bring
in the future as well as reducing the risk profile.
Using Cost of Delay to prioritize work in agile backlogs will result in items being
prioritized by value and sensitivity to time. It also allows us to have a lens on direct
business value (such as new functional feature development) and indirect business
value (such as non-functional improvements to the OpenShift platform).
Cost of Delay = Business Value + Timing Value + Risk Reduction/Opportunity Enablement Value
WSJF adds a further dimension to this by considering the cost of implementation.
Reinertsen said it is critical to remember that we block a resource whenever we service a
job. The benefit of giving immediate service to any job is its cost-of-delay savings, and the
cost is the amount of time (duration) we block the resources. Both cost and benefit must
enter into an economically correct sequencing.5
Weighted Shortest Job First (WSJF) = Cost of Delay (COD) / Duration
What unit do we use for the three components in Cost of Delay and Duration? It's
arbitrary. The actual numbers are meaningless by themselves. The agile practice we use
to support COD and WSJF is Relative Estimation,6 whereby we are relatively assessing
the magnitude of business value, timing value, and risk reduction/opportunity
enablement for each item on the Product Backlog relative to each other item. This
allows us to prioritize the Product Backlog according to WSJF.
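To make the arithmetic concrete, here is a small Python sketch of scoring and ordering a backlog with WSJF. The item names and the relative scores are invented for illustration; in practice the numbers come out of the team's Relative Estimation conversations.

# A minimal WSJF calculation over relatively estimated backlog items.
# All scores are invented; in practice they come from Relative Estimation.
items = [
    # (name, business value, timing value, risk reduction/opportunity enablement, duration)
    ("Verify Image technical spike", 3, 2, 8, 3),
    ("Vote for Cat A/B test",        8, 5, 3, 5),
    ("Add my Cat upload",           13, 3, 2, 8),
]

def wsjf(business_value, timing_value, risk_opportunity, duration):
    cost_of_delay = business_value + timing_value + risk_opportunity
    return cost_of_delay / duration

# Highest WSJF first: short jobs with a high Cost of Delay float to the top.
for name, *scores in sorted(items, key=lambda i: wsjf(*i[1:]), reverse=True):
    print(f"{name}: WSJF = {wsjf(*scores):.2f}")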
We've now introduced several practices on this first trip to the Options Pivot that help
us generate more ideas from discovery, refine them, prioritize them, and, ultimately,
decide which options we're going to take into a Delivery Loop next. But who makes this
decision? The term we has been used a lot in this chapter, emphasizing the importance
of collaboration. But what happens when we don't have a consensus? Who gets the final
say? This is where the importance of great Product Ownership comes in.
6 https://openpracticelibrary.com/practice/relative-estimation/
7 https://www.mountaingoatsoftware.com/blog/why-the-fibonacci-sequence-works-well-for-estimating
The team members would reveal individual scores to each other and a
conversation would follow to converge and align on the team's assessment
for each score.
This resulted in a Cost of Delay value and a WSJF value for each item.
The previous sections on forming the Product Backlog, refining it, and prioritizing it all
describe key responsibilities of Product Ownership, which we will now explore further.
Product Ownership
Everything in this chapter is about Product Ownership. Everything in the previous
chapters about Discovery is Product Ownership. Prioritizing early efforts to build a
foundation of open culture, open leadership, and open technology practices requires
strong Product Ownership from the outset.
There are whole books and training courses written about Product Ownership, Product
Owners, and Product Managers. Much of our thinking has been inspired by the amazing
work of Henrik Kniberg. If you have not seen his 15-minute video on YouTube entitled
Product Ownership in a Nutshell,8 please put this book down, go and get a cup of tea,
and watch the video now. Maybe even watch it two or three times. We, the four authors
of this book, reckon we've collectively seen this video over 500 times now!
8 https://www.youtube.com/watch?v=502ILHjX9EE
Some say it is the best 15-minute video on the internet, which is quite an accolade! It
packs in so many important philosophies around Product Ownership in such a short
amount of time. We tend to show this video during our DevOps Culture and Practice
Enablement sessions, when we start a new engagement with a new team, or simply to
kick-start a conversation with a stakeholder on what agile is really about.
The resulting graphic is well worth printing out and framing on the wall!
During our time working on hundreds of different engagements, we've seen some
examples of amazing Product Ownership. We've also seen some really bad examples.
Let's look at some of the patterns that we've observed, starting with Product Owners.
Over time, the need for direct access to this Product Owner diminished. It
is a pattern I've noticed working with several organizations where Product
Ownership has been particularly strong. Great Product Owners democratize
Product Ownership and provide a direct connection between teams and
stakeholders. Product Owners should see their current role as one that
should eventually self-destruct and not be needed in the long term.
Next, let's look at how great Product Owners have approached their first iterations and
what they've prioritized to form their first iteration goals.
Several other engagements have had almost identical goals, and the pattern
is strong because:
• The team wants to set up their workspace. That may be their physical
workspace with lots of information radiators and collaboration
space. It may be a virtual workspace with digital tooling. It may be a
development environment using code-ready workspaces and being
familiar with all tools to be used.
• The team plans to build a walking skeleton. This is a thin slice of the whole
architecture delivered in one iteration. There won't be any fancy
frontend or complex backend processing. They will prove full-stack
development and that the cross-functional team representing all parts
of the logical architecture can deliver working software together. It's a
walking skeleton because it is a fully working product. It just doesn't do
very much yet!
• Their work will be underpinned by continuous integration and
continuous delivery. This green-from-go practice means they are
set up for success when it comes to automating builds, tests, and
deployments. If they prove this and learn this for a thin slice, it will
become increasingly valuable as we start to put all the flesh and organs
into the walking skeleton!
The final part of this chapter shifts the focus from what we're deciding to deliver
next to how we're going to measure and learn from our experiments and the features
we deliver. The OpenShift platform enables our teams to consider several advanced
deployment capabilities.
Advanced Deployment Considerations
The OpenShift platform enables several different deployment strategies that support
the implementation of experiments. When we are on the Options Pivot, we should
consider these strategies and which (if any) we should plan with the delivery of the
associated Product Backlog item. The advanced deployment strategies we can consider
include:
• A/B Testing
• Blue/Green Deployments
• Canary Releases
• Dark Launches
• Feature Toggling
We introduce these concepts here as, from an options planning perspective, this is
where we need to be aware of them. We'll return to specific implementation details in
Section 6, Build It, Run It, Own It, and we'll explore how we use the resulting metrics in
Section 7, Improve It, Sustain It.
A/B Testing
This is a randomized experiment in which we compare and evaluate the performance
of different versions of a product in pairs. Both product versions are available in
production (live) and randomly provided to different users. Data is collected about
the traffic, interaction, time spent, and other relevant metrics, which will be used
to judge the effectiveness of the two different versions based on the change in user
behavior. The test determines which version is performing better in terms of the Target
Outcomes you have started with.
A/B Testing is simple to apply, fast to execute, and often conclusions can be made
simply by comparing the conversion/activity data between the two versions. It can be
limiting as the two versions should not differ too much and more significant changes
in the product may require a large number of A/B Tests to be performed. This is one
of the practices that allows you to tune the engine, as described in The Lean Startup9 by
Eric Ries.
9 http://theleanstartup.com/
For more information on this practice and to discuss it with community members
or contribute your own improvement to it, please look at openpracticelibrary.com/
practice/split-testing-a-b-testing/.
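As an illustration of how the random split can be kept consistent for each user, here is a small Python sketch that buckets users into version A or B by hashing their identifier. The hashing scheme and the 50/50 split are our own illustrative choices; the platform-level traffic splitting is covered when we return to implementation details in Section 6.

# Deterministically bucket users into version A or B of an experiment.
# The hashing scheme and the 50/50 split are illustrative choices.
import hashlib

def ab_variant(user_id: str, experiment: str) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same user always sees the same variant for a given experiment.
print(ab_variant("user-42", "vote-for-cat-layout"))
print(ab_variant("user-1337", "vote-for-cat-layout"))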
Blue/Green Deployments
Blue/Green Deployment is a technique in software development that relies on two
production environments being available to the team. One of them, let's call it green,
is operational and takes load from the reverse proxy (load balancer/router). The other
environment, let's call it blue, is a copy upgraded to the new version. It is disconnected
from the load balancer while this upgrade is completed.
The team can perform all required tasks for an upgrade of the product version on the
blue environment without the rush of downtime. Once the blue environment is ready
and has passed all tests and checks, the team simply redirects the reverse proxy (load
balancer/router) from the green environment to the blue environment.
If everything works fine with the blue environment, the now outdated green can be
recycled to serve as the blue for the next release. If things go badly, the team can
switch back to the stable environment instantly using the reverse proxy/load
balancer/router.
This is a feedback loop practice that allows the team to get prompt feedback from the
real-life use of their changes. It enables continuous delivery and provides safety for
performing complex releases. It removes the time pressure and reduces the downtime
to practically zero. This is beneficial for both technical teams and end users who will
not notice glitches or unavailability of the service/product, provided that the new
version is performing at par. In case of adverse effects, it allows the teams to have an
instant roll-back alternative and limit the negative impact on customers.
To explore this practice further, visit the Open Practice Library page at
openpracticelibrary.com/practice/blue-green-deployments/.
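The heart of the practice is a single switch at the router. The short Python sketch below models that idea conceptually; the backend names are placeholders and this is not how OpenShift routes are actually configured (we return to the real implementation in Section 6).

# Conceptual model of a Blue/Green switch at the reverse proxy.
# Backend names are placeholders; this is not OpenShift configuration.
class Router:
    def __init__(self, live_backend: str, idle_backend: str):
        self.live = live_backend   # currently taking production traffic
        self.idle = idle_backend   # upgraded and tested off to the side

    def switch(self) -> None:
        # Point all traffic at the idle environment; keep the old one for rollback.
        self.live, self.idle = self.idle, self.live

router = Router(live_backend="green.petbattle.example.com",
                idle_backend="blue.petbattle.example.com")

router.switch()                    # blue passed its checks, so cut over
print("Serving traffic from:", router.live)
# Calling router.switch() again is the instant rollback.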
Canary Releases
In software development, this is a form of continuous delivery in which only a small
number of the real users of a product will be exposed to the new version of the product.
The team monitors for regressions, performance issues, and other adverse effects and
can easily move users back to the working old version if issues are spotted.
The term comes from the use of caged birds in coal mines to discover the buildup
of dangerous gases early on. The gases would kill the bird long before they became
life-threatening for the miners. As with the canary in the mine, this release practice
provides an early warning mechanism for avoiding bigger issues.
The canary release provides continuous delivery teams with safety by enabling them to
perform a phased rollout, gradually increasing the number of users on a new version
of a product. While rolling out the new version, the team will be closely monitoring
the performance of the platform, trying to understand the impacts of the new version,
and assessing the risks of adverse effects such as regressions, performance issues, and
even downtime. This approach allows the team to roll back the release as soon as such
adverse effects are observed without the majority of the customers being impacted
even for a limited amount of time.
A Canary Release is similar to A/B Testing in the sense that it only exposes part of
the population to the new feature but, unlike A/B Testing, the change is typically a
completely new feature rather than a small tweak of an existing one. The purpose is
different too. A/B Testing looks to improve the product's performance in terms of
business outcomes, while the Canary Release is focused entirely on technical
performance.
You can read more about this practice, contribute improvements, or have a discussion
with the wider community at openpracticelibrary.com/practice/canary-release.
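The control logic of a phased rollout can be sketched as a simple loop that raises the new version's share of traffic in steps and backs out if an error metric degrades. In this Python sketch, set_traffic_split() and error_rate() are stand-ins for the platform's routing and monitoring capabilities, and the step sizes and threshold are invented.

# Sketch of canary rollout control logic. set_traffic_split() and error_rate()
# are stand-ins for platform routing and monitoring; numbers are illustrative.
import time

def rollout_canary(set_traffic_split, error_rate, threshold=0.02,
                   steps=(5, 25, 50, 100), soak_seconds=600):
    for percent in steps:
        set_traffic_split(new_version_percent=percent)
        time.sleep(soak_seconds)                      # let real traffic hit the canary
        if error_rate("new-version") > threshold:
            set_traffic_split(new_version_percent=0)  # instant rollback
            return False
    return True                                       # canary promoted to 100%

# Example wiring with dummy stand-ins (replace with real platform calls):
rollout_canary(lambda new_version_percent: print(f"route {new_version_percent}% to new version"),
               lambda name: 0.001,                    # pretend the error rate stays low
               soak_seconds=0)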
Dark Launches
Dark Launches are another continuous delivery practice in which new features are
released to a subset of end users and their behaviors and feedback are then captured. They enable
the team to understand the real-life impact of these new features, which may be
unexpected for users in the sense that no users asked for them. It is one of the last
steps for validating a product/market fit for new features. Rather than launching the
features to your entire group of users at once, this method allows you to test the waters
to make sure your application works as planned before you go live.
Dark Launches provide safety by limiting the impact of new features to only a subset
of the users. They allow the team to build a better understanding of the impact
created by the new feature and the ways the users would interact with it. Often novel
ways of interaction can surface, ways that were not initially envisioned by the team.
This can be both positive and negative, and the limited availability allows the team to
draw conclusions from the real-life use and decide if the feature will be made widely
available, further developed, or discontinued.
The Dark Launches practice has its own Open Practice Library page at
openpracticelibrary.com/practice/dark-launches/, so head there for further
information, to start a conversation, or to improve the practice.
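Stripped down to code, a dark launch needs two things: a rule that decides whether a given user is in the launch cohort, and a way of recording how that cohort actually uses the feature. The Python sketch below is illustrative only; the cohort source, the feature name, and the metrics sink are all placeholders.

# Sketch: expose a not-yet-announced feature to a small cohort and record usage.
# The cohort, feature name, and metrics sink are placeholders.
BETA_COHORT = {"user-42", "user-7", "user-1337"}      # e.g. opted-in beta users

def in_dark_launch(user_id: str) -> bool:
    return user_id in BETA_COHORT

def record_usage(user_id: str, feature: str, event: str) -> None:
    # Stand-in for pushing an event to an analytics/metrics pipeline.
    print(f"feature={feature} user={user_id} event={event}")

def show_tournament(user_id: str) -> str:
    if in_dark_launch(user_id):
        record_usage(user_id, "display-tournament-cat", "viewed")
        return "new tournament view"                  # the dark-launched feature
    return "existing home view"                       # everyone else sees no change

print(show_tournament("user-42"))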
Feature Flags
Feature Flags (also known as Feature Bits/Toggles/Flipping/Controls) are an
engineering practice that can be used to change your software's functionality without
changing and re-deploying your code. They allow specific features of an application to
be turned on and off for testing and maintenance purposes.
In software, a flag is one or more bits used to store binary values. So, it's a Boolean that
can either be true or false. A flag can be checked with an if statement. A feature in
software is a bit of functionality that delivers some kind of value. In its simplest form, a
Feature Flag (or Toggle) is just an if statement surrounding a bit of functionality in your
software.
Feature Toggles are a foundational engineering practice and provide a great way to
manage the behavior of the product in order to perform experiments or safeguard
performance when releasing fresh new features.
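Since a toggle really is just an if statement around a bit of functionality, the simplest form looks something like the Python sketch below. Reading the flag from an environment variable is only one option we've assumed here; a ConfigMap or a dedicated feature-flag service are common alternatives.

# The simplest possible Feature Toggle: an if statement around a feature.
import os

def feature_enabled(name: str) -> bool:
    # Flags are just booleans; here they come from an environment variable,
    # for example FEATURE_NEW_LEADERBOARD=true.
    return os.getenv(f"FEATURE_{name.upper()}", "false").lower() == "true"

def leaderboard() -> str:
    if feature_enabled("new_leaderboard"):
        return "new leaderboard with pagination"   # behaviour behind the flag
    return "classic top three leaderboard"         # existing behaviour

print(leaderboard())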
The team worked through the feature list quickly, spending, on average,
3 minutes per item. Their goal was to put each feature in one of the
columns (Research, Experiment, or Implement) with a short note on what
their approach to the implementation would be.
• Open PetBattle: This was easy. Anyone using the app would need to
open it. IMPLEMENT.
• Display Leaders: Lots of questions about what and how to display.
How many leaders? Should we add pagination or scroll? They decided
some RESEARCH was needed – perhaps a UI prototype with some user
testing.
• Let me in please: The team had to go back to the Event Storm to
remind themselves what this was about! Again, it was a simple feature
of letting the user in to see Pets uploaded. IMPLEMENT.
• Vote for Cat: This triggered some conversation. Do they vote up or
down? Or do they just give a vote (or nothing at all)? The team was
divided and had heard differing views from user interviews. They
decided to EXPERIMENT with an A/B Test.
• Add my Cat: Not much research or experimentation needed. A
standard uploading tool was needed. Just IMPLEMENT.
• Verify Image: This sounded a bit trickier. There were emerging AI/ML
patterns available. It needed some technical RESEARCH and probably a
Technical Spike.
• Enter cat into tournament: Not much ambiguity here. IMPLEMENT.
• Display Tournament Cat: It wasn't clear if this was going to be well
received or not. The team thought they could EXPERIMENT with a
feature toggle, which would be easy enough to turn off.
• Disable "Add my Cat": Some users have more than one cat and will
want to add more than one. Let's EXPERIMENT with a Dark Launch of
this feature to a small subset of users.
• Vote for given cat: Once the team got the results from the A/B Test,
they could EXPERIMENT further and launch as a Canary Test.
• Update the Leaderboard: IMPLEMENT
• End Competition: IMPLEMENT
• Notify Players: Not clear how this would happen – SMS? Email? Other
mechanisms? The team decided to do some user RESEARCH.
Let's look at another real-world experience to see just how simple yet effective this
experimental mindset can be.
The team was very small, just one designer, two engineers, a business
analyst, and a Product Owner. As a small co-located team, buried in the
heart of the bank, we were able to move fast! We interviewed people who
had recently purchased mortgages with the bank to get insight into their
motivations for using the tool. We did a load of research by going into the
bank branches and asking people open-ended questions while they used
the existing tool. We collated this information along with how they were
accessing the calculator and, if they were to complete an application, what
device they would use – that is, their phone or their laptop.
Through this research we stumbled upon an interesting fact – people were
not interested in How much could I borrow but How much house can I afford.
This simple difference might seem inconsequential but it massively affected
how we rebuilt the bank's online mortgage calculator. It meant people
wanted to be able to tailor their calculation to see how their rates and
lending criteria could be affected by, for example, having more income. Or,
if they were to continue to save for another year and have more of a deposit
saved, could they get a better rate? This flip meant people were using the
tool not to see if they could afford a given home, but to see how much of a
home they could afford and by when.
It would have been very simple for us to just recreate the bank's existing
calculator with a new skin that ran on a mobile – but this would not have
addressed the core problem. By reframing the question, we were now in
a position to create a simple calculator tailored to the needs of the bank's
first-time buyers.
All these advanced deployment considerations provide powerful tools for Options
planning and for deciding how we conduct research, experimentation, and implementation.
When we return to the Options Pivot after an iteration of the Delivery Loop, we'll
complete the final section of this map:
• What did we learn?
The Options Map provides clarity and direction on the product priorities that will help
reach the outcomes. It helps form our delivery strategy.
Conclusion
In this chapter, we focused on how we are going to deliver the outcomes set in the
previous section.
We explored the User Story Mapping and Value Slicing practices and how we take all
of the information captured in Discovery practices and push it through these tools.
We also showed how using some helpful practices to look at the same information
through slightly different lenses – Impact and Effort Prioritization and How-Now-Wow
Prioritization – can help improve Value Slicing.
benefit from a deeper dive to understand the value, we recommended the Design Sprint
as an option.
We showed how these practices drive the initial Product Backlog prioritized by value
and how this produces a living, breathing artifact that will be subject to continuous
Product Backlog Refinement as we gather more learning, feedback, and metrics for our
delivery. The economic prioritization model WSJF, which is based on Cost of Delay,
provides a repeatable and quantifiable tool to drive this. It's one of many prioritization
tools that can help the Product Ownership function work smoothly and effectively.
Finally, we looked at the advanced deployment considerations that should be taken
into account when designing experiments and how platforms such as OpenShift enable
powerful evidence-based testing to be conducted in production with users.
Green Deployments, Canary Releases, Dark Launches, and Feature Flags were all
introduced from a business perspective. We will return to the implementation details
of these in Section 6, Build It, Run It, Own It and explore how we interpret the measures
from them in Section 7, Improve It, Sustain It.
Figure 11.37: Practices used to complete a Discovery Loop and Options Pivot on a foundation of culture and technology
In the next chapter, we will shift to the Delivery Loop. We'll look at agile delivery and
where and when it is applicable according to levels of complexity and simplicity. We'll
also look at Waterfall and the relative merits and where it might be appropriate. We'll
explore different agile frameworks out there and how all of them relate to the Open
Practice Library and Mobius Loop. We'll explore the importance of visualization and of
capturing measurements and learning during our iterations of the Delivery Loop.
Section 5: Deliver It
In Section 4, Prioritize It, we moved through the Options Pivot for the first time. We
took all of the learning and ideas that can come out of the practices used on the
Discovery Loop and applied more practices to organize them according to their value.
We used prioritization techniques to help us slice value and considered how to plan
and design experiments that could be run by making use of advanced deployment
capabilities offered by the OpenShift platform.
As outlined in the Mobius Outcome Delivery QuickStart Guide by Gabrielle Benefield
and Ryan Shriver (freely available to download from mobiusloop.com), Mobius
encourages product development through the use of continuous discovery, validation,
and learning. Building products this way requires a strong foundation of culture and
technical practices that enable an ongoing experimental fast release and fast feedback
approach. Indeed, ongoing research and experiments are just as much part of product
development as feature delivery.
The key differences between research, experimentation, and implementation are the
investments in time and money required to steer a team toward building the right thing
that is aligned with the target outcomes. Mobius encourages the idea of delivering
maximum value from the least amount of investment.
As we move onto the Delivery Loop, we're going to carry out research, execute
experiments, and implement features. How are we going to do this? Are we going
to use a framework like Scrum, or Scaled Agile Framework (SAFe), or Large-Scale
Scrum Framework (LeSS), or Lean? Are we going to follow a prescriptive approach or
methodology? No, we are not! As has been the case throughout this book—indeed, as
is the mindset behind Mobius and the Open Practice Library—we are going to offer a
toolbox of practices that will help you work your way around the Delivery Loop. We will
not say there is one right way to do things because it does not exist. Our approach is to
adapt based on context, learning, and experience. We will look at some of the popular
practices that we have found to be very effective and share some stories and tips
behind them.
Which practices you choose is up to you but, in nearly all cases, we strongly
recommend using practices that allow fast delivery, rapid feedback, and continuous
learning throughout.
To this effect, this section is divided into two chapters. Chapter 12, Doing Delivery, covers the difference between Waterfall and Agile approaches and the popular Agile practices used to deliver incrementally and iteratively. Chapter 13, Measure and Learn, is all about capturing metrics and learning from what they tell us. We can use the many different flavors of metrics, applied to small incremental deliveries, to continually improve.
In Chapter 12, Doing Delivery, we will:
• Describe, with examples, the practices we often use to help with Delivery Iterations, including Iteration (or Sprint) Planning, Daily Stand-Ups, Showcases, and Retrospectives.
• See how, as teams mature and continuously improve, they can adapt their
working practices and approaches.
• Share some tools and tips about getting working software features done with a
focus on both functional and non-functional acceptance.
• See, for the first time, how all of our practices blend together to give us a
continuous delivery engine.
Let's start by having a look at those two phenomena—Waterfall and Agile. As depicted
in Figure 12.1, Waterfall is often associated with Gantt charts, dependencies, and hand-
offs and hand-overs from one part of the organization to another. Agile takes a different
mindset and is more cross-functional. Some may say Waterfall is old and dying.
We'll see later in this chapter that there remain valid use cases for taking a Waterfall
approach.
Waterfall
Waterfall projects are broken down into linear steps; each step relies on the
previous one being completed before starting the next. The first step is to gather
user requirements, followed by designing and planning, and then finally software
development. Once testing and deployment are done, we enter the maintenance and
operation mode.
This type of delivery methodology has a long history outside of software development,
especially in traditional engineering. When building a bridge or a factory, the project
plan lays out the requirement of all the people, resources, tasks, and timelines for
the overall project of work. The Waterfall approach can be described as a plan-driven
engineering process where success is measured according to how well development is
able to keep up with the plan.
By following a plan, individual variations are minimized. In the industrial world, this
helps make both delivery and cost predictable. Because the traditional Waterfall model
does not allow for looping, requirements and design are gathered in vast documents
to begin planning. Changing and understanding the documents becomes very tricky
for teams because there are no mechanisms for getting a shared understanding other
than reading all the documents over again. Have you ever participated in a book review
session? How many different points of view do you get when a group of people reads
the same book? Couple this with an organization's likelihood of raising expensive change requests and having to rework the documentation, and it quickly becomes clear how disastrous this approach can be for a company. So how does this method fare in the
world of software?
Often, we find that in the software realm the requirements do not match what
the end users expect. This is because as we develop the software, questions and
misunderstandings about the original requirements and design become hard to
address. As design and business requirement teams are not heavily involved in the
development phase, there is no opportunity to refresh the designs or ideas.
Delivering customer and business value through working software takes a lot longer
using Waterfall—as working software is only delivered toward the end of a Waterfall
process. Think of how much time is lost while you're waiting for your great application
to arrive. What if it never arrives? Or worse, arrives too late and you've missed your
market?
Changing requirements during a Waterfall execution is often left to another Waterfall
phase. If changes are urgent, then they may be handled through some form of change
request mechanism. However, this is really a form of loop—leading us to a more iterative
approach. In Waterfall, these loops can still be very time-consuming and costly. If only
there was a quicker way!
Winston Walker Royce was an American computer scientist and director at Lockheed
Software Technology Center in Austin, Texas. He was a pioneer in the field of software
development, known for his 1970 paper from which the idea of the single-pass Waterfall
model for software development was mistakenly drawn. That's right, Waterfall was
misunderstood and mistaken!
Barry Boehm wrote in 1987: "Royce's 1970 paper is generally considered to be the paper
which defined the stagewise Waterfall model of the software process. But it is surprising
to see that both the earlier Benington and Hosier papers had good approximations to the
Waterfall model, and that Royce's paper already incorporates prototyping as an essential
step compatible with the Waterfall model."1
Royce demonstrated that while the development of large software systems required
a more thorough approach, there was an inherent risk in the single-pass sequential
approach. He proposed an iterative approach and advocated that projects should pass
through this at least twice. But many Waterfall projects started, and have continued, to follow a single-pass flow, left to right, with a single iteration, which was never intended or recommended by Royce; Waterfall plans were never supposed to be executed just once.
The Waterfall model did help drive the innovation of other approaches, and the missing focus on iteration was corrected in Agile models.
The Agile Manifesto2 is a simple set of values, highlighting the importance of some
values over others.
Working software over comprehensive documentation: "This sounds great, I don't have to write any documentation!" is what some are probably thinking. But they would be wrong.
The values are not saying things on the right are not important, but that there is higher
value in the items on the left. This is the bit people tend to forget when quoting the
manifesto.
2 https://agilemanifesto.org/
In addition to this set of values, this group came up with a set of 12 principles3 to
support them.
These principles provide more detail about the kind of behaviors they foresaw as being
agile. Read together, the principles provide an incredibly useful resource when having
conversations about what it means to be agile. Agile these days has really become an
overused buzzword and is frequently injected into sentences. Examples include—"We're
going to use Agile to deliver this project" or "We're going to install Agile." Reading through
and having a conversation about the 12 statements can provide clarity and alignment on
what we mean by Agile.
3 https://agilemanifesto.org/principles
The Agile Manifesto contrasts with the plan-driven Waterfall approach in that it:
• Is more adaptive, rather than prescriptive.
• Is more people-oriented, rather than process-oriented.
At the beginning of the industrial age, people working on the factory floor were
described by Frederick Taylor, an American mechanical engineer widely known for his
methods to improve industrial efficiency, as being lazy, venal, and stupid. A factory was
built and designed by architects and engineers who did not work on the factory floor.
One of the tenets of Agile is that the people and teams who do the work decide on how
it gets done.
An important part of Agile is its focus on cross-functional teams delivering working
software. The designers and business stakeholders are continuously involved in this
process. Unlike Waterfall, when questions arise about the end user experience, these
can be researched and answered based on the most up-to-date information available to
the team.
Agile methodologies focus on simultaneous workflows that form a baseline to help us
control change. Plans are constantly revised to reflect the learnings during a product's
development. Success is based on the value delivered by the working software.
Agile is focused around products that are directly connected to the end user as
opposed to effort and output that may not be directly connected. The team comes
together to drive successful delivery where processes and tools are used to optimize
product delivery.
One of the most noticeable traits of Agile is its breaking down of project delivery into
smaller iterative phases. This works really well for software products that require
regular revisions and testing. Because the work is done incrementally, teams can easily
adjust what is being worked on. They can pivot to work on high-value features and
deliver those to customers early and fast.
Some critics of Agile methods quite rightly point out that Agile projects also fail.
However, it is not methodologies that fail; it is teams and organizations that fail.
Often the reasons can be clearly articulated as the team taking on too much technical
debt—adding new features or changing existing software is just too hard to do quickly
because the code base is a mess, or the architecture is just plain wrong.
Fundamentally, Agile is an ideology or a philosophy—a way of thinking. It is articulated
by the four value statements in the Agile Manifesto and defined further by the twelve
principles. Agile is instantiated and brought to life by a series of practices, many of
which are in the Open Practice Library. Practices have been grouped together to form
popular Agile frameworks such as Scrum, Kanban, and SAFe (Scaled Agile Framework).
All of these make up the Agile ecosystem.
Figure 12.6: Decomposing the Agile mindset into values, principles, practices, and frameworks
Before we get into the detail of Agile, let's consider how OpenShift complements Agile
and promotes business agility.
As we mentioned in the introduction, we still see a place for less Agile approaches
to delivery (including the use of Waterfall). Let us now look at the characteristics of
different projects that can determine what approach is best.
Decision-Making Contexts
In 2020, Red Hat produced an eBook entitled Transformation takes practice.4 This
was written in response to a question asked time and again by business leaders: Why
are so many digital transformation efforts failing? In the eBook, Mike Walker, Global
Director of Red Hat Open Innovation Labs explains: "In complex sociotechnical systems,
it is a group of people, not individuals or managers, who can create innovative change.
These groups must tune the system through a perpetual cycle of probing, sensing, and
responding to outcomes."
To explore that cycle of probing, sensing, and responding to outcomes, let's introduce
a very helpful framework that compares this approach to alternative approaches used
in different systems.
4 https://www.redhat.com/en/resources/transformation-takes-practice-open-
practice-library-ebook
The framework, known as Cynefin, is intended to help managers identify how they perceive situations and make sense of their own and other people's behavior. The framework draws on research
into systems theory,5 complexity theory,6 network theory,7 and learning theories.8
The clear, or obvious, domain represents the known knowns. This means that
there are rules in place (or best practice), the situation is stable, and the relationship
between cause and effect is clear—if you do X, Y is expected. This is the domain of legal
structures, standard operating procedures, and practices that are proven to work.
5 https://en.wikipedia.org/wiki/Systems_theory
6 https://en.wikipedia.org/wiki/Complexity_theory_and_organizations
7 https://en.wikipedia.org/wiki/Network_theory
8 https://en.wikipedia.org/wiki/Learning_theory_(education)
The complicated domain consists of the known unknowns. The relationship between
cause and effect requires analysis or expertise where there is a range of right answers.
The framework recommends sense–analyze–respond, that is, first assess the facts, then
conduct analysis and use that to apply the appropriate good operating practice.9
The complex domain represents the unknown unknowns. Cause and effect can only
be deduced in retrospect, and there are no right answers. "Instructive patterns ... can
emerge," write Snowden and Boone, "if the leader conducts experiments that are safe to
fail." Cynefin10 calls this process probe–sense–respond.
In the chaotic domain, cause and effect are unclear. Events in this domain are "too
confusing to wait for a knowledge-based response," writes Patrick Lambe. "Action—any
action—is the first and only way to respond appropriately."11 In this context, managers
must act–sense–respond, act to establish order, sense where stability lies, and respond to
turn the chaotic into the complex.
Most of the work we do with our customers sits in the complex and complicated
domains. To differentiate between these two domains further, let's look at a helpful
example used in the Transformation takes practice eBook about the difference between
a Ferrari and a rainforest.
9 https://hbr.org/2007/11/a-leaders-framework-for-decision-making
10 https://hbr.org/2007/11/a-leaders-framework-for-decision-making
12 https://www.redhat.com/cms/managed-files/cm-digital-transformation-harvard-
business-review-analyst-paper-f22537-202003-en_0.pdf
Let's get back to our visualization and mental model that we have been using
throughout the book using the Mobius Loop.
Projects in the complex domain fit well with an Agile approach. Visualized on the
Mobius Loop, this means starting on the Discovery Loop and establishing the first set
of Target Outcomes. The Options Pivot involves refining what to discover or deliver
next and deciding on which options to deliver. The Delivery Loop creates a design and
build for a small increment, runs that increment in production to collect data, and
takes measurements before learning and reflecting on the outcomes achieved. There
is then an adaption point to either go through another loop of Discovery, do another
loop of Delivery, or return to the Options Pivot to reprioritize and pivot. This cycle will
continue indefinitely as long as there are product delivery and/or operations ongoing.
There will be a growing desire, through continuous improvement, to speed up the
iterations of Discovery and Delivery.
Projects in the complicated domain fit well with an Agile approach. Visualized on
the Mobius Loop, this means starting on the Discovery Loop and establishing Target
Outcomes. This may be slower and more detailed, given projects in the complicated
domain benefit from sense and analysis before a response. A similar pattern will be
experienced for complex projects through Options into Delivery and then either a
return to Discovery, Delivery, or Options.
Finally, projects in the chaotic domain need to act quickly and move out of that domain as quickly as possible. That may mean starting on the Delivery Loop with the urgent act and, as the level of chaos reduces, beginning to re-group, organize options, and run a loop of Discovery. Once the shift to the Discovery Loop is complete, the project moves from chaos to something more like a complex project. Most application development these days
would be considered complex.
So, where does an application like PetBattle sit in the Cynefin framework, and what is
the best delivery approach to consider?
• The Event Storming done to date has focused on the one where Mary
enters the daily tournament and wins a prize. We'll need to do more
Event Storming as we consider new features and/or answers to the
unknowns.
• We have some good target outcomes and some great hypothesis
statements behind all the options. But what if the needle doesn't shift
and those hypotheses are proven false? We'll be back to the drawing
board.
• When we refined our options in Section 4, Prioritize It, on the Value
Slice board, we identified several experiments we need to run. Some
will be run as A/B tests, feature toggles, and canary launches, while
some require UI prototyping and technical spikes to run before we can
progress.
• The Empathy Map we produced from interviewing Mary was great.
But it showed how many things we hadn't thought of about our users.
What will happen when we Empathy Map other users? Or, when we
start to meet advertisers and partners and capture Empathy Maps on
them? What will they say?
• There remain a lot of technical unknowns and learning to be done on
the platform.
There are many known unknowns and probably even more unknown
unknowns! With so many unknowns, how do we deliver these changes in a
way that ensures quality (no breakages to a PetBattle Tournament!), speed
(can the team deliver changes quickly and on time?), and efficiency (can the
team repeatedly deploy changes cheaply and easily)?
PetBattle, like most user-centric applications, is a complex sociotechnical
system with many dynamics, both human and technical. With continued
emergent practice, we can now move to a probe-sense-respond mindset as
we approach our first (of many) iterations of delivery.
Taking a complex product into the first iteration of delivery means we have to decide
what, from our list of options, we are going to deliver first. We also need to check
whether the team are ready to work on the selected items and are confident they
have the right amount of information and understanding to deliver them. This leads to
another foundational practice to define and assess readiness to deliver.
The Definition of Ready often starts with the act of writing user stories and/or
acceptance criteria. When we look at Product Backlog items, they are often quite
high-level, for example, Log In, Choose Item, Submit Order. At first glance, you might
think these items are clear in scope and could be delivered according to expectation.
However, a few simple questions from a Development Team member opens up the
world of ambiguity that exists:
• When you say log in, do you mean log in with a username and password?
• What happens if I don't have a username registered?
• What happens if I put in the wrong password?
• What should happen immediately after I log in?
The practice of writing user stories has become very popular with Agile teams because it
turns a feature idea into a conversation (or, at least, a promise to create a conversation).
Mike Cohn of Mountain Goat Software has driven the popularity of user stories and
explains how the practice helps us shift the focus from writing about requirements to
talking about them. All Agile user stories include a written sentence or two and, more
importantly, a series of conversations about the desired functionality.13
User stories tend to follow the template: as a <type of user>, I want <some goal> so that
<some reason>.
So, after some brief conversation between the Development Team members, business
stakeholders, and the Product Owner, the Log In feature could be re-written as,
"as a retail banking customer, I want to log in with my username and correct password
so that I can be presented with my latest bank balance." The conversation may also have
driven the splitting of this feature into several other user stories that focus on how to
get a username/password, what should happen if incorrect credentials are supplied,
other information and options available after logging in, what other types of customers
should be able to see and do, and so on.
The user story practice facilitates three outcomes, also known as the three Cs:
1. It generates a Card—the user story is short enough that it can be written on a small
index card (or a digital equivalent).
2. It generates a Conversation, both in writing the user story and in capturing unknowns or further clarifications needed, while still providing enough direction for working on the item.
3. It provides Confirmation of the scope and expectation of this individual item that
will deliver a small piece of value to the end user.
13 https://www.mountaingoatsoftware.com/agile/user-stories
You'll notice that the act of re-writing a feature as a user story helped answer some
of the example questions, but not all of them. While we could write many user stories
simply to answer every question a developer might have, this may result in them being too granular, with developers grouping them back together to write code. So, a second
popular practice to prepare items for delivery is the act of writing acceptance criteria.
Acceptance criteria are a further elaboration of understanding, often written on the back of the index card holding a user story. They are the conditions
of satisfaction a Product Owner and stakeholders will have when the item has been
completed. These are low-level binary points of clarification — basically, a checkbox list
that the Product Owner may go through to convince themselves that a feature has been
delivered to meet business expectations. So, some acceptance criteria for our Log In
feature may include the following:
• There are text boxes for the username and password.
• Beneath, there is a hyperlink to create a new username.
• The username will be displayed in plain text when typed in.
• The password will be masked with asterisks when typed in.
• If the username/password is correct, the Customer Details page will be
presented with the bank balance at the top of the page.
• If the username/password is incorrect, an error message will appear.
• If an incorrect password is entered three times in a 15-minute period, the session
will be locked out for a further 15 minutes.
Each of these items represents small, standalone tests (which can be automated).
They are specific to the functionality being built. Later in this chapter, we'll show how
another practice, the Definition of Done, is used to drive other types of testing.
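To make this concrete, here is a minimal sketch of how a few of those criteria could become automated acceptance tests, written here in Python with pytest-style test functions. The LoginService class and its behavior are hypothetical, invented purely for illustration (and the lockout check ignores the 15-minute window for brevity); it is not part of PetBattle or OpenShift.

# A sketch of acceptance criteria expressed as small, standalone automated tests.
# LoginService is a hypothetical, in-memory stand-in for the real system under test.
class LoginService:
    def __init__(self, users):
        self.users = users            # {username: password}
        self.failed_attempts = {}     # {username: count of consecutive failures}

    def login(self, username, password):
        if self.failed_attempts.get(username, 0) >= 3:
            return "locked_out"       # simplified: ignores the 15-minute window
        if self.users.get(username) == password:
            self.failed_attempts[username] = 0
            return "customer_details"
        self.failed_attempts[username] = self.failed_attempts.get(username, 0) + 1
        return "error"

def test_correct_credentials_show_customer_details():
    service = LoginService({"mary": "s3cret"})
    assert service.login("mary", "s3cret") == "customer_details"

def test_incorrect_password_shows_an_error():
    service = LoginService({"mary": "s3cret"})
    assert service.login("mary", "wrong") == "error"

def test_three_failed_attempts_lock_the_session():
    service = LoginService({"mary": "s3cret"})
    for _ in range(3):
        service.login("mary", "wrong")
    assert service.login("mary", "s3cret") == "locked_out"

Each test maps directly to one binary acceptance criterion, which is exactly what makes these criteria good candidates for automation in a CI pipeline.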
The act of writing user stories and acceptance criteria helps generate shared
understanding across all team members, Product Owners, and stakeholders. They are
often practices used in Product Backlog Refinement sessions, which were discussed in
the previous chapter.
If a team sees the value in these practices, they may choose to add them to their
Definition of Ready so that they always go through the motions of writing user stories
and acceptance criteria for a Product Backlog item, so that it will be ready to be worked
on. If an item does not have acceptance criteria, it is NOT ready and the team either
needs to do some refinement and write the acceptance criteria, or demote the item on
the Product Backlog and find alternative items that are ready.
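As a purely illustrative sketch (not a tool we prescribe), the spirit of that readiness check could even be expressed in a few lines of Python. The item fields and the confidence threshold below are assumptions a team would agree for itself; the point is visibility, not a phase gate.

# A lightweight sketch of checking backlog items against a Definition of Ready.
def is_ready(item):
    return (
        bool(item.get("user_story"))
        and len(item.get("acceptance_criteria", [])) > 0
        and item.get("team_confidence", 0) >= 3   # for example, a Fist of Five average
    )

backlog = [
    {"title": "Log In",
     "user_story": "As a retail banking customer, I want to log in...",
     "acceptance_criteria": ["password is masked", "lockout after three failures"],
     "team_confidence": 4},
    {"title": "Choose Item",
     "user_story": "",
     "acceptance_criteria": [],
     "team_confidence": 2},
]

for item in backlog:
    status = "ready" if is_ready(item) else "NOT ready - needs refinement"
    print(f"{item['title']}: {status}")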
Now, it's important to call out some misuses and problems that can come about
through having a Definition of Ready. First, this is not an opportunity to introduce lots
of pre-development phase gates. Definitions of Ready should not be about writing long
specifications, having design documents signed off, having architectural review board
approvals, and so on. The Definition of Ready is about giving the team the confidence
that, if they take some work into a Delivery Iteration, they will get it done.
Secondly, the Definition of Ready is not a mandatory practice for teams. Like all practices in this book, it is a tool that is useful to try; if a team does experience problems completing the work it takes into Delivery Iterations, we recommend trying the Definition of Ready practice.
When we look at a team's Definition of Ready, we look for statements that are focused
on the team and their comfort or confidence, as these are clearly the criteria that bring
the team to the fore.
Let's look at the Definition of Ready criteria our PetBattle team put in place.
The team agreed that they would regularly use the Fist of Five confidence
vote or Roman vote to assess their comfort and confidence levels for the
seven points above. They were unified in that they should never take items
into a Delivery Iteration that were not ready. They would need lots of short
Product Backlog Refinement sessions to keep the top backlog items ready.
To discuss the Definition of Ready with the wider community or to read more
information or even improve the practice, have a look at the Open Practice Library page
at https://openpracticelibrary.com/practice/definition-of-ready/.
Now we have some items ready to be delivered, let's look at the most popular Agile
framework used for iterative delivery, Scrum.
Scrum
When you hear someone say that they are now doing agile, their team is using agile, or
their organization has adopted agile, there's a very strong chance that what they are
actually doing is some form of Scrum.
Scrum is not the same as agile. As mentioned earlier, Agile is an ideology. It is
underpinned by those values and principles set out in the Agile Manifesto. Scrum is one
of several frameworks that instantiates Agile and makes the principles a reality through
a group of practices. It is also, by far, the most popular Agile framework.
It was created by Jeff Sutherland and Ken Schwaber in the 1990s and was inspired by
Hirotaka Takeuchi and Ikujiro Nonaka when they published The New New Product
Development Game.14
There have been numerous books written about Scrum and, as with any good
framework, it has evolved and continuously improved with learning and experience.
We would strongly recommend reading the latest version of the Scrum Guide as
presented by Sutherland and Schwaber at https://scrumguides.org/. It is an artifact we
recommend re-reading every now and again to refresh your memory and knowledge of
this very powerful framework. This is especially true when updates are made and there
are new releases of the framework.
We're not going to teach Scrum in this book as there are many better articles and books
and, most importantly, the Scrum Guide can do that much better. What we are going
to do is demonstrate how we have used Scrum in our engagements, how it fits into the
Mobius Loop, and share some stories of good and bad Scrum adoptions.
One of the reasons we love Scrum is because of its simplicity. Often people confuse the
many (now hundreds) of practices associated with Agile as being Scrum. In fact, Scrum
has very few practices. We tend to describe them as 3-5-3: three roles, five events, and three artifacts (which sounds a bit like a formation for a sports team!).
14 https://hbr.org/1986/01/the-new-new-product-development-game
This Scrum picture can be best described by grouping the roles, events, and artifacts.
Figure 12.11: Collaboration between a Product Owner and her Development Team
The Product Owner needs to be regularly available to the Development Team, especially
during Product Backlog Refinement to, for example, write and agree on acceptance
criteria, and during Scrum events, such as Sprint Planning.
Figure 12.12: Development Team member clarifying his understanding with his Product Owner
and Business experts
In the case study story in Chapter 11, The Options Pivot, Experimenting with different
Product Owners, we mentioned that Product Owners should see their current role as
one to self-destruct and not be needed long-term. They can make themselves obsolete
by successfully facilitating the Development Team members, directly connecting them
with stakeholders and users, and building those relationships to a level so that their
facilitative role is no longer needed.
Later in the book, in Chapter 18, Sustain It, we will look at larger multi-team
organizations and some of the new patterns that help collaboration between the teams.
But first, let's look at the artifacts that every team using Scrum should always have.
The Sprint Backlog is a different artifact from the Product Backlog, one that is more micro-focused on a specific sprint or Delivery Iteration.
The Sprint Board is one of the artifacts that sees the most use during a Delivery Loop
iteration and one of our favorite information radiators. In the preceding example, the
following information is shown:
1. The Sprint number.
2. The Sprint Goal—a one- or two-sentence description of what the product should do
in the next increment that it does not currently do. Ideally, this should link to the
overall Target Outcomes agreed in the Discovery Loop.
3. Key dates—the Sprint Planning, Sprint Review, and Sprint Retrospective day, and the
time required for this sprint.
4. The Sprint Backlog items that have been pulled from the Product Backlog. There is a one-to-one mapping, and each item should be easily traceable from the Product Backlog to the Sprint Backlog.
5. The Acceptance Criteria that have been agreed for each Sprint Backlog item.
6. The low-level tasks that are needed to deliver the features according to agreed
acceptance criteria. These are color-coded to radiate the cross-functional aspect
of feature development, with different colors representing frontend and backend
development, data, infrastructure, operations, and design.
7. Columns to represent the flow of work on each task—Sprint Backlog, In Progress,
Blocked, Ready for Review, and Done.
8. Avatars to show which pairs or mobs are working on a given task in the In Progress
column.
9. A Burndown chart,16 a popular visualization practice showing the pace of work (in this case, remaining tasks) plotted against the timeline (in days) of the sprint; a short sketch of the underlying arithmetic follows this list.
16 https://openpracticelibrary.com/practice/burndown/
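The arithmetic behind a Burndown chart is simple enough to sketch in a few lines of Python; the task counts below are invented purely for illustration.

# A minimal sketch of the numbers behind a sprint Burndown chart:
# actual remaining tasks versus an ideal straight-line burn to zero.
total_tasks = 26                       # tasks identified at Sprint Planning
sprint_days = 5
completed_per_day = [4, 6, 3, 7, 6]    # tasks finished on each day of the sprint

remaining = total_tasks
for day, done in enumerate(completed_per_day, start=1):
    remaining -= done
    ideal = total_tasks - day * (total_tasks / sprint_days)
    print(f"Day {day}: actual remaining = {remaining}, ideal remaining = {ideal:.1f}")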
The following template is available in the book's GitHub repository17 and can be
downloaded to produce the same Sprint Board virtually.
One of the most impressive and powerful aspects of the Sprint Board is that it provides
the (almost) final piece of traceability in the life of developing a given feature. If you walk the walls with all Discovery, Options, and Delivery artifacts available, you should be able to take a low-level task (typically a development activity performed by a pair of Development Team members that takes a few hours) and see its connection to:
1. The overall feature/user story and the acceptance criteria on the Sprint Backlog.
2. Where that feature sits on the Product Backlog.
3. How that feature was prioritized using Value Slicing and other prioritization tools.
4. How that feature fits into the event-driven system and overall business process on
the Event Storm.
5. The measurable impact that was hypothesized with the feature on the Impact Map.
6. The actors captured on the Impact Map with whom the features can be user-tested.
7. The overall goal captured on the Impact Map and in Target Outcomes that this
feature should be aiming to achieve.
The final piece of traceability is the working software, which forms the Product Increment, the third artifact of the Scrum framework.
17 https://github.com/PacktPublishing/DevOps-Culture-and-Practice-with-
OpenShift
The final installment of our tour around the Scrum framework and how we use it is to
explore five Scrum events – Sprint Planning, the Daily Scrum, the Sprint Review, the
Sprint Retrospective, and the Sprint itself.
The third step is to decompose the items into low-level tasks. We capture all the things
we need to do to deliver this feature to satisfy the acceptance criteria and to meet
the Definition of Done (which we'll cover shortly). This is a great opportunity to follow
another diverge-converge pattern or liberating structure as introduced in Chapter 4,
Open Culture, like 1-2-4-all. Getting everyone thinking about the work needed to deliver
this increment and visualizing this is a really powerful aspect of this approach. This is
detailed design on steroids! This is fixing bugs before we even write any code! As we
saw in the example Sprint Board, using different-color sticky notes to reflect different
types of tasks such as frontend, backend, data, design, infrastructure, operations, and
so on, visualizes the cross-functional nature of the team.
Figure 12.17: Product Owner Paul continues to iterate on his goal while a Development Team member pulls items from the Product Backlog into the Sprint Backlog
After much collaboration, merging together of individual ideas for tasks into pairs, and
then into pairs of pairs, and finally into a full group, we should have a Sprint Board,
a complete visualization of the week ahead. The team should be fully aligned on the
work that needs to be done. Now, can we achieve this? We finish the Sprint Planning
event with a Confidence Vote – how confident are we? Can we meet the Sprint Goal by
delivering the features in the Sprint Backlog? Can we satisfy all the Acceptance Criteria
agreed and complete all tasks by the date of the Sprint Review on the board? Team
members vote with their fingers—0 or 1 means very low confidence and we need to
re-scope or discuss what it will take to increase confidence. A group consensus of 4 or
5 is what we want, and this indicates confidence from the outset.
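As a tiny illustrative sketch, the way a facilitator might interpret such a vote can be expressed in Python; the consensus rule below is one common convention, not something mandated by Scrum.

# A sketch of summarizing a Fist of Five confidence vote (votes range from 0 to 5).
def confidence_summary(votes):
    if min(votes) <= 1:
        return "very low confidence - re-scope or discuss what would raise confidence"
    if all(vote >= 4 for vote in votes):
        return "high confidence - proceed with the Sprint Goal"
    return "mixed confidence - talk through the low votes before committing"

print(confidence_summary([2, 4, 3, 2, 4, 5]))   # mixed confidence
print(confidence_summary([4, 5, 4, 4, 5, 5]))   # high confidence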
As you can see from the photographs, those big 8-foot by 4-foot foam boards, which we
explained in Chapter 5, Open Environment and Open Leadership, are incredibly powerful
as Sprint Boards. They are light and portable and can easily be transported to different areas of the building, for example, to the Daily Scrum or to a Daily Stand-Up event.
Iteration or Sprint Planning has its own entry in the Open Practice Library and is
an excellent resource to find out more, contribute improvements, and discuss with
the wider community the use of this practice. You can find this page at https://
openpracticelibrary.com/practice/iteration-planning/.
The Daily Scrum is often an event where Scrum-but antipatterns start to emerge.
Scrum-but represents activities or behaviors that the Scrum team exhibit that do not
conform to the Scrum framework. For example:
• We use Scrum but all of our management shows up to Daily Scrums to get status
updates.
• We use Scrum but our Daily Scrums tend to run to between 30 and 60 minutes.
• We use Scrum but we vary the time of our Daily Scrums and who attends.
These are all anti-patterns to an incredibly simple practice that is purely for the
Development Team to synchronize with each other and highlight any blockers. Jeff
Sutherland often says if your Daily Scrum takes more than 15 minutes, you're doing it
wrong.
If you have experiences to share about this practice or suggestions to improve it
further, please go to the Open Practice Library page at https://openpracticelibrary.
com/practice/daily-standup/.
Let's fast-forward to the end of the Sprint and look at a very valuable practice, the
Sprint Review Event or Showcase.
We will explore much more about the measurements and learnings, including those
captured from Showcase events, in the next chapter. The Showcase should not just be
limited to an event. We need to seek other creative ways to show the world what we've
delivered and how we've delivered it in this latest iteration of the Delivery Loop.
As mentioned earlier, our Engagement Leads tend to do a short write-up to summarize
progress and share the latest Product Increment and other artifacts. They send this via
email to all stakeholders and anyone interested in the product and/or team. Visually,
this can be enhanced by including a few photos of the event, or a short video montage
of the team in action and the product evolving from the last sprint. Our engagements nearly always produce a series of weekly videos, which, when watched one after the other, tell an amazing story.
As we explained with the Product Increment Artifact, having working software is
the most important ingredient of the showcase. Being able to provide a link to the
application and inviting the Showcase audience to touch and feel the user interface is
very engaging. Having a laptop in the workspace that always has the latest increment of
the working software product running is great for stakeholders visiting the workspace,
walking the walls, tracing feature history through all the practices, and seeing them for
real in the product.
Finally, Showcases are limited in value if we do not get feedback from them. Great
ScrumMasters and Engagement Leads will be creative around ways to do this. Don't just
limit this to an "any questions or feedback?" session at the end. Allow online feedback in a
Google document, or open a feedback channel on Slack or via SMS. Invite value voting
from stakeholders or use Net Promoter Score to capture fresh metric-based feedback
on how likely stakeholders are to recommend the product and/or team.
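For reference, Net Promoter Score is calculated as the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6). A quick Python sketch, with invented survey responses:

# Net Promoter Score: % promoters (9-10) minus % detractors (0-6) on a 0-10 scale.
def net_promoter_score(scores):
    promoters = sum(1 for score in scores if score >= 9)
    detractors = sum(1 for score in scores if score <= 6)
    return round(100 * (promoters - detractors) / len(scores))

showcase_feedback = [10, 9, 8, 7, 9, 4, 10, 6]   # invented stakeholder responses
print(net_promoter_score(showcase_feedback))     # prints 25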
Feedback like this shows nicely how the learning from a Delivery Loop can
trigger the need to return to the Discovery Loop. We'll be exploring this and other
patterns when we return to the Options Pivot in Chapter 18, Sustain It.
To share experiences, learning, contribute improvements to this practice, or simply
read more about it, take a look at https://openpracticelibrary.com/practice/
showcase/. We'll also return to other ways to capture metric-based feedback in the
next chapter. The Showcase is one of the main forums for collecting this. Another one is
the Sprint Retrospective Event.
If you were involved in a long traditional Waterfall program, you may have done some
kind of Lessons Learned exercise at the very end. This is where everyone sits down and
captures all the things they might have done differently if they were to start over again
now. It all gets written up and filed away in some shared drive. If we're lucky, some
other team or project might benefit from this learning but, often, it never gets looked at
again. Well, Retrospectives are just like this BUT they happen much more regularly and
the team in question gets to benefit from the learning as they go along!
Retrospectives can happen at any time. In fact, you may well want to schedule a
Retrospective after a long Event Storming session or after a first Product Backlog
prioritization meeting. It is a foundational concept in the Open Practice Library as it
is a tool that can be used at any time. A Sprint Retrospective is an event we run at the
end of the Sprint or Delivery Loop. The Scrum framework states that it should happen
immediately after the Sprint Review.
There are hundreds of different formats, styles, and themes of Retrospectives. People
are coming up with new, fun, creative, and energetic ways to run Retrospectives all the
time. Check out the Retrospective page of the Open Practice Library19, which has links
to many of these. If you have a link to your own personal favorite or have run your own,
you could submit a pull request and add it!
We're going to share a few different ways we have run Retrospectives and differentiate
between Retrospectives of breadth and Retrospectives of depth.
A Retrospective of breadth essentially inspects the time period the event is focused on and asks what worked well, what did not, and what could be done better next time to address the things that did not work so well.
Our recommended approach to a Retrospective is to use a diverge-converge technique
or liberating structure to ensure everyone has an opportunity to contribute their
feedback. So, start by allowing a few minutes for everyone to silently write down their
feedback on what worked well, what didn't, and corrective action that could be taken.
Then we converge and merge everyone's ideas together on an information radiator. We
cluster common themes and have short discussions to achieve alignment. Finally, we
agree on a set of corrective actions to come out of the Retrospective, prioritize them,
and agree on owners to take them forward.
19 https://openpracticelibrary.com/practice/retrospectives/
The next Retrospective should always open by reviewing the previous Retrospective's
actions. The following are a few alternative ways to ask those questions and provide
headings on the Retrospective's information radiator:
• What should we START in the next sprint that we're not currently doing? What
should we STOP in the next sprint that we are currently doing? What should we
CONTINUE doing? What should we do MORE OF? What should we do LESS OF?
• What should we ADD? What should we DROP? What should we KEEP? What
should we IMPROVE?
• What made us GLAD in the last sprint? What made us MAD? What made us SAD?
• What did we LIKE in the last sprint? What did we LEARN? What did we LACK?
What did we LONG FOR? This format is known as the 4Ls.
Preparing a simple canvas for one of these approaches will allow the team to collect and
organize their feedback and radiate the resulting learning and actions for all to see.
A slightly more creative and fun approach that has become popular is to use analogies
to reflect and inspect the previous sprint. Canvases can be prepared with a theme
such as:
• The Sailboat: What is our island—the goal or target we are aiming for? What is
the wind—the things that will take us there? What things represent the anchor
dragging us back and slowing us? What are the rocks or the risks that we might
be overlooking?
• The Hot Air Balloon: The hot air represents the things that are helping us rise
up. What represents the weights that might be dragging us down? What is the
sun that we want to steer towards for a smooth ride? What is the bad weather
that might give us a bumpy ride?
• The Motor Racing Car: What is the engine moving us forward? What is the
parachute slowing us down?
There are many other similar formats: the Three Little Pigs Retrospective with the
houses made of straw, sticks, and bricks; the Star Wars Retrospective with the light side
and the dark side. We've even followed musically themed Retrospectives such as Elvis
and ABBA, where song titles represent the different inspection aspects.
Some of the common pitfalls of Retrospectives include focusing far too much on the
negative and not enough on what has worked well. A good way to look at this is if
something has been working really well, how might we turn up this behavior and do
even more of it? Another common pitfall is failing to capture Retrospective items. It's
very easy to whine for an hour about all the things that have annoyed us. But what
tangible thing can we do to fix it or at least make it better?
On the other hand, sometimes we come up with too many things we want to fix and too
many ideas for improvement. Use a prioritization practice (such as those outlined in the
previous chapter) and radiate the actions on their own backlog, for everyone to see and help with.
Retrospectives should be fun and engaging. They should be something the team looks
forward to and has everyone motivated to inject energy into. This is an opportunity for
the team to continuously improve.
To help keep them engaged, especially for long-lived product teams, it really helps to
mix up the format. Try new ideas out for Retrospectives. Rotate the facilitator around the team so everyone shares ownership of this practice, figuring out what works well with Retrospectives, what we should do more of, and what doesn't work. In other words,
sometimes we should run Retrospectives on our Retrospectives!
A final tip comes from years of experience running Retrospectives away from the
normal place of work and what we infamously call The Pub Retro.
Of course, it's important to bring all the artifacts and learning captured back to the physical workspace and to ensure that the Retrospective actions are included in backlogs, so they are prioritized and actioned.
Experiment with the location of Retrospectives. My teams have gone
beyond pubs by doing things like ten-pin bowling, urban golf, and meeting at
a nice café and having a retro-breakfast together!
The team had their Definition of Ready up on the wall and had been running
short Product Backlog Refinement sessions to get the top ten items of the
Product Backlog ready according to the Definition of Ready. This meant
talking through the items with Valerie (who occasionally would bring in a stakeholder or two to support the conversation) and writing acceptance criteria.
The team used the Confidence Voting practice to assess whether they felt
the item could be delivered within a few days of development. They would
also talk through and update the emerging Logical Architecture diagram to
ensure they had a collective understanding as to how the feature would be
built. Dave had been doing lots of sketches and wireframes for the UI and
would add these to the conversation at the appropriate point.
The Sprint cadence was set. Sprint Planning would take place between
10:00 and 12:00 every Thursday morning. Daily Scrum meetings would be
scheduled between 09:30 and 09:45 every morning. The Sprint Review
showcase would take place on Wednesday afternoons at 15:00, immediately
followed by the Sprint Retrospective, which would be held off-site at 16:30.
Next up on the agenda was to pull items from the Product Backlog, do a
final check that they were ready, and place them in the Sprint Backlog. Eoin
suggested the team copy the sticky notes and leave the Product Backlog
information radiator complete. He would mark the items on the Product
Backlog as they were added to the Sprint 1 Backlog. Eoin also had the idea to
show on the Sprint Backlog what outcome the item was linked to. This was
easy to see by referring to the Value Slice board. Overall, this simply meant they had a great information radiator showing how outcome-focused the team was.
The team pulled the top ten items on the Product Backlog into the Sprint
Backlog. They talked through and made some final edits to the acceptance
criteria. Valerie was excited because this would deliver the entire top slice
of value from her Value Slicing in just one week! Aidan and Ciaran started
to get a bit nervous as the sprint backlog filled up. They were really pleased
that the sprint balanced functional development with non-functional work and work focused on operations. The team also loved that they could do some
educational workshops on CI/CD and put the long-awaited breakfast and
lunch planning in place to help improve the culture. But there was some
nervousness around the TLS security and ensuring the performance. This
would need some technical spikes and research and, with just four working
days until the first Showcase, Aidan and Ciaran were worried that they
would not have the time to do everything.
They still proceeded to the second part of the Sprint Planning event and
decomposed all the items into tasks. They decided to color-code their
tasks, which would all be written on square sticky notes. Pink stickies would
represent application development. Blue stickies would represent UI/UX
and frontend development. Light green tasks would be focused on data
configuration. Purple stickies would be automation. Dark green stickies
would be infrastructure and platform work. Yellow stickies would be testing.
There were some other tasks that were focused on culture, team building,
and other squishy/fluffy stuff! They decided to call these squishy tasks, and
these would be on the orange stickies.
The team divided into pairs and each pair focused on two features. Every
5 minutes, they rotated the story so every pair could think about tasks for
every feature. All tasks were then added to the Sprint Board, duplicates were
removed, and a discussion took place on each task, until the teams were
confident they had all the low-level details to deliver regarding each item on
the Sprint Backlog as per the agreed acceptance criteria.
Finally, Eoin facilitated a confidence vote using the Fist of Five. After a count
of three, everyone displayed with their fingers (one to five) how confident
they were that they would achieve the Sprint Goal, by delivering the sprint
backlog according to the acceptance criteria and tasks set out by next
Wednesday at 15:00. The votes ranged from two to four with Ciaran, Aidan,
Emma, and Jen all voting with just two fingers. The nervousness around
getting all ten features done was growing. After some discussion, Valerie
suggested removing the security- and performance-related items and
putting them back on the Product Backlog. A second confidence vote now
had everyone voting either four or five and there were many more smiles
around the team!
The Sprint Board was ready, which included a Burndown chart to visualize
the progress of the 26 tasks on the board.
Now we've seen Sprint 1 Planning complete, let's see how the team got on with its
delivery.
The Development Team adjourned with Eoin to the local pub, bringing a load of sticky notes with them. They ran their Sprint 1 Retrospective using the simple Start-Stop-Continue format. The team collectively
came up with a great summary of the sprint and they agreed that they
should:
• CONTINUE pairing and mobbing
• STOP being late for Daily Scrums
• START holding free weights during Daily Scrum updates to stop
updates from going on so long
• START a social playlist to have some music in the afternoon
• START using the Definition of Done practice (including writing the
Definition of Done in Sprint 2)
• CONTINUE using and improving the CI/CD solution
• START having more information radiators such as build monitors and
test scores
The team raised their glasses and toasted SPRINT ONE DONE! The next
morning, they'd start all over again with Sprint 2 planning.
These PetBattle stories provide an example of the full Scrum delivery cycle. As the team chooses to do more Delivery Loops, the same processes are repeated.
There are many tools that will emulate every practice and the way of working used
physically in the room. We look to replicate every practice we would use in a room
to work online and in real time. So, if a team used sticky notes and whiteboards to
collaborate during Sprint Planning, they need to be able to do that using a tool. If they
use a big board for the Product Backlog, they need a tool to host that online. If they
used Confidence Voting or planning poker for relative estimation, they need a tool to
allow them to continue to do that. If they use a Burndown chart in the room, they need
to find a way to keep doing that online.
To help you get started, we have provided a simple Sprint Board that can be used to
build Sprint Backlogs during Sprint Planning, plan for feature demos during Sprint
Reviews, and radiate the learnings during Sprint Retrospectives. This is available to
download at https://github.com/PacktPublishing/DevOps-Culture-and-Practice-
with-OpenShift.
To complete this section about Scrum, we're going to drop a bit of a bombshell. Scrum
is sub-optimal. Scrum does not comply with Continuous Delivery. Scrum can actually
create bottlenecks in delivery rather than removing them!
You might now be wondering why we have spent so much time and pages explaining
Scrum, sharing stories and learnings from using Scrum, and having our PetBattle team
adopt Scrum. Well, Scrum provides a much-needed cadence and set of guardrails
to start working in a more agile manner. Often, new or immature teams need these
guardrails and frameworks to stay on the right path and ensure value is being delivered.
But, as teams mature and get better, there may be a time when we can remove some or
all of those guardrails.
The examples above show how Scrum might actually inhibit continuous delivery. Teams can, of course, release more frequently than just once per sprint; the Scrum framework does not say that you can or should only release once.
Teams may also feel, when they reach these points of realization and maturity, that they would be able to deliver more continuously if they moved to a different approach to delivery and adopted Kanban.
Kanban
Kanban is Japanese for signboard. Kanban has its roots as a scheduling system used in lean manufacturing in the car industry and, in particular, at Toyota.
In the context of Agile, Kanban is another methodology. It specifically strives to better
coordinate and balance work amongst team members. Most famously, it employs
a Kanban to help visualize the process. The board is split into categories of work to
be done, work in progress, and completed work. Each task is written onto a card that
moves from column to column as it progresses through the team's process. Tasks are
prioritized. The board keeps everyone on the same page and is highly visible, so it
allows corrections to be made easily.
Kanban Board!
When running our DevOps Culture and Practice
Enablement workshop with members of our
Open Innovation Labs team in 2018, we had the
pleasure of meeting our colleagues in Japan.
A lot of the lean culture and processes we
describe in this book originate from Japan, so
we always learn a lot from collaborating with
our peers there.
During the course, we referred to the Kanban
board many times, which seemed to generate some laughter! Our Japanese
colleagues eventually told us not to say Kanban board, as we were really saying signboard board. That's why we now refer to the artifact as the
Kanban.
Kanban requires strict limits on the number of tasks in progress at any given time.
This is called the Work In Progress, or WIP limit. No new work can enter the column
once the WIP limit is reached. Teams need to work collaboratively to fix issues, identify
bottlenecks, and get tasks Done. This collaborative work style leads to continual
improvement of the team's processes. The team can meet periodically to discuss
changes that are needed, and these are displayed on the Kanban.
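As an illustrative sketch, independent of any particular tool, the effect of a WIP limit on a column can be expressed in a few lines of Python; the column name, limit, and card titles are assumptions.

# A sketch of a Kanban column that refuses new cards once its WIP limit is reached.
class KanbanColumn:
    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.cards = []

    def pull(self, card):
        if len(self.cards) >= self.wip_limit:
            raise RuntimeError(
                f"WIP limit ({self.wip_limit}) reached in '{self.name}': "
                "finish or unblock existing work before pulling more"
            )
        self.cards.append(card)

in_progress = KanbanColumn("In Progress", wip_limit=3)
for card in ["Log In", "Choose Item", "Submit Order", "Leaderboard"]:
    try:
        in_progress.pull(card)
        print(f"Pulled: {card}")
    except RuntimeError as reason:
        print(reason)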
Scrum and Kanban are not mutually exclusive. In fact, in our PetBattle example, we
walked through the team using Scrum but, in the Sprint Board, they were using Kanban
to visualize and track work.
Kanban has its own page in the Open Practice Library, which can be used for more
information, discussion, and improvement, at https://openpracticelibrary.com/
practice/kanban/.
An important practice that we have not yet introduced and that applies to teams using
both Scrum and Kanban is the Definition of Done.
Earlier in this chapter, we talked about how we write Acceptance Criteria against features or stories on the Product Backlog. These are specific criteria that apply to the context of the feature in question and only that feature. The Definition of Done is an additional set of criteria that also needs to be considered in the planning and delivery of every feature.
So, if we agree that code is always checked into a source-code control system to trigger a CI build, and that a feature should not be considered done unless this has happened, we can add this to the Definition of Done. If a team agrees that a certain percentage of the code should always be covered by automated tests and that we should never fall below that percentage, we can write Definition of Done criteria to manage this.
This provided a great starting point and, in particular, Aidan and Emma
recognized that the Definition of Done was a great tool for ensuring
the non-functional aspects that they had brainstormed using the
Non-Functional Wheel could be considered for each and every item.
Some of those items had ended up on the Value Slicing board because
there was specific work involved in configuring them. But many of them
needed team behaviors or some automation to be focused on continuously
addressing the non-functional requirements (NFRs).
The team revisited all of the items on the Non-Functional Wheel and
categorized each item into one of three lists:
1. Requires platform setup/config/spike
2. Comes for free with platform
3. Team behavior/principle
The Definition of Done provides a place for non-functional requirements and quality
to be managed effectively. While it starts as a big visual information radiator, it drives
more and more quality into the product. It also provides many opportunities for
automation. Given that the Definition of Done criteria will be tested for each and every backlog item accepted into a sprint, the same tests to meet the criteria should be run over and over again. So, automation is certainly our friend here. To
explore the Definition of Done practice further, visit the Open Practice Library page at
https://openpracticelibrary.com/practice/definition-of-done/.
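For example, the earlier criterion of never letting test coverage fall below an agreed percentage lends itself to a simple automated gate in the pipeline. The Python sketch below is illustrative only; the threshold and the way the coverage numbers are obtained are assumptions, and a real pipeline would read the report produced by its coverage tool.

# A sketch of automating one Definition of Done criterion: fail the build if
# test coverage drops below the percentage the team has agreed on.
import sys

AGREED_MINIMUM = 80.0   # percent, agreed and radiated by the team

def check_coverage(lines_covered, lines_total, minimum=AGREED_MINIMUM):
    coverage = 100.0 * lines_covered / lines_total
    print(f"Coverage: {coverage:.1f}% (agreed minimum {minimum:.1f}%)")
    return coverage >= minimum

if __name__ == "__main__":
    # In a real pipeline these numbers would come from the coverage report.
    if not check_coverage(lines_covered=812, lines_total=1000):
        sys.exit(1)   # a non-zero exit code fails this CI stage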
Now that we have introduced the Definition of Done, when we plan sprints or the tasks
to deliver a specific feature, we can improve the questions that teams should ask about
each backlog item. What tasks are needed to deliver this feature to satisfy the agreed
acceptance criteria and comply with the Definition of Done? And, as Retrospectives
continue to inspect how to improve, the team should consider what they can add to
their Definition of Done to improve the overall product and what to automate to make
Done a part of CI/CD.
When a team does not have a Definition of Done visible or they are failing to follow it or
build it into their CI/CD pipeline, it is a Bad Agile Smell that tends to result in the rise
of technical debt and the slowdown of team velocity. Let's look at a few other Bad Agile
Smells we've seen.
These Bad Agile Smells should be kept in mind and, if they start to appear,
we recommend using the Retrospective event as a starting point to discuss and
course‑correct.
Conclusion
In this chapter, we progressed from the Discovery Loop and Options Pivot and focused
on how we deliver features into products using practices on the Delivery Loop. We
explored different domains of delivery using the Cynefin framework and saw how
Waterfall remains effective for work in the clear domain, whereas Agile is more suitable
for work in the complex and complicated domains.
We explored where Agile came from, the Agile Manifesto, and took a detailed look into
Scrum and Kanban methods, the practices they use, as well as supporting practices
such as the Definition of Ready and the Definition of Done.
Figure 12.31: Adding Delivery Loop practices and more practices to the Foundation to support delivery
We can now see how Agile frameworks and practices help achieve continuous delivery
when using platforms such as OpenShift and, when coupled with high-performing
teams and an autonomous culture, we can deliver outcomes that matter earlier and
much more frequently.
In the next chapter, we're going to dive deeper into the Measure and Learn part of the
Delivery Loop and look at different mechanisms and tools that facilitate measurement
and learning.
Measure and Learn
13
Startup success can be engineered by following the process, which means it can be
learned, which means it can be taught. – Eric Ries
In his book, The Lean Startup, Eric Ries describes a startup company as a human
institution designed to create a new product or service under conditions of extreme
uncertainty. He outlines a process to help deal with this uncertainty where a tight
feedback loop is created around the creation of a minimum viable product (MVP).
He argues that being able to react, fail fast, and use a data-driven approach to
measurement assists in decision-making that is based on reason rather than emotion.
This ability to learn from small experiments can be seen as a form of business agility –
the ability to pivot quickly in the face of ever-changing circumstances. In lean terms,
this feedback loop can be summarized as Build, Measure, Learn.
The cultural and human aspects of this process cannot be overlooked. Uncertainty and
humanity are common bedfellows. Ron Westrum, an American sociologist, posits that
organizations with better "information flow" function more effectively. He argues that,
first, a good culture requires trust and cooperation between people across the organization,
and therefore reflects the level of collaboration and trust inside the organization.
Second, better organizational culture can indicate higher-quality decision-making.
In a team with this type of culture, not only is better information available for making
decisions, but those decisions are more easily reversed if they turn out to be wrong
because the team is more likely to be open and transparent rather than closed and rigid.
So, how can we take these ideas and make them actionable in our delivery processes?
When timelines for delivery are tight and deadlines are fast approaching, a team's
ability to deliver and operate software systems is critical to the business performance.
With software, there are often two competing forces at work. Innovation, which
inherently is accompanied by system change, and running software, which is serving
end customers and implies that the system is stable. We can identify two important
areas to focus on here:
• Measuring the effectiveness of a team's development and delivery
practices
• Measuring and monitoring activities that allow the rapid diagnosis
of issues
Metrics-Driven Transformation
Metrics-driven transformation focuses on using value-based business metrics to
understand how technology-related investments impact organizational performance
and provide specific tools and guidance to help improve those metrics.
In the previous chapter, we looked at different approaches to doing delivery, whether
that be Waterfall, or using an Agile framework such as Scrum or Kanban. As we
complete loops of delivery, we want to take value-based metrics to validate hypotheses,
confirm the results of experiments, clarify the impact of our feature deliveries,
determine whether we have moved toward the Target Outcomes we set out, and make
decisions around what to do next.
There are many different levels of measurements we can take in our delivery ecosystem
and a growing number of sources we can collect them from. In this chapter, we will
explore metrics we can collect automatically from our software and our platform as
well as practices we can use to collect metrics from our users, customers, employees,
and the wider organization. Let's start by revisiting some of the practices we've
already used and see how we can use them to collect measurements and learning.
When measuring the outcomes, we need to be aware of any prior measurement that
may form an historic baseline, and potentially what any estimate of value may be
against the actual value. Thinking about these different forms of measurement leads
us to ask the question, Where and when should we take measures and learn, and where
should we inspect them?
The Showcase
Ideally, we take measurements as soon as a delivery item is complete. That might
be at the end of a sprint in Scrum or after delivering a completed feature in Kanban.
Perhaps we can build metrics data collection into the application code and run a report
just before or during a Showcase. For example, if we have a Target Outcome around
increasing the user base to 10,000 people, every Showcase could provide an update on
what the current user base is and whether it is moving in the right direction toward the
target.
There may be a lag in the availability of data that means it cannot be presented at
Showcase events. In this case, there are two plans of attack. First, we can use the
Showcase itself to collect some measurements and learning from stakeholders.
Alternatively, we could collect measurements and learning from the users of the latest
Product Increment. But how?
Perhaps we have an outcome around stakeholder confidence and/or user happiness
in the application we are incrementally building. If so, then what better opportunity
than to ask the stakeholders after seeing the Showcase of the latest increment how
confident or happy they are with what they have seen? This can be quantitative, in that
we can ask stakeholders to rate the product on a score of 1-10. It can also be qualitative
through the conversation that happens around this scoring and any supplementary
feedback collected.
Perhaps we have an outcome around employee engagement and team skills. Again,
what better opportunity than at the end of a Delivery Loop iteration to survey the
team on their happiness level and ask them to rate themselves on different skills?
Visualizing these results will not only allow us to see trends, but also identify the positive
and negative effects of the team education and cultural activities that we are
undertaking.
Showcase events allow the opportunity to show many different kinds of metrics:
Software Delivery metrics, Platform metrics, and Team Velocity metrics. We will explore
the details of these shortly. If these are things that will be of interest to the audience
and help them understand the impact of everything the team is doing, by all means,
include them. However, the event where a more in-depth conversation can happen is
the event that typically follows the Showcase, the Retrospective.
The Retrospective
We introduced the Retrospective practice in the previous chapter and looked at many
different formats of running them. Let's consider metrics a little further and also an
engineer's perspective of Retros.
The Retrospective –
an Engineering Perspective
Back in the late 1980s, I vividly remember being
introduced to feedback loops and control
theory in chemical engineering classes, way
before software engineering totally enveloped
my career! If you wanted to control the level
in a tank, the flow rate in a pipe, or nearly any
dynamically changing system, you needed
to learn how to keep that system stable. The
software engineering practice of Retrospectives
always makes me think of those second-order feedback loops, the physical
connection back into our software design and development process that
allows us to learn and adapt so that the system can become more stable.
Retrospectives all have a common goal of allowing the team to inspect what
has just happened, while allowing them to adapt and refine what goes on in
the future. This is a critical function. If there is no feedback or feedback is
not actioned, the team may start to lose faith in the whole delivery process.
In the engineering world, if feedback fails, the tank overflows!
It should come as no surprise that metrics should be included as hot topics
in any Retro. Teams that discuss their poorly performing SDO or CI/CD
metrics can gain a lot of insight into what is going wrong with the software
delivery process. So, celebrate when the build starts to take 20 minutes to
complete. This means that the metrics measurement and trending are in
place and the team can now take action to find out why it has gotten so slow
and seek to improve it.
It's possible to save a lot of wasted time and resources by listening to timely
feedback. If feedback is ignored, you can incur an opportunity cost by
spending the next year carrying out remediation projects instead of rolling
out new, revenue-generating features. The moral of the story is to Retro,
Retro, Retro! You can really never have enough feedback.
With all the tools and technology put in place when we created the technical
foundation (in Chapter 6, Open Technical Practices – Beginnings, Starting Right, and
Chapter 7, Open Technical Practices – The Midpoint), there are huge amounts of data,
metrics, and analysis we can collect and conduct. If you run pub Retros (as introduced
in the previous chapter), there is nothing better than taking a few print-outs of reports
from these tools and taking them down the pub to analyze together over a pint of
Guinness!
Export your build statistics, latest test results, static code analysis reports, Burndown
charts, and whatever else you can find, lay them all out on the table, and ask yourselves:
what does this data tell us about us as a team that we don't know already? What can we
learn about ourselves? What patterns exist? What can we do in the next sprint to make
these measurements better?
Retrospective actions that are often taken include an increased focus on CI/CD
infrastructure, increasing thresholds around code coverage for testing, and adding
more or faster feedback loops for the team to learn from.
Figure 13.3: Inspecting the build time statistics at Retrospectives (the graph has been
annotated with a few labels to reflect the sprint that the data was a part of)
As we can see, the build was very unstable in Sprint 1. A much earlier version
of this chart was taken to the pub Retro at the end of Sprint 1. The team
inspected it and agreed that, during Sprint 2, they would invest some time
investigating the build stability issues.
Two weeks later, Sprint 2 was completed and the ScrumMaster brought
another print-out of this chart from Hudson. The good news was that
the Sprint 1 Retro action had paid off as the build was much more stable.
However, it was noticed that the build was sometimes taking more than 15
minutes to complete. This was a much longer feedback loop than needed,
so a further retrospective action was taken in Sprint 3 to address this. We
can see that from Sprint 3 onward, the build was mostly stable and relatively
quick.
Imagine if we had not used the Retrospective to look at this data. Imagine
if we had let the slow build just fester over time. Imagine how much time
would have been lost. This is why we do metrics-driven Retrospectives.
As we come out of the sprint, we now have the opportunity to learn the results of
experiments we've designed.
Of course, you can't validate everything through the use of numbers. Some
commentary with further learning and understanding is needed by talking to end users,
known as qualitative feedback. Let's explore a couple of practices to help us do this,
starting with user testing.
User Testing
User-based testing is a technique that focuses on user interactions with a product.
These types of evaluation directly involve end users and focus on the person. Let's dive
a little more into this by looking at two user testing practices: usability and guerilla
testing.
Usability Testing
In a usability testing session, the team observes real users interacting with the product.
Typically, a facilitator sits with a user and asks them to complete tasks and explain their
thinking as they go. The team sits in a separate room and observes the testing by video
link.
A usability test is not a focus group; it's focused on what the user thinks and does in the
real world. An Empathy Map, as introduced in Chapter 8, Discovering the Why and Who,
can be a very useful supporting practice.
A usability test can be run on an existing product, a prototype, or even a competitor's
product. The prototype could be working code, or it could be something as simple as
a few clickable images. Test early and often to create products that delight users and
solve real needs.
Usability testing often highlights something that's obvious to someone who has been
working on a product, but might be confusing to a user. What we think users need
might not be what they actually need. Indeed, what users think they need may not be
what they actually need! Usability testing can help answer questions such as, Are we
on the right track? What problems do we still need to solve?, or Which features should
we build next? With early feedback from real users, teams can avoid sinking time into a
feature that's confusing or not useful.
The business analyst who was running the usability test immediately phoned
in to say that the app was broken. We replayed the exact thing the user did,
only to discover that the bank's API was unable to accept a decimal point.
But for us, the real surprise was that no one on our team had ever noticed
this issue before. As a small team, we were quick to fix the issue by updating
the placeholder text and sending only an integer to the API.
This example always reminds me that we are not our users. Users do
weird and wonderful things in an application. Even if you try to test all the
scenarios, there is almost always going to be some edge case you have not
thought of, and the chances are the first person to use your app will do the
one weird thing!
Read more about the usability testing practice, share your own experiences,
ask questions, and improve the practice further in the Open Practice Library at
openpracticelibrary.com/practice/usability-testing.
Some of the challenges you may encounter when trying to organize usability testing
include getting actual customers to access legacy systems or not having the time,
money, or experts to conduct this level of testing. Guerrilla testing offers a low-cost
alternative or supplementary form of qualitative user testing.
Guerrilla Testing
Guerrilla testing is a low-cost, lean, and Agile method of collecting data for testing and
validating a hypothesis in a short session focused on specific tasks. Participants are
not recruited in advance, but instead are approached in a number of environments by
the team, where similar demographics are targeted; for example, customers in coffee
shops, or administrators in an office environment.
This testing provides a simple method for collecting enough data to make well-
informed strategic design decisions. It can also assist senior stakeholders and product
teams in understanding the importance of usability testing and customer feedback.
Everyone on the team can facilitate without any research experts. It is a flexible
approach that can be implemented at any stage of product development.
Read more about the guerrilla testing practice, share your own experiences, ask
questions, and improve the practice further in the Open Practice Library at
openpracticelibrary.com/practice/guerilla-testing/.
What to Measure?
What you measure is what you get. – H. Thomas Johnson
In assessing what to measure as part of the Build, Measure, Learn feedback loop, we are
going to be standing on the shoulders of giants. There is a whole literature section on
DevOps metrics available to us and we are going to call out our current favorites here.
Top of that list is the DORA State of DevOps report1 and the book Accelerate,2 in which
the researchers describe how they have taken a data-driven approach to measuring
DevOps culture and practice. According to the DORA report, effective IT delivery
organizations take about an hour to get code from being committed to trunk (in Git) to
"running in production". This sounds great! So let's look at some of the detailed metrics
that allow us to home in on such a goal.
Traditional measurements such as lines of code written, team utilization, and the number
of stories delivered per sprint are often used as proxies for progress. Unfortunately, if we
dig into these a little, we can quickly find problems with all of these measurements.
1 https://www.devops-research.com/research.html#reports
2 https://itrevolution.com/book/accelerate
Is a solution that can be written with a 100-line piece of code better than a 200-line
piece of code? On the surface it would seem so, but which solution is easier to maintain
over time? Which solution is clearer and easier for new developers to discover and
figure out? It may be that the 200-line piece of code is much easier to maintain and
learn for the team rather than the expertly crafted bit of black magic code that only
the expert understands. If a team is 100% utilized on feature development, when do
they get time to learn, do unplanned work, or manage technical debt? Not having time
for these activities ultimately slows delivery and innovation to a grinding halt. If a
team delivers 10 stories per sprint, is this better than a team that delivers 5 stories per
sprint? Can we really compare different teams' output when they work on unrelated
work items? Probably not. Are the user stories the same size and value for our business
and end customers? It is not easy to judge across teams with different products and
ownership.
These measurements are what we call output-based. We can instead shift the focus
from what a team produces more toward the goals or Target Outcomes. By focusing on
the global outcomes, the team members aren't pitted against each other by measuring
the wrong thing. A classic example is rewarding Development Team members for
throughput on new features and Operations Team members for service stability. Such
measurements incentivize developers to throw poor-quality code into production as
quickly as possible and operations to place painful change management processes in
the way to slow down change.
Not focusing on measurement, or on the right things to measure, and focusing on
outputs rather than outcomes is a quick way to get into trouble. Luckily, the DORA
report starts to lay out some of the key metrics that a team can use for software delivery
and operational (SDO) performance:
Software Development Metrics are metrics that capture the effectiveness of the
development and delivery process by measuring the following:
• Lead Time: The time from when code is checked in to when it is released into production
• Deployment Frequency: How often the team can release code into production
Lead time is a key element of Lean Theory: by shortening the amount of time it takes to
get product features to end users, the team can shorten the feedback loop for learning
what their end users like or do not like. If the team is building the wrong thing, they can
correct their course (or pivot) quickly when they have short lead times.
Another element to this is the size of the work delivered, or Batch Size. By delivering
small incremental value to end users quickly, lead times are kept lower. With a much
faster delivery cadence, how do we make sure system stability does not suffer?
Software Deployment Metrics are the metrics that capture system stability and release
quality. They are measured by the following:
• Time to Restore: The time taken from detecting a user-impacting incident to
having it fixed or resolved
• Change Failure Rate: The number of released changes that fail or result in a
user-impacting incident
What percentage of the changes made to production fail? How long does it take to
restore service to end users in the event of failure? Development teams will have
happier customers if outages are fixed quickly when changes occur and ideally have
changes that do not fail at all!
Service Operation Metrics capture operational performance via a metric called
Service Availability. Assume your product sales website has crashed. The company
accountant might ask the question, Is our application generating revenue right now?
Measuring service availability is a good way to link a technical goal with the desired
business outcome.
The metrics defined so far go a long way in helping a team understand key software,
delivery, and operations metrics that are desirable for better organizational outcomes.
To help capture and radiate these metrics, Red Hat has been investing in an open
source project that has resulted in a dashboarding tool called Pelorus3.
3 https://github.com/konveyor/pelorus/
Pelorus
Pelorus is an executive dashboard that helps visualize the progress we make on the SDO
success metrics. It makes use of open source tools such as Prometheus and Grafana to
track progress both locally (within a team) and globally (across the organization):
Pelorus consists of a set of exporters to customize data points to capture metrics from
different providers. The sources from which exporters automate the collection of
metrics are growing as more people contribute. This currently includes OpenShift (as
Deploy time exporter), Git providers GitHub, GitLab, and Bitbucket (as Commit time
exporters), and JIRA and ServiceNow (as Issue tracker exporters).
Using the data points that are collected from the providers, metric indicators are
calculated to represent a measure. Each outcome is made measurable by a set of
representative measures: Lead Time for Change, Deployment Frequency, Mean Time
to Restore, and Change Failure Rate.
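To give a feel for how Pelorus is configured, the values sketch below shows the general
shape of declaring exporters when installing the Pelorus Helm chart. The exact keys and
exporter names have changed between Pelorus releases, so treat this purely as an
illustration and check the project documentation for the current format:
exporters:
  instances:
    - app_name: committime-exporter
      exporter_type: committime
    - app_name: deploytime-exporter
      exporter_type: deploytime
    - app_name: failure-exporter
      exporter_type: failure
Each exporter scrapes its provider (a Git host, OpenShift, or an issue tracker) and exposes
data points to Prometheus, from which the Grafana dashboards calculate the four
measures listed above.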
Pelorus offers a great opportunity to have real-time information radiators next to teams
giving out these important metrics. Teams can also regularly inspect and discuss these
at Showcase and Retrospective events and ask themselves what improvement actions
or experiments they can run to try and improve these metrics further. What else should
we measure?
Most of these metrics are part of what is known as lagging indicators for performance.
In other words, they measure what has already happened. It is also possible to
formulate leading indicators that can help predict future performance from this data.
One such measurement example is based on the flow of work items or stories into and
out of a team. This net flow of work items allows us to predict the confidence of due
date delivery. As teams accept more and more work, this slows down their ability to
deliver items on time. So, the net flow of stories in and out of a team becomes a leading
indicator for measuring work item delivery performance.
It is surprising what start and end dates can tell us. The distribution of items with time
can also help categorize the type of work undertaken by a team. For example, normal
work items may look different to priority work items that come to the team from
production failures. Another example may be work that requires long wait times for
approval from, say, security or compliance teams. These would have large lead times
compared to normal work items that don't require such approval.
It can be very hard to measure Service Level Agreements (SLAs) properly. A service may be available but with
degraded service performance, for example. It can also become more complicated
when only some portion of your users experience a partial outage. Capturing this type
of SLA complexity and measuring it accurately is hard to do.
One benefit of SLAs is that they allow IT managers to quantitatively measure business
outcomes. So rather than having to deal with a generic qualitative complaint, such as,
"My application won't load and it is very slow", they can measure application availability
(uptime) and percentile page load speed instead.
A Service Level Objective (SLO) is an agreement about a metric within a given SLA. A
simple example may be that we agree to return search results quickly to our end users,
with an average search latency of 200 milliseconds. The Service Level Indicator (SLI) is
the measurement used to evaluate the SLO. So, we might define a target upper bound for
our search, for example, by specifying that the measured search latency for 99% of all
performed searches must be less than 300 milliseconds.
By quantitatively specifying and publishing SLAs, SLIs, and SLOs for our services, end
users can set their expectations on how well the service will perform. This prevents
qualitative complaints about the service being slow or over-reliance on a service where
users expect it to be more available than it actually is.
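To make this concrete, here is a minimal sketch of how such an SLO could be watched on
the platform using a PrometheusRule. The metric name search_request_duration_seconds_bucket
is a hypothetical histogram exposed by the search service; the 99th percentile and
300-millisecond bound mirror the example above:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: search-slo
spec:
  groups:
    - name: search.slo
      rules:
        - alert: SearchLatencySLOBreached
          # Fires when the 99th percentile search latency over the last
          # 5 minutes exceeds the 300 ms objective.
          expr: histogram_quantile(0.99, sum(rate(search_request_duration_seconds_bucket[5m])) by (le)) > 0.3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 99th percentile search latency is above the 300 ms SLO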
Any external system that has a lower availability SLA will negatively impact
the PetBattle V2 SLA when it fails. The PetBattle team writes the code and
runs the service, and ideally there are as few external service providers in
the Development and Delivery Loop as possible. Any new service that is
incorporated into the application must be designed to meet these SLAs.
Measuring Security
Malicious users are out there. How can we ensure that the user data that is kept in our
applications is not misused? Or that our application services are not put to unintended
usage, causing organizational or reputational harm? It is commonplace to see data
leaks and security breaches related to software applications and services in the media
today. Answering these questions is most often the primary concern of an information
security (InfoSec) analyst or team of people. A modern approach to help tackle these
security concerns is called shifting security left. The term is associated with teams
that build InfoSec into the software delivery process instead of making it a separate
phase that happens downstream of the development process.
Building security into software development not only improves delivery performance,
but also improves security quality. By designing and implementing security and
compliance metrics into the system, it becomes possible to measure continual
compliance against security standards:
PetBattle Security
The PetBattle team has asked the founders to hire an InfoSec professional
who can be part of the development process. They are worried about
potential data breaches of user details in PetBattle V2, especially if we need
to start collecting payment details when we start monetizing the site. The
PetBattle team has educated their developers of the common security risks,
such as the OWASP Top 10 and how to prevent them.
The team plans to implement security scanning and testing measurement
into their build pipeline. By using the OpenShift platform with a trusted
software supply chain, the team can spend significantly less time
remediating security issues that may arise.
The team has identified that everyone needs to become very familiar
with the detailed network design of the overall solution. This should help
avert attacks by malicious actors. The team also wants to make sure all of
the platforms and applications are easy to patch—so they can easily keep
frameworks up to date when Common Vulnerabilities and Exposures
(CVEs)4 arise.
4 https://cve.mitre.org
Having a secure system involves making sure all of the layers in the stack are
themselves secure – having a secure hardware environment, securing the operating
system, securing the container image layers being used, securing the dependencies
that your application uses, securing your application code, securing the network that
exposes your application services, and ultimately ensuring that your end users can
interact with your applications securely.
Measuring these security layers, making sure they comply with various industry
standards, and having concrete action plans for when new vulnerabilities arise, requires
teams to work together at every stage of the software delivery life cycle. Security
should not just be the Chief Information Security Officer's (CISO's) job. Done right,
security is pervasive and designed into the platforms and software systems, with
security professionals actively participating as part of the core delivery team and not
just being the Mr. No5 when penetration testing is carried out.
Security is one of those topics that needs its own book. To support shifting security left,
we will cover technical topics later in the book that include container image and
vulnerability scanning, the container health index, CVE patching, the OpenShift Compliance
Operator, the OpenShift Container Security Operator, and security policy enforcement with ACM.
These tools can help you to complement and build out a continually compliant platform
and application suite for yourself.
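As a simple example of what shifting a security check left can look like, a pipeline stage
can scan the freshly built container image and fail the build on serious findings. The
sketch below uses the open source Trivy scanner purely as an illustration; it is not the
specific tooling covered in the later chapters, and the image name is an assumption:
$ trivy image --severity HIGH,CRITICAL --exit-code 1 \
  quay.io/example/pet-battle-api:latest
# --exit-code 1 makes the command (and therefore the pipeline stage) fail
# when any HIGH or CRITICAL vulnerabilities are found.
A failing scan then becomes just another red build for the team to fix, rather than a finding
raised by a separate security review months later.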
Measuring Performance
There is an adage that software developers use—first make it work, then make it work
fast. The truth to this statement is that some functionality must exist before it can be
made to work fast! Qualities such as the performance and security of our applications
are usually the most important non-functional requirements and they must be designed
into the system from the very start in order to be successful. Let's define what we mean
by performance.
Performance measures how fast the system processes a single transaction. This can
be measured in isolation or under load. The system's performance has a major impact
on its throughput. When end users talk about performance, what they are usually
talking about is throughput; and they only care about the performance of their own
transactions, not anyone else's. As far as they are concerned, if a system's response
time exceeds their expectation, the system is down.
Throughput describes the number of transactions the system can process in a given
time span. The system's performance clearly affects its throughput, but not necessarily
in a linear way.
5 Mr. No is a Mr. Men book only available in France. Mr. No always disagrees with
everyone and everything. Mr. No is published under the title Monsieur Non in
France. This is one of the two Mr. Men titles that were not published in English.
If we measure the number of end user requests, the throughput will vary depending
on how many requests there are. This is a measure of scalability. When a system is
horizontally scalable, it means we can add capacity (more servers, more pods, or
containers) to handle more throughput. In a shared-nothing architecture, we can add
capacity until we reach a known bottleneck. For example, in OpenShift, this may be the
number of pods per node, or the maximum number of nodes per cluster.
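On OpenShift, this kind of horizontal scaling can be automated with a
HorizontalPodAutoscaler. The sketch below is illustrative only; the deployment name and
the 70% CPU target are assumptions, and older clusters use the autoscaling/v2beta2 API
rather than autoscaling/v2:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pet-battle-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pet-battle-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
The platform then adds or removes pods as load changes, up to the limits we set, which is
exactly the add-capacity-until-we-hit-a-bottleneck behavior described above.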
It is also worth noting that for any given system, an "acceptable response time" may vary!
For a mobile/web app, any response longer than a second or two means users will use
their fingers to walk away to another website or app. For a trading system at a bank, the
response time may be in the order of milliseconds or less. To understand the capacity a
system requires, we need to understand the system as a whole, before breaking it down
into its constituent parts. This is called systems thinking. By thinking holistically and then
in detail about a system, we can determine the bottleneck in a system.
At any point in time, exactly one constraint determines the system's capacity. Let's
imagine it is the database that limits the transaction throughput. Once we improve
that bottleneck—for example, by using faster storage or adding indexes, or using better
database technology—the next bottleneck in the system then becomes the performance
constraint—the application server capacity is now limiting throughput, say.
PetBattle Performance
When PetBattle is released to the world, users will gang up on us.
Sometimes, users come in really, really big mobs. Picture the Instagram
editors giggling as they point toward the PetBattle website, saying "Release
the cat hoard!" Large mobs can trigger hangs, deadlocks, and obscure race
conditions in our applications.
The PetBattle team wants to run special stress tests to hammer deep links or
hot URLs within the UI and API layers. Currently, there is a direct coupling
between users and the database, so the developers already know that some
form of caching could be required when scaling.
If we make monitoring a priority, we can refine our infrastructure and
application monitoring systems to ensure we are collecting information on
the right services and putting that information to good use. The visibility
and transparency yielded by effective monitoring are invaluable. Proactive
monitoring is a key part of a strong technical foundation.
The team plans to deploy the PetBattle application suite into an environment
where they can start performance testing parts of the system under load.
They want to start simply by targeting various parts of the system, for
example, the API to gain an understanding of system behavior under load.
By doing this early and often, they can identify bottlenecks and from there
work out a plan to fix them.
Like security, moving performance testing left in the build and deployment process
provides faster feedback and the opportunity to remediate and reveal issues early in
the development life cycle. Complex system testing for a full suite of applications is
often not feasible until closer to the end of development; however, it is possible to
test individual APIs, endpoints, or even parts of the UI often and early in an automated
manner. One of the main benefits of doing this type of automated performance testing
is to build up a baseline understanding of how the system performs. Any code changes
that alter the system performance characteristics often go unnoticed if this type of
testing is not being automated and you end up trying to firefight performance issues
right before a go-live.
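A lightweight way to start is to hit a single endpoint with a fixed load from the pipeline
and record the results as a baseline. The sketch below uses ApacheBench (ab) as one of
many possible tools; the URL and request numbers are assumptions:
$ ab -n 1000 -c 50 https://pet-battle-api.apps-crc.testing/cats/
# -n 1000 sends one thousand requests in total, -c 50 keeps fifty in flight
# at a time; the report includes mean latency and requests per second.
Running the same command on every build and comparing the numbers against the
previous baseline is usually enough to spot a change that quietly degrades performance.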
Disruptions are a fact of life in all systems, OpenShift included. There are a host of
Kubernetes primitives that will help our business services stay up and available for
customers (including replication controllers, rolling deployments, health checks, pod
disruption budgets, horizontal pod autoscalers, and cluster autoscalers). But even with
the best infrastructure, failures can still occur. Cloud service disruptions, hardware
failures, resource exhaustion, and misconfigurations can still threaten the business
service.
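Of the primitives just mentioned, a pod disruption budget is one of the simplest to put in
place. The sketch below (the label and minimum count are assumptions) tells the platform
to keep at least one API pod running during voluntary disruptions such as node drains
during a cluster upgrade:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pet-battle-api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: pet-battle-api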
Measuring Culture
There is a link between metrics and transformation. Measurements help inform
the practices that are used. Gathering CI/CD metrics, for example, the time that
builds and deployments take for the PetBattle apps, allows the team to adopt and
develop streamlined CI/CD practices so they can release quickly, more often, and
with confidence that failures will be rare. As the team becomes comfortable releasing
software into production regularly, the business starts to trust that releases can
be performed with little risk to end users. This leads to the faster releasing of new
features, a tighter, shorter feedback loop, and ultimately, a culture that encourages
rapid change and innovation. Now we have really started to unlock transformational
change for the company:
Operational metrics often measure failure in very complex adaptive systems. When
things fail, an authoritarian, blame-oriented culture will look to find the "human error" and
assign blame. This type of response to failure is not only unhelpful but should be considered
actively harmful.
The goal of any failure analysis should be to discover how we can improve information
flow so that people have better and more timely information, or to find better tools to
help prevent catastrophic failures following apparently normal operations.
There are lots of possible stability metrics the team could easily measure
and alert on if they go over a certain threshold.
upward from layer to layer in a gradual form of cascading failure. When the website
becomes slow, users tend to hit the Refresh button more often, causing more and more
traffic. If we give our system the ability to monitor its own performance (in other
words, it becomes observable), then the system can also tell the team when it isn't
meeting its service level agreements. Some examples include the following:
• The number of users and active sessions
• Blocked threads in the API layer
• Out of memory events in the API or database layer
• Slow responses in the user interface
• High database CPU
OpenShift comes with a cost management and metering application that can be used
to show infrastructure usage. You deploy the Cost Management Metrics Operator in
OpenShift and the reporting and APIs are delivered as part of a SaaS solution from
cloud.redhat.com.
It allows the PetBattle team to do the following:
• Visualize, understand, and analyze the use of resources and costs.
• Forecast their future consumption and compare it with budgets.
• Optimize resources and consumption.
• Identify patterns of usage that should be investigated.
• Integrate with third-party tools that can benefit from cost and resourcing data.
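Installing the operator is normally a couple of clicks in OperatorHub, or the equivalent
Subscription resource applied from the command line. The sketch below shows the
general shape; the package name, namespace, and channel are assumptions, so check
OperatorHub on your cluster for the current values:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: costmanagement-metrics-operator
  namespace: costmanagement-metrics-operator
spec:
  channel: stable
  name: costmanagement-metrics-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace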
There are a lot of visualizations and dashboards available. Figure 13.11 shows the
overview dashboard in a large demo environment as an example. It is possible to track
cost and usage at both an infrastructure and business level. Users can tag projects and
applications to gain detailed breakdowns as well as historical trends. The dashboards
can help answer common questions, for example:
• Show me the top project and top clusters by usage and cost:
• Which projects are costing me the most?
• Which clusters are costing me the most?
• Show how metrics contribute to the costs:
• What is driving costs? CPU, memory, storage?
• What are the predicted costs for the next calendar month?
It is possible to change between accumulated and daily costs, as well as filter and drill
down across clusters, clouds, and projects:
You can check out the cost management product documentation,6 which has a lot more
details about this service, including common configuration options across a hybrid
cloud.
6 https://access.redhat.com/documentation/en-us/cost_management_service/2021
The USE method can be summarized as: for every resource, monitor Utilization,
Saturation, and Errors. Errors should be investigated because they can degrade
performance, and may not be immediately noticed when the failure mode is recoverable.
This includes operations that fail and are retried, and devices from a pool of redundant
devices that fail:
We can map the USE metrics against common resources in a tabulated form, as shown
in Figure 13.12, allowing for quick identification of the types of issues that are occurring
in our system. The OpenShift metrics stack supports pre-configured USE method
dashboards at the cluster and node level:
Similarly, the RED method can be summarized as for every service, monitor the request:
• Rate: The number of requests per second
• Errors: The number of requests that fail
• Duration: The amount of time requests take to complete
The RED method is a good baseline that can be applied to most request-based services.
It reduces the cognitive load for SREs; in other words, they can think the same way
about a large range of supported services for baseline metrics. The RED method does
break down for batch-oriented or streaming services. In the Google SRE Book, the
original "Four Golden Signals" included the RED and Saturation metrics.
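As a sketch of what the RED method looks like in Prometheus terms, the recording rules
below pre-compute rate, errors, and duration for a service. The metric
http_server_requests_seconds (with its _count and _bucket series) is a hypothetical
instrumented metric; substitute whatever your framework actually exposes:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: red-method
spec:
  groups:
    - name: red.rules
      rules:
        - record: service:request_rate:5m
          expr: sum(rate(http_server_requests_seconds_count[5m]))
        - record: service:error_rate:5m
          expr: 'sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))'
        - record: service:request_duration_p95:5m
          expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))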
We have shown in this section many different levels of quantitative analysis we can
perform using metrics. Let's now see how these translate to our focus on outcomes.
Visualize Measurable Outcomes
When people who are not part of the core team come into the team space, information
that is radiated on the walls can instantly be read. Stakeholders and people interested
in the team's work can immediately gain a better understanding and awareness of the
work that is going on. This activity is often referred to as walking the walls. Interested
parties can inspect artifacts on the walls and have a conversation with team members
about them. This is a hugely different experience from when information is hidden away
in a system behind a login.
Proactive Notification
So, what happens when things start to go wrong? What happens when the system itself
cannot resolve an issue by automatically restarting a pod, or scaling up cluster nodes?
Enter the realm of Alerts.
Alerting can take many forms. It could be that horrible text at 03:00 when things really
start going wrong, or a more subtle alert, such as a Slack message to say something
has updated successfully. The key here is that the information is being pushed, and not
pulled. OpenShift has alerting capabilities built into it and we'll explore this in more
detail in Chapter 16, Own It.
The classic forms of alerting are when there are spikes in memory usage in applications.
This could lead to an application failing or constantly having to restart. In these
instances, the team might spot the spikes on their dashboards and go and investigate
the issues. We can, of course, make this feedback loop even shorter by combining data
from different sources and alert on that. For example, if our application memory spikes,
if we could capture the logs from around that time and push both events to the team,
it could help diagnose the problems quicker. The real power of smart notifications is
being able to respond even more quickly to the event.
Alerting the Development teams to things that have broken is extremely important.
Notifications can come from all layers of the system; it doesn't just need to be that
terrible call in the dead of night to say the website is down! Whenever a job runs to
build our code or deploy a new version of an application, sending a quick alert to the
team's instant messaging software is a good way to notify the concerned stakeholders.
If the information is timely, then we can respond more effectively. This could mean
pulling the Andon cord and halting the production line while we gather together to fix
the issue.
Failing the build is sometimes seen as a cursed state. We try to encourage
a no-blame culture, but sometimes it can be fun to have a bit of a
blame game. I am a fan of this one for sure! Whenever you fail the build,
you're deemed to be the Cowboy Coder, the one who rides into town and
shoots from the hip without a care for the consequences. Or maybe you just
left a typo in your code!
Either way, if the dashboard turns red, then you have to dress like the
cowboy you are. This team took it one step further: not only did you have
to wear the pink cowboy hat until you had fixed the problem, but you also
had to ride around on the wooden hobby horse! Even if you had to go to the
bathroom or get a coffee, the horse and hat went with you! You'd be amazed
at the funny looks you get on the way to the canteen wearing this attire:
More silly examples come from a project I was working on years back. This
was in the days prior to using a container platform, and we had several
manually configured VMs that were critical for us when releasing software.
It was a massive project, with seven Scrum teams globally distributed. We
were building a suite of 50 product microservices so our build farm was
always busy!
On one occasion, we had to get the IT company that was managing our
infrastructure to roll back to one of the previous backups, as someone
had executed a large number of commands as root and broken several
things for Jenkins. Raising a sev1 ticket still took a few days to resolve! We
couldn't let this kind of thing happen again as it was a massive drain on the
team's morale and productivity. So, we decided to get inventive. We were
using Slack as our messaging client and knew you could send messages to
channels via a webhook. We also knew that if anyone logged into a machine,
we could execute a bash script. Tying these items together, we created the
Kenny Loggins channel in our Slack instance...because when you log into a
server as root, you're in the DANGER ZONE!
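For anyone wanting to copy the idea, the mechanics are very small. The sketch below
(the webhook URL is obviously a placeholder) could be dropped into a profile script such
as /etc/profile.d/danger-zone.sh so that every root login posts a message to the channel:
#!/usr/bin/env bash
# Post a notification to a Slack incoming webhook whenever root logs in.
WEBHOOK_URL="https://hooks.slack.com/services/REPLACE/ME/PLEASE"
if [ "$(id -u)" -eq 0 ]; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"Highway to the DANGER ZONE: root just logged in to $(hostname)\"}" \
    "${WEBHOOK_URL}" > /dev/null
fi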
This section has shown many different ways we can visualize outcomes and use metrics
to trigger proactive notifications to learn faster. Let's see how this can be summarized
with everything else we've learned in our Delivery Loop.
Creating a Delivery Map
As we come out of the Delivery Loop and return to the Options Pivot in Section 7,
Improve It, Sustain It, we will complete the final section of this map by asking, What are
our next steps?
Now let's look at PetBattle's Delivery Map at the end of their first iteration of the
Delivery Loop.
The Delivery Map provides a powerful summary of the journey we've taken around the
Delivery Loop. Like all other artifacts, it is a living and breathing summary and should
be revisited regularly and updated after every subsequent iteration.
Conclusion
We have now completed our journey around the Mobius Loop. In this chapter, we have
focused on the measurements and learning we can take away from the features we
launch, the experiments we run, and the research we conduct. A relentless focus on
measurement enables us to make more concrete decisions that are backed by metrics-
based evidence.
The Showcase and Retrospective events that are often run by Scrum and other Agile
teams provide ample opportunity to showcase metrics and highlight learnings. We take
this opportunity to re-examine the experiments that we designed on the Options Pivot
and investigate what actually happened. That often involves looking at the analytics
provided by the advanced deployment capabilities offered by the platform – the results
of A/B tests, canary launches, feature toggles, and dark launches.
We also highlighted the importance of running usability tests with the full team
involved, while being connected directly to end users to develop further empathy and
see them testing the evolving application. Guerrilla testing also provides a low-cost and
simple way to gather learning from users:
We explored the many different metrics made available by the platform, the software,
and our teams. Service Delivery and Operational Performance metrics popularized
by DORA and Accelerate, and made available by open source tools such as Pelorus,
provide leading indicators of the success of DevOps culture and practice. These can
be supported by further metrics captured about security, performance, culture, the
application itself, and the infrastructure. The importance of radiating these, in real
time, in a very open and transparent manner cannot be overstated, nor can the importance
of putting the behaviors and practices in place to be reactive and responsive to changes in
metrics.
As we conclude Section 5, Deliver It, we can see just how many practices have allowed
us to navigate the Mobius Loop, on top of our foundation of culture and technology:
While we have completed one revolution around the Mobius Loop, we have not
completed the journey. We will never complete the journey until the whole product is
turned off and decommissioned. This is because the Mobius Loop is infinite and will
never end. As we come out of the Delivery Loop, we return to the Options Pivot. We will
do this in Chapter 17, Improve It, when we explore the insights from our trip around the
Loop and ask what we have learned, followed by what we are to do next.
Before that, we are going to spend a few chapters diving a bit deeper into the technical
solution. We have already started that in this chapter. In Chapter 14, Build It, we will
look at other aspects of how we build the solution. In Chapter 15, Run It, we'll focus
on running the solution. In Chapter 16, Own It, we'll explore what it means to own
the solution. These three chapters form Section 6 of our book and are all about how
product teams Build It, Run It, Own It.
Section 6: Build It,
Run It, Own It
In previous sections, we've been discussing the approach we're taking in discovering
and prioritizing work to deliver applications such as PetBattle. This includes the many
aspects we need to consider when building the solution's components. Now it's time to
actually deliver working software:
Once our master cat plan is complete, we will become filthy rich in the process!
As PetBattle investors, we can then buy islands, invest in space engineering, or
just sip martinis by the pool all day until it becomes so boring that we start a new
cryptocurrency named Pet-Coin and lose it all.
To set expectations correctly, we're not really going into detail about all of the
application code itself. All the source code is available (in the Git repositories for the
book) and you can go through it at your own pace. There is a lot of value in going through
the code, in particular the Helm, Quarkus, and AI code.
Our focus is on how to use the tooling provided within OpenShift to build, deploy, and
run the applications. We will go into certain advanced features, such as Operators,
Serverless, Service Mesh, and CI/CD tooling examples and techniques. Our intention
is to provide examples and advice around why we choose to use certain tools or
techniques and approaches and how all of them fit together in a cohesive approach
specific to PetBattle. You can then pick and choose which ones you want to use in your
own projects.
The section is broken down into three parts:
1. Chapter 14, Build It: This is where we introduce how we will use Git as the single
source of truth. We will cover taking our source code and packaging it using either
Tekton or Jenkins.
2. Chapter 15, Run It: This chapter covers testing, introducing a new component to our
app with Knative, running A/B tests, and capturing user metrics using some of the
advanced deployment capabilities within OpenShift.
3. Chapter 16, Own It: This chapter covers keeping the lights on with monitoring and
alerting feedback loops. We will also touch upon the importance of Operators in
Kubernetes.
Throughout these chapters, we cover a core belief that all teams should want to take
pride in their code. A team member should want to Build It, Run It, and ultimately
Own It.
If your code has failed the build because you forgot to check in a dependency, or if
your tests are not passing, then it is your responsibility to fix it! The purpose of the
deployment pipeline is to create a repeatable process that our code will pass through
and accelerate releases while also de-risking them. If we know that on every commit
to a repository all of our tests are executed, we're going to have a large amount of
confidence that the resulting application will work well. If we're continuously increasing
the test volume as the application's complexity increases, that too should grow our
confidence. It's critical for teams to want to own their software pipelines.
Having a platform such as OpenShift is a bit like the Beatles singing on the rooftop of
the Apple Corps building on Savile Row about people coming together. Developers,
operations, testers, architects, designers, database administrators, and analysts all
coming together and using a platform like OpenShift gives everyone a shared space to
collaborate in. Building applications and business services on a platform where developers
can self-service all of their requirements in a safe, access-controlled manner brings down
the walls between teams and removes bottlenecks such as having to wait for permission
to deploy an application. This gets everyone speaking the same language to deliver
business outcomes through modern application delivery and technological solutions.
Cluster Resources
This section of the book will be one of the most technical. As described in the
Appendix, the minimum requirements for running the code examples using CodeReady
Containers (CRC) in this chapter are as follows:
With the amount of memory required to follow along noted, and the technical content
out of the way, let's dive into things in more detail. We'll start by looking over the components of
the existing PetBattle applications as they move from a hobby weekend project into a
highly available, production-based setup that is built and maintained by a strong, cross-
functional team.
Back in Chapter 9, Discovering the How, we went through an Event Storming exercise
that helped drive a newly proposed architecture. It consisted of a UI component and
a backing service that provided different REST-based APIs to the UI, as shown in
Figure 14.2:
PetBattle Components
In the coming chapters, we will explore automation, testing, and the extension of
PetBattle to include aspects such as monitoring and alerting, Knative Serving, and
Service Mesh. But first, let's imagine that the PetBattle team has completed a few
sprints of development. They have been building assets from the Event Storms and now
have the components and architecture as seen in Figure 14.3. Through the Event Storm,
we also identified a need for authentication to manage users. The tool of choice for that
aspect was Keycloak.
User Interface
The UI is written in Angular1 v12, a complete JavaScript framework from Google for
building web and mobile applications. The application is transpiled and the static site
code is then served from an Nginx (web server) container image provided
by Red Hat. The application is set up to pull its configuration on startup, which sets
up endpoints for all of the dependent services, such as Keycloak and the APIs. This
configuration is managed as a ConfigMap in OpenShift.
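The shape of that ConfigMap is roughly as follows. This is a simplified sketch rather than
the chart's exact contents: the data key, endpoint names, and Keycloak settings shown
here are assumptions, and in the real chart the values are templated from values.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: pet-battle-ui-config
data:
  config.json: |
    {
      "catsUrl": "https://pet-battle-api.apps-crc.testing",
      "tournamentsUrl": "https://pet-battle-tournament.apps-crc.testing",
      "keycloak": {
        "url": "https://keycloak.apps-crc.testing/auth",
        "realm": "pbrealm",
        "clientId": "pbclient"
      }
    }
Because the UI reads this at startup, pointing the frontend at a different set of backends is
just a matter of changing the ConfigMap and restarting the pod, with no image rebuild
required.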
Pet Service
The Pet service is a straightforward service that uses Java Quarkus2 as the framework,
backed by a MongoDB database to retrieve and store details of the pets uploaded to
partake in a tournament.
Tournament Service
The Tournament service also uses the Quarkus framework and stores the state in both
MongoDB and an Infinispan distributed cache. MongoDB is used to store the details of
the tournament such as which pet won the tournament—but why did we use a cache?
Well, the answer is that a tournament only exists for a finite period of time and using
a database to store temporal data is not a great fit for our use case. Also, Infinispan
stores the cache data in memory, which is much faster to access than data on disk.
The drawback of this is that if the Infinispan pod dies/crashes, then the data is lost.
However, we plan to circumvent this in production by having at least two replicas, with
the data being replicated between the pods.
User Management
User management, authentication, and access control are a few other critical parts
of the architecture that need to be addressed. We're using Keycloak,3 an open source
identity and access management tool, to provide this functionality. We could have
written some code ourselves for this functionality, but security is an area that requires
a lot of expertise to get it right, and Keycloak does a great job of using open standards
to do this job correctly.
1 https://angular.io/
2 https://quarkus.io/
3 https://www.keycloak.org/
Plan of Attack
Initially, we are going to get the core PetBattle application components and services
up and running on OpenShift in a fairly manual way. We want to be able to develop
locally, adding new functionality to show how easy it is to combine Helm and OpenShift
to repeatedly deploy our code. Once that is completed, we are going to automate the
setup and deployment process using various tools, including Tekton/Jenkins, Argo CD,
and GitOps. We will explore how to add new components to our architecture using
Knative and experiment with some of the more advanced deployment capabilities that
we can utilize. Finally, in Chapter 16, Own It, we will look at application monitoring
and alerting along with Service Mesh for traceability. Figure 14.4 shows the additional
components added to the architecture, including the Knative Cat Detector Service
being proxied via the Service Mesh.
We will be using the command line as much as possible to show and explain the
commands involved. Each step can also be performed via the OpenShift web console.
If you're new to OpenShift, the web console is a great place to get started as it's full of
tips and tutorials!
Running PetBattle
In Chapter 6, Open Technical Practices – Beginnings, Starting Right, we talked about
Helm and its use as an application lifecycle manager for installing, upgrading, and
rolling back application deployments. We are going to start with the command line,
but you can skip to the end of this section if you would like to follow the web console
method. If you need help installing the Helm command-line tool, take a look at
Chapter 6 as a refresher. Now let's see how we can easily deploy the PetBattle suite of
applications as Helm charts into a single project on OpenShift. On your terminal, add
the PetBattle Helm repositories:
$ helm repo add petbattle \
https://petbattle.github.io/helm-charts
There are three main applications that make up PetBattle and are searchable in the
Helm repository:
Name                     Description
pet-battle               PetBattle frontend – Angular app deployed on Nginx
pet-battle-api           PetBattle API – Cats API that stores our uploaded images in MongoDB
pet-battle-tournament    PetBattle Tournament – Service for managing and running each weekly competition
Table 14.2: The three main applications making up PetBattle
Name                     Description
pet-battle-infra         PetBattle infrastructure – Chart contains Keycloak, Grafana, and alerting components
pet-battle-nsff          PetBattle NSFF components; includes the machine learning algorithm and APIs
Table 14.3: The infrastructure and NSFF Helm charts
We can search for the latest versions of these charts using the following command:
$ helm search repo pet-battle
Let's now deploy the main PetBattle application into our OpenShift cluster. We need
to update a local copy of the PetBattle frontend's Helm values.yaml file to match our
cluster URLs. This is needed to connect the frontend when deployed to the correct
collection of backend services. We can provide these values to our Helm charts when
deploying the suite of PetBattle applications. Let's download an example of the values.
yaml file for us to edit:
$ wget https://raw.githubusercontent.com/petbattle/pet-battle/master/chart/values.yaml \
    -O /tmp/values.yaml
Open the values.yaml file and replace the five URLs listed in the config_map to match
your OpenShift cluster (for example, change the apps.cluster.com domain to
apps-crc.testing if you are using CRC). For example:
# custom end point injected by config map
config_map: '{
  "catsUrl": "https://pet-battle-api-petbattle.apps.cluster.com",
  "tournamentsUrl": "https://pet-battle-tournament-petbattle.apps.cluster.com",
  "matomoUrl": "https://matomo-labs-ci-cd.apps.cluster.com/",
  "keycloak": {
    "url": "https://keycloak-petbattle.apps.cluster.com/auth/",
    "realm": "pbrealm",
    "clientId": "pbclient",
    "redirectUri": "https://pet-battle-petbattle.apps.cluster.com/*",
    "enableLogging": true
  }
}'
Gather the latest chart version for each of the PetBattle applications from the
preceding Helm search command and install the three applications pet-battle,
pet-battle-api, and pet-battle-tournament into your cluster. To do this, you will
need to be logged in to your OpenShift cluster. For example:
# Login to OpenShift
$ oc login -u <username> --server=<server api url>
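A rough sketch of what those installs could look like is shown below; the target project
name and flags are illustrative, so substitute the chart versions returned by helm search
and your own values:
# create a project to hold the PetBattle suite
$ oc new-project petbattle

# install the API and Tournament services (chart versions are examples)
$ helm upgrade --install pet-battle-api petbattle/pet-battle-api \
    --version <chart version> --namespace petbattle
$ helm upgrade --install pet-battle-tournament petbattle/pet-battle-tournament \
    --version <chart version> --namespace petbattle

# install the frontend using the values.yaml we edited above
$ helm upgrade --install pet-battle petbattle/pet-battle \
    --version <chart version> -f /tmp/values.yaml --namespace petbattle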
Using helm list should give you a list of the installed charts. You should see the
following pods running in your petbattle project. An example is shown in Figure 14.5:
The Tournament service will take several minutes to deploy and stabilize. This is
because its dependent infrastructure chart is deploying operator subscriptions for
Keycloak, Infinispan, and Grafana. Navigate to the OpenShift web console and you
should now be able to explore the PetBattle application suite as shown in Figure 14.6.
Browse to the PetBattle frontend to play with the applications.
Figure 14.6: PetBattle Helm charts deployed in the OpenShift Developer view
You have now been shown how to install PetBattle Helm charts using the command
line—some may say the hard way! We are now going to demonstrate some of the
integrated features of Helm in OpenShift—some may say the easier way! We can create
a HelmChartRepository Custom Resource object that points to our PetBattle Helm
chart repository; think of it as helm repo add for OpenShift. Run this command to
install the chart repository:
cat <<EOF | oc apply -f -
apiVersion: helm.openshift.io/v1beta1
kind: HelmChartRepository
metadata:
  name: petbattle-charts
spec:
  name: petbattle
  connectionConfig:
    url: https://petbattle.github.io/helm-charts
EOF
With this in place, we can browse to the Developer view in OpenShift and select Add
Helm Charts, and see a menu and a form-driven approach to installing our Helm
charts—just select a chart and install it:
This can provide a great developer experience for teams sharing services with each
other. A backend team can publish a new app chart to the repository, and a downstream
team can deploy it to their development environment in a single click. In fact, if you
add a Helm values schema file, OpenShift will build a What You See Is What You Get
(WYSIWYG) form for easy configuration of the values file.
Argo CD
When we established our foundations in Section 2, Establishing the Foundation, we
bootstrapped all of our builds, deployment, and tooling using Helm and Argo CD. We
made some opinionated choices when running that bootstrap automation and it's
worth discussing some of the trade-offs we made in a bit more detail. We followed our
call to action when establishing our technical foundation: we planned out what worked
for us as the PetBattle product team and regularly reviewed and discussed what was and
was not working so well.
It turned out that for our development team, bootstrapping all of the CI/CD tools
became an extremely important task. We had been given an arbitrary (but necessary)
constraint that our development OpenShift cluster needed to be rebuilt from scratch
every two weeks. So we needed to be confident that our CI and CD could stand
up quickly and repeatedly. By following our everything-as-code practice, all of our
OpenShift infrastructure definitions, CI/CD tooling, and pipeline definitions are stored
in Git. The declared CI/CD tooling state is continuously synced to our development
cluster by Argo CD, so updating the SonarQube Helm chart version, for example, is as
easy as changing one line and pushing it to Git. The change is synchronized and rolled
out to our cluster a minute later.
Being able to effectively lifecycle-manage all of the supporting tools involved with
building your applications takes effort and attention to detail, but it is worth it in the
long run, as you will have built a system that can handle change easily and repeatedly.
We have optimized our application lifecycle around the cost of change, making the cost
(in man hours) as small as possible. Human time is our biggest resource cost after all!
The versions of all our tooling are checked into Git using Semantic Versioning4
(SemVer). There is a good lesson to be learned here in terms of version control;
nearly all modern open source software uses this pattern. You can often be
surprised by the resulting deployment when a chart or operator references the
latest images from external sources: the latest tag is a moving target and is updated
often. Referencing tags for your versions in Git is like walking a tightrope; there is a
balancing act between wanting to lock down your toolchain and wanting to update it
for bug and security fixes, hence the use of tags and knowing with confidence that a
specific version is working. Normally in SemVer, MAJOR.MINOR tags are moving tags
that are updated with small bug fixes and security patches, while MAJOR.MINOR.PATCH
tags specify a fixed version that should not move (ideally!). Choose a strategy that does
not incur so much technical debt that it strands the team on old and unsupported
versions forever, balanced against not having to constantly bump version numbers.
Of course, if you have optimized for a small cost of change through automation,
this problem of changing versions becomes much less of an issue!
We have chosen a push (CI) and pull (CD) model for our software delivery lifecycle. The
job of building images and artifacts (Helm charts and configuration), as well as unit and
integration testing, is part of a push CI model. On every code commit, a build pipeline
trigger (Tekton or Jenkins) fires. It is the job of the Argo CD controller to keep what we
have deployed in our OpenShift cluster in sync with the declared application state in
our Git repositories. This is a GitOps pull model for CD. The key thing here is that Git is
the single source of truth and everything can be recreated from this source.
Figure 14.8: A GitOps push and pull model for continuous delivery
4 https://semver.org/
The main benefit we see with this approach is that it is developer-centric. Any change
in the code base triggers a pipeline build and deployment. This gives the team fast
feedback for any breakages, since automated tests are always run against the new code.
The pull CD model decouples the synchronous nature of a build and testing pipeline.
Built artifacts (container images and configuration) can be built once, then tagged
and promoted through a lifecycle, all of which is controlled from Git. This is great for
auditability and discovering who changed what and when. We can easily trace code
committed and pushed with builds, tests, and deployments. It is also a flexible approach
in that not all artifacts need to be built per se. Configuration can be changed and
deployed using the same model. The model is also very flexible in its ability to support
different development workflow models. For example, Gitflow and Trunk-based
development can easily be catered for, depending on how the team chooses to work.
This helped us to clarify where the responsibilities lie between our Git branches, our
continuous integration tasks, and our continuous delivery tasks, and in which OpenShift
projects each of these would occur.
Figure 14.11: Application packaging, Helm, and Argo CD with the app-of-apps pattern
5 https://trunkbaseddevelopment.com
One strict view of GitOps is that templating manifests (mutating state at deployment
time) is not as pure as simply checking in the rendered templates with the values already
filled in. Kustomize, for example, has no templating and follows this approach. We use
Kustomize for deploying our CI/CD automation with Argo CD because we think it fits
that use case better. This means that we are less likely to have a large number of CI/CD
environments for our PetBattle product; at the moment there is just one.
The trade-off here is that while we use GitOps to synchronize the Helm chart itself, the
application values may be supplied from multiple places, so you have to be careful
to understand where values are overridden and what their precedence is:
• Lowest precedence: the values.yaml provided with the chart (or its sub-chart
dependencies); these are kept in sync by the Argo CD controller.
• Higher precedence: override values specified when Argo CD creates the
Helm application. These behave like helm template --set on the command line
and take precedence over values files. They can be specified in a template or a
trigger, depending on how the pipeline is run.
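As a minimal sketch of how such overrides are expressed (the chart version, namespaces,
and parameter shown here are illustrative rather than lifted from the PetBattle
repositories), an Argo CD Application can layer parameters on top of the chart's
values.yaml like this:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: pet-battle-stage
  namespace: labs-ci-cd
spec:
  project: pet-battle-stage
  destination:
    namespace: labs-staging
    server: https://kubernetes.default.svc
  source:
    repoURL: https://petbattle.github.io/helm-charts   # Helm chart repository
    chart: pet-battle
    targetRevision: 1.0.6                               # chart version
    helm:
      parameters:
        # higher precedence than the values.yaml shipped with the chart
        - name: image_version
          value: "1.0.1"
  syncPolicy:
    automated:
      prune: true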
We deploy each of our applications using an Argo CD application definition. We use one
Argo CD application definition for every environment in which we wish to deploy the
application. This is the red box depicted in Figure 14.11. We make use of Argo CD with
the app-of-apps pattern6 to bundle these all up; some might call this an application
suite! In PetBattle we generate the app-of-apps definitions using a Helm chart. This is
the third, outer green box in Figure 14.11. The configuration for this outer box is kept in
a separate Git repository to our application.
The app-of-apps pattern is where we declaratively specify one Argo CD app that
consists only of other apps. In our case, this is the pet-battle-suite application.
We have chosen to put all of our applications that are built from the main/master
branch under this pet-battle-suite umbrella. We have a PetBattle suite for each of the
test and staging environments. Figure 14.12 shows the app-of-apps for the staging environment:
6 https://argoproj.github.io/argo-cd/operator-manual/cluster-bootstrapping/#app-of-apps-pattern
The Argo CD sync policy for our applications is set to automated + prune, so that child
apps are automatically created, synced, and deleted when the manifest is changed. You
can change or disable this if you need to. We also configure a webhook against the CI/
CD Git repository so that any changes trigger Argo CD to sync all applications; this
avoids having to wait for the three-minute sync cycle when CI/CD code changes.
The Git revision can be set to a specific Git commit Secure Hash Algorithm (SHA) for
each child application. A Git SHA is a unique 40-character code computed for every
commit to the repository, and is therefore not movable, unlike a tag. This ensures that
even if the child app's repository changes, the app will only change when the parent
app changes that revision. Alternatively, you can set it to HEAD/master/main or a
branch name to keep in sync with that particular branch. The closer you are to
production environments, the better an idea it is to use Git commit SHAs to manage your
application versions. Pinning to an exact version for production ensures easier traceability
when things go wrong. The structure here is flexible to suit your product team's needs.
Figure 14.13: The Big Picture including the tools the team thinks they will use
As a quick reminder, our Big Picture from Section 2, Establishing the Foundation,
identified all the tools we might use as depicted in Figure 14.13, which include:
• Jenkins: To automate the building and testing of our software
• Nexus: To host our binaries and Helm charts
• Argo CD: To manage our deployments
• SonarQube: To assess our code quality
• Zalenium: For automated browser testing
Now that the tools are in place, let's think about the stages our code should move
through on its way to being deployed. A team should start small: what is the minimum
amount of automation we need to get code compiling and deployed? It's very important
for teams to start with a basic end-to-end flow of their code; otherwise, things become
messy quite quickly, leading to unnecessary complexity and potentially not delivering
anything. It also matters because the feedback loop we are creating needs to be fast.
We don't want a brilliant, complex process that thinks of everything but takes hours to
run; that is not the kind of feedback loop we're trying to create.
We always use three simple stages: Build > Bake > Deploy. A good pattern for engineers
is to keep the definition of their pipeline abstract so they can get greater reuse of the
pattern across any of their apps, irrespective of the technology they use. Each stage
should have a well-defined interface with inputs and outputs. Reusing a pipeline
definition in this way can lower the cost of context switching when moving between
backend and frontend. With this in mind, we can define the stages of our build in the
following manner.
The Build
Input: The code base
Output: A "compiled" and unit-tested software artifact
The build should always take our source code, compile it, and run some unit tests
before producing some kind of artifact that will be stored in Nexus. By defining
the interface of the build process this lightly, we can substitute the implementation
for each technology or application type. So, for example, when building the PetBattle
frontend Angular app, we will use the Node Package Manager (npm) to complete these
steps within the Build stage, but a Java application would likely use Gradle or Maven to
achieve the same effect. The hardest work happens in this stage, and it usually has the
highest dependency on the framework or language being used. We will see that in the
later stages the original technology becomes less important, so more of the pipeline
code can be reused.
The Bake
Input: A "compiled" software artifact
Output: A tagged container image
This is the act of taking our software artifact that was created as an output in the
previous step and packaging it into a box, that is, a Linux Container Image. This image
is then tagged and stored in a container registry, either one built into OpenShift or an
external one. In OpenShift there are many different ways we can achieve this, such as
using Source-to-Image (S2I), a binary build, or providing a Containerfile/Dockerfile.
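For illustration, each of those approaches maps onto an oc command along the
following lines; the application and builder image names are placeholders, not
PetBattle specifics:
# Source-to-Image: build directly from a Git repository using a builder image
$ oc new-build nodejs~https://github.com/<your-org>/<your-app>.git --name=my-app

# Binary build: create the BuildConfig once, then stream a locally built artifact into it
$ oc new-build nodejs --binary=true --name=my-app
$ oc start-build my-app --from-dir=./dist --follow

# Containerfile/Dockerfile: build using the Containerfile kept in the repository
$ oc new-build https://github.com/<your-org>/<your-app>.git --strategy=docker --name=my-app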
The Deploy
Input: A tagged image
Output: A running app in a given environment
Take the image that has just been pushed to the registry and deploy it along with any
other services or configuration required for it to run. Our applications will be packaged
as Helm charts, so the deployment will likely have to patch the image referenced in
our app's chart. We want our pipeline to support multiple workflows. For feature
development, we can just helm install into the development namespace. But for
release candidates, we should be committing new release information to Git for it to
trigger the rollout of changes. The implementation of this workflow is the responsibility
of the steps, the lower level of what is being executed. The abstract view of a Deploy
should result in a verified app deployed on our cluster (and ultimately promoted all the
way to production).
The team captures these stages for the applications they're building by adding some
nice doodles to their Big Picture. Next, they begin thinking about promoting the
application across environments from test to production. When building applications
in containers, we want to ensure the app can run in any environment, so controlling
application configuration separately is vital. The team will not want to rebuild the
application to target different environments either, so once an image is baked and
deployed it needs to be verified before promotion. Let's explore these stages further.
System Test
Input: The app name and version under test
Output: A successful test report and verified app
Figure 14.17: The SYSTEM TEST component from the Big Picture
Drive user behavior within the application via the frontend to verify whether the
app is behaving as expected. If all the connected parts that make up the application are
behaving as expected (the microservices, authentication, and frontend) then the app
can be signed off and will not need to be rebuilt. Our system test cases for PetBattle will
be the acceptance criteria the team has agreed upon. Because of this, we can sign off
the application as ready for real-world users. Any component that has changed in the
stack should trigger this stage; it is not just the responsibility of the frontend.
Promote
Input: A verified image name and version
Output: Running app in production environment
With the application working as expected (based on our passing system test cases), we
can now promote the images that make up our app to the new environment, along with
their configuration. Of course, in the world of GitOps, this is not a manual rollout of a
new deployment but committing the new version and any custom configuration to our
configuration repositories, where they will be picked up by Argo CD and deployed.
Figure 14.19: The Big Picture including all the stages of the pipeline in place
In Figure 14.19, we can see the Big Picture with the stages of the pipeline drawn in.
Now that the team knows the stages that their software will pass through on the way
across the cluster, they can fill in the lower-level details, the steps. At this stage, the
team is looking to see how they can build common pipeline steps, irrespective of the
technology they're using. This will provide greater reuse across their software stack
but, more importantly, reduce the cognitive load for engineers writing software in
multiple technologies. For this, it's a good idea to put the technology being used on the
Big Picture. In PetBattle's case, it is Angular and Quarkus (Node.js and Maven for the
build tools). They use a new color of sticky note to write the steps that each service will
go through in order to fulfill the interface defined at each stage.
In Figure 14.20, we detail what these steps could look like for the Build stage of our
pipeline. First, we install the application dependencies. Following this, we test, lint, and
compile the code. Finally, we store the successful artifacts in the Nexus repository to
use in the next stage, the Bake.
Figure 14.20: The Build stage and the breakdown of its steps
The team continues to flesh out the steps across all the stages. Finally, they add some
example containers deployed to each namespace at each stage to give a view of all the
components deployed for the PetBattle system to work. This is detailed in Figure 14.21:
Figure 14.21: The complete Big Picture for our software delivery process
The Big Picture is a helpful practice for getting team alignment on what's in our
toolchain and how we use it. It can be a great thing to play back to non-technical team
members too, giving them an idea of the complexity and usefulness of being able to
repeatedly build and test code. As with all our practices, it's also never done; when a
new tool enters our toolchain or we add a new stage in our pipeline, we add it to the
Big Picture first. It is the living and breathing documentation of our software delivery
process. With the Big Picture complete for now, let's move on to implementing the
components it describes.
Jenkins is the build tool of choice for lots of companies and developers alike. It has
been around for some time and it has its set of quirks for sure. It was never intended
to be deployed as a container when it was first conceived. In order to keep things
current and have an eye on the future, we've decided to write the code for the Big
Picture using both Tekton and Jenkins. Both can easily be tweaked for both frontend
and backend development, but for the purposes of this book we will use Jenkins to
automate the parts of the Big Picture for our Angular application. For the API, written in
Java (Quarkus), we will use Tekton, and go through setting up the same things in a more
Kubernetes native way. Both paths are available for the avid reader to play with and get
working, but we'll split the narrative this way for illustrative purposes.
So, like you would in a choose your own adventure book, you can now pick the path that
you would like to follow next. If you're not interested in Jenkins automation, then skip
ahead to the Tekton section directly. The code for both options is available in the Git
repositories for the book.
Before attempting the pieces in this chapter, make sure to have completed the
bootstrap steps in Chapter 7, Open Technical Practices – The Midpoint, under the
Implementing GitOps – Let's Build the Big Picture With Some Real Working Code! section.
These steps deploy the CI/CD tooling into your cluster using GitOps. The main tools
we are going to use in the next sections include Jenkins, Argo CD, and Tekton.
Jenkins–The Frontend
Jenkins is our trusty friend who will do the hard crunching of code—compiling, testing,
and so on—on our behalf. In order to get the best out of all the tools in our kit bag, there
are a few items we need to configure first. This includes, among other things, managing
secrets and adding webhooks to trigger our Jenkins automation as soon as a developer
commits their code.
To create this connectivity between our configuration repository and Argo CD, we need
to create an Argo CD app-of-apps to point to the repository. The app-of-apps pattern is
a neat way to describe all elements of a system. Imagine we have an app, named App-1,
which is our full system. This App-1 is made up of independently deployable services
such as App-1a, App-1b, App-1c, and so on. For PetBattle, we have the whole system
that is all of our frontend, APIs, and other services. We also have one of these for our
staging and test environments; this allows us to think of our app-of-apps as a suite of
applications.
If we clone the ubiquitous-journey7 project that we set up in Chapter 7, Open
Technical Practices – The Midpoint, to bootstrap our cluster, there is another set of
charts in here for our application stacks located in applications/deployment. When
applied, these definitions will create our Argo CD application Custom Resource pointing
to our Helm charts that will be created by the running builds on either Tekton or
Jenkins.
The values files (values-applications-stage.yaml) contain the Helm chart version
and application version that will be updated by Jenkins on successful builds. We want
Argo CD to monitor these values when applying changes to the cluster. These values
files also contain our overrides to the base Helm chart for specific environments, for
example, the config map that the frontend is configured with to communicate with
the services it requires to work properly (tournament-svc, cats-svc, and so on).
The following snippet shows the definition of this. These values will differ between
development, testing, and staging, so this pattern gives us the ability to version control
the configuration we want the application to use on startup.
pet_battle_stage:
  name: pet-battle-stage
  enabled: true
  source: *helm_repo
  chart_name: pet-battle
  sync_policy_automated: true
  destination: labs-staging
  source_ref: 1.0.6
  values:
    fullnameOverride: pet-battle
    image_repository: quay.io
    image_name: pet-battle
    image_namespace: petbattle
    config_map: '{ "catsUrl": "https://pet-battle-api-labs-staging.apps.hivec.sandbox1405.opentlc.com", "tournamentsUrl": "https://pet-battle-tournament-labs-staging.apps.hivec.sandbox1405.opentlc.com", "matomoUrl": "https://matomo-labs-ci-cd.apps.hivec.sandbox1405.opentlc.com/", "keycloak": { "url": "https://keycloak-labs-staging.apps.hivec.sandbox1405.opentlc.com/auth/", "realm": "pbrealm", "clientId": "pbclient", "redirectUri": "https://pet-battle-labs-staging.apps.hivec.sandbox1405.opentlc.com/*", "enableLogging": true } }'
    image_version: "master"
  project:
    name: pet-battle-stage
    enabled: true
7 https://github.com/petbattle/ubiquitous-journey
So, when we deploy an Argo CD application pointing to this Git repository, it will find
additional apps and so create our app-of-apps pattern. The structure of the repository
is trimmed, but you can see that the chart is very basic, having just two templates for
creating a project in Argo CD and the application definitions to put inside the project.
ubiquitous-journey/applications
├── README.md
├── alerting
│ └── ....
├── build
│ ├── ...
└── deployment
├── Chart.yaml
├── argo-app-of-apps-stage.yaml
├── argo-app-of-apps-test.yaml
├── templates
│ ├── _helpers.tpl
│ ├── argoapplicationdeploy.yaml
│ └── argocd-project.yaml
├── values-applications-stage.yaml
└── values-applications-test.yaml
We could go to the Argo CD UI and connect it to this repository manually, or use the
Argo CD CLI to create the Argo CD application Custom Resource, but let's just run this
handy one-liner to connect things up for both our staging and test app-of-apps:
# from the root of ubiquitous-journey
$ cd applications/deployment
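The one-liner renders the app-of-apps chart and applies the result to the cluster. As a
sketch of what it could look like (the release name is arbitrary and the exact flags may
differ from the Ubiquitous Journey documentation), using the values files shown in the
repository tree above:
$ helm template app-of-apps -f argo-app-of-apps-stage.yaml . \
    | oc -n labs-ci-cd apply -f -
$ helm template app-of-apps -f argo-app-of-apps-test.yaml . \
    | oc -n labs-ci-cd apply -f -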
Figure 14.22: Argo CD sync of the pet-battle application suite for the staging environment
To extend this app-of-apps pattern now is very simple. We only need to connect Git to
Argo CD this one time. If, after the next few Sprints, the PetBattle team realizes they
need to add a new component or service, they can simply extend the values YAML, that
is, values-applications-stage.yaml or values-applications-test.yaml for their
staging or test environment, with a reference to the new component chart location and
version. For example, for cool-new-svc:
cool_new_svc_stage:
  name: cool-new-svc-stage
  enabled: true
  source: *helm_repo
  chart_name: cool-new-svc
  sync_policy_automated: true
  destination: labs-staging
  source_ref: 1.0.1 # version of the helm chart
  values:
    fullnameOverride: cool-new-svc
    image_repository: quay.io
    image_name: pet-battle
    image_namespace: petbattle
    image_version: "2.1.3" # version of the application image
  project:
    name: pet-battle-stage
    enabled: true
Quay.io
On Quay.io, create three new repositories, one for each of the application
components that we will be building. You can mark them as public, as private
repositories cost money.
These repositories serve as empty image stores for us to push our images to from
within the pipeline. But we need to provide Jenkins with the correct access to be able
to push to them, so go ahead and hit the robot icon on the UI to create a new service
account that Jenkins can use. Give it a sensible name and description for readability.
We are going to grant this robot Write permission on all the repositories we created
previously. Hit Add permissions:
Now that the repositories and robot account have been created, we can download the
secret to be used in our pipelines! Hit the cog on the side of the secret name and select
View Credentials.
On the page that pops up, download the Kubernetes YAML and store it in your fork of
pet-battle.
GitHub
A secret is also required for Jenkins to be able to push updates to our Helm values
files stored in Git. The values files for our applications will contain the properties we
want to pass to our templates, such as ConfigMap variables, or locations of images,
such as Quay.io. Our values files for the deployment of our applications will also
hold a reference to the image version (that is, the SemVer of our app, such as 1.0.1)
to be deployed by patching our DeploymentConfigs. We don't want to manually
update this but have a robot (Jenkins) update this when there has been a successful
build. Therefore, this secret will be needed to write these version changes to our
configured repositories, which Argo CD is pointing at. We track version
changes across all our environments in this way because, after all, if it's not in Git, it's
not real.
To create a secret for GitHub, simply go to the Developer Settings view. While
logged into GitHub, that's Settings > Developer Settings > Personal access tokens
or https://github.com/settings/tokens for the lazy. Create a new Personal Access
Token (PAT); this can be used to authenticate and push code to the repository. Give it a
sensible name and allow it to have repository access.
Save the token's value, as you won't be able to access it again without generating a new
one. With the token in place, we can create a secret in Jenkins by adding it to a basic-
auth secret. In order for Jenkins, which is running in the same namespace as where
this secret will be created, to be able to consume the value of the secret, we can apply a
special annotation, credential.sync.jenkins.openshift.io: "true". This little piece
of magic will allow any credentials to be updated in Jenkins by only updating the secret!
Update the secret with your values for GITHUB_TOKEN and GITHUB_USERNAME if you're
following along in your own fork, and apply it to the cluster:
$ cat <<EOF | oc apply -f-
apiVersion: v1
stringData:
  password: GITHUB_TOKEN
  username: GITHUB_USERNAME
kind: Secret
metadata:
  labels:
    credential.sync.jenkins.openshift.io: "true"
  name: git-auth
  namespace: labs-ci-cd
type: kubernetes.io/basic-auth
EOF
SealedSecrets
You might be thinking that these secrets should probably be stored somewhere safe—
and you're right! If you want to explore the idea of storing the secrets in Git so they
too are GitOpsy (yes, I did just invent a word there), then we could use SealedSecrets
by Bitnami. It provides a controller that encrypts secrets into SealedSecret Custom
Resources, which are safe to store as plain text. This means we can commit them to Git!
When a SealedSecret is applied to the cluster, the controller decrypts it and creates a
regular Kubernetes secret on your behalf. We've written our Jenkins Helm chart to accept
SealedSecrets for this very reason!
You can deploy SealedSecrets to the cluster by enabling it in the Ubiquitous Journey
Git project. Open up bootstrap/values-bootstrap.yaml. It's as simple as changing the
enabled flag to true and, of course, Git committing the changes. This will resync with
Argo CD and create an instance of Bitnami SealedSecrets in your cluster, by default
in the labs-ci-cd namespace. Because this is a new component we're adding to our
tooling, we should of course also update our Big Picture with the tool and a sentence to
describe how we use it.
sealed-secrets:
  # Disabled by default
  enabled: true
  nameOverride: sealed-secrets
Once the controller has been created, we can seal our secrets by following these few
steps:
1. Install kubeseal using the instructions found on their GitHub releases page:
https://github.com/bitnami-labs/sealed-secrets/releases.
2. Log in to the cluster where SealedSecrets is deployed and take note of the
namespace (in our case this defaults to labs-ci-cd).
3. Process your existing secret using the kubeseal command-line utility. It is
important to set the correct namespace, otherwise the secret will not be unsealed.
In this case, we're going to seal a secret named super-dooper. The raw secret
should look something like this:
# create secret file from step 3
$ cat << EOF > /tmp/super-dooper.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: super-dooper
  labels:
    credential.sync.jenkins.openshift.io: "true"
type: "kubernetes.io/basic-auth"
stringData:
  password: "myGitHubToken"
  username: "donal"
EOF
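The sealing step itself looks something like the following sketch; exact flags can vary
between kubeseal releases, and the controller name here assumes the nameOverride we
set in the bootstrap values:
# seal the plain secret into a SealedSecret, scoped to the labs-ci-cd namespace
$ kubeseal --format yaml \
    --namespace labs-ci-cd \
    --controller-namespace labs-ci-cd \
    --controller-name sealed-secrets \
    < /tmp/super-dooper.yaml > /tmp/sealed-super-dooper.yaml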
4. You can now apply that secret straight to the cluster for validation, but you should
add it to the cluster using Argo CD by committing it to Git. If it's not in Git, it's
not real. Here, we can see what the SealedSecret looks like before it's applied to
the cluster. As you can see, it's a very large, encrypted string for each variable we
sealed:
# have a look at the sealed secret
$ cat /tmp/sealed-super-dooper.yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
creationTimestamp: null
name: super-dooper
namespace: labs-ci-cd
spec:
encryptedData:
password: AgC6NyZa2to2MtKbXYxJBCOfxmnSQ4PJgV8KGdDRawWstj24FIEm5YCyH6n/
BXq9DEPIJL4IshLb2+/kONKHMhKy0CW5iGVadi13GcvO7lxZpVLeVr4T3nc/AqDwPrZ2KdzAI-
62h/7O4o2htRWrYnKqzmUqdESzMxWCK9dl7HZyArGadvwrH98iR48avsaNWJRvhMEDD6EM-
jS5yQ2vJYFMcMzOVyMtbD4f8e3jK7OO+vqoXsHtiuHk4HB63BZZqreiDcFgZMGaD6Bo6FyMSs/
tbkBjttiRvP5zZJ5fqC8IEgbZeuwhJ1eVOeKs/2xGBUMoEiYo6cKaUOqV9k130K2wcdX-
gN8B25phkRK9DpO23LoF/7/uLwNn01pCcxAxm1/2kvX24uPLtirmg1rQ03E9qrnlvyky-
J+9G3QBNtIlsiuoYmEYogZCSRZX29Cm0GWLolYPhlhMDDN6VQI6ktKCH6ubMcbh888Gn2K-
F8NzpQvV5wN9mQVFMR8+wNVkLGsaN+EEdgAc2CmiajIXur3zu4Menq3iWzJcWHdyT-
NlROpJeFH9qyfJLzbkWinPyzyBZEXeiZVKZ/ZAYEvXpyHAUngbnNnUO8HBwsLHb//
uYEzWRufIJezCy9PYxUVSBNIdfPybuCSeb87Bgry/+5D5aUjrqLuKJUhsLWIL3waHyvQswU-
jCQlcgFA7OZ9lwMqkDUYy9SnYatIZ98kf1Z6DA==
username: AgDY4NgxKug07A+jZ63h0Rdisfm6o7kVaKaiaPek9ZOiHsox1A0P4k-
lYaK/7cTEyOCpFVC/2nx00TX6F2KbA1GsRHkjnU/79nOkYWqsWWTU32c/0Re8sSEIPX7aVgR/
sMXYeWyRediRogA23xFcFzIFSvw4fZ2XpeX0BZNPbMdwZv2b+j/cjW8Po75B5gqbjwhMy-
H36QUApnjmoWmutLONVgAnHVM2rBr1Kx4wgxyy+hdmj+6ZkgMBckd53lMVX0unRVW93I-
j2eDcxTwN+HvVY7nBDmxVHuYAt6t31+DXpqBew10kNDxd8Xw2MpUFDb3JpMwIVtTnt-
mgeoyCHmo7nCYzQkGhwdrEYzoLVQBq+jf0Wmu3YRpEzZbegdTU3QfS1J7XM+86pAF6g-
cgbmrhpguGkU+PwnzPMxGNkq445oEPpvRemftjyFf7A8C+bZ90lrvVzZsfOue8WdXKm-
66vZoYuMPqA2o2HQV0IraaNGYPt9FmiAuXqWhzKsSVsbURXUUOaZIPAyX1z5V1reRz+gs/
cGHYKbmUua7XOFQr32siANI1IkRPi9cT+9iP9GGdq5RzZL75cJGFV8BorZ3CMADGC+skrFKO-
ExFvSrvofBnODB/xnPuirzsnQPcxtdvIz+sCv4M8qG2j0ASH1DBLLF7vMP9rLBgA1sPtzqX-
0CBakjuOjYDqpbXaKqHrM6kdTuBvO7tTDpAYA==
template:
metadata:
creationTimestamp: null
labels:
credential.sync.jenkins.openshift.io: "true"
name: super-dooper
namespace: labs-ci-cd
type: kubernetes.io/basic-auth
sealedsecret.bitnami.com/super-dooper configured
5. To GitOpsify this (yes, again, I did just make that up), open up your Jenkins
configuration in ubiquitous-journey/values-tooling.yaml. Set the sealed_secrets
values on the Jenkins entry as follows, using the output of the sealing step to add the
encrypted information for each key. The example here is trimmed for readability:
  - name: jenkins
    enabled: true
    source: https://github.com/redhat-cop/helm-charts.git
    ...
    values:
      ...
      sealed_secrets:
        - name: super-dooper
          password: AgAD+uOI5aCI9YKU2NYt2p7as.....
          username: AgCmeFkNTa0tOvXdI+lEjdJmV5u7FVUcn86SFxiUAF6y.....
6. If you've already manually applied the secret in Step 4, delete it by running
cat /tmp/sealed-super-dooper.yaml | oc delete -f- -n labs-ci-cd. Then
Git commit these changes so they are available to Jenkins and, more importantly,
stored in Git. In Argo CD, we should see that the SealedSecret generated a regular
secret.
7. In Jenkins, we should see that all the secrets synchronized using the magic
annotation (credential.sync.jenkins.openshift.io: "true") have become
available.
For simplicity here, we will continue without having sealed the secrets; the topic of
secrets and GitOps has been included only for illustrative purposes.
For us, we are creating the components in the Big Picture, which are Build > Bake > Deploy.
The Build should always take the source code, compile it, and run some linting (static code
checking) and testing before producing a package and storing it in Nexus. We should
produce test reports and have them interpreted by Jenkins when deciding whether or not
to fail the build. We are building an Angular application, but Jenkins does not know how
to execute npm or other JavaScript-based commands, so we need to tell it to use the
agent that contains the npm binary. This is where the agents that we bootstrapped into
the Jenkins deployment come in handy. Each agent that is built extends the base
agent image with the binary we need (that is, npm) and is pushed to the cluster. The
resulting ImageStream is then labeled role=jenkins-slave to make it automatically
discoverable by Jenkins when it is running in the same namespace. This means that to
use it, we just need to configure our Jenkins stage with agent { label "jenkins-agent-npm" }.
The Build stage will use this agent and execute some steps. The first thing is to capture
the app's version to be used throughout the pipeline by reading the app's manifest
(pom.xml for Java or package.json for Node). This version is then used on all generated
artifacts, including our image and Helm chart version, and should follow SemVer (for
example, <major>.<minor>.<patch> = 1.0.1). We will then pull our dependencies,
run our tests, lint, and build our code before publishing the results to Jenkins and the
package to Nexus.
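As an illustrative, simplified sketch of such a Build stage in a declarative Jenkinsfile
(the real implementation lives in the pet-battle repository; the npm scripts and the Nexus
registry variable here are assumptions):
stage("Build (npm)") {
    agent { label "jenkins-agent-npm" }
    steps {
        script {
            // capture the SemVer from the app manifest (package.json)
            env.VERSION = sh(returnStdout: true,
                script: 'node -p "require(\'./package.json\').version"').trim()
        }
        // pull dependencies, then test, lint, and build the Angular app
        sh "npm install"
        sh "npm run lint"
        sh "npm run test -- --watch=false --code-coverage"
        sh "npm run build"
        // publish the packaged artifact to Nexus (registry URL assumed to be set in the environment)
        sh 'npm publish --registry ${NEXUS_REGISTRY_URL}'
    }
}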
Our Bake will always take the output of the previous step, in this case, the package
stored in Nexus, and pop it into a container. In our case, we will be running an
OpenShift build. This will result in the package being added to the base container and
pushed to a repository. If we are executing a sandbox build, say some new feature on a
branch, then we are not concerned with pushing the image externally—so we can use
the internal registry for OpenShift. If this build is a release candidate then we'll push
into Quay.io (our external registry for storing images). The breakdown of the steps for
a Bake is found in the Git repository that accompanies this book: https://github.com/
petbattle/pet-battle/blob/master/Jenkinsfile.
From a bird's-eye view, the idea is to get the package from Nexus and then create an
OpenShift BuildConfig with a binary build and pass the package to it. You should then
see the build execute in the OpenShift cluster.
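A rough sketch of that flow is shown below; the Nexus path, artifact name, and
BuildConfig name are assumptions for illustration:
# fetch the packaged app from Nexus
$ curl -o pet-battle.tar.gz \
    "${NEXUS_URL}/repository/labs-static/pet-battle/pet-battle-1.0.1.tar.gz"

# stream the artifact into an existing binary BuildConfig and follow the build logs
$ oc start-build pet-battle --from-archive=pet-battle.tar.gz --follow -n labs-ci-cd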
The deployment will take the application that has just been packaged up with its
dependencies and deploy it to our cluster. Initially, we will push the application to
our labs-test environment. We want to package the application and its Kubernetes
resources as a Helm chart, so for the deployment we will patch the version of the
application referenced in the values file with the latest release. For this reason, our
Deploy stage is broken down into two parts.
The first one patches the Helm chart with the new image information, as well as any
repository configuration, such as where to find the image we just Baked! This is then
stored in Nexus, which can be used as a Helm chart repository.
Secondly, it will install this Helm chart. Depending on what branch we're on, this
behavior of how the application will be deployed differs. If we're building on master
or main, it is a release candidate, so there is no more oc applying some configuration—
this is GitOps land! Instead, we can commit the latest changes to our Argo CD config
repository (Ubiquitous Journey). The commits on this repository should be mostly
automated if we're doing this the right way. Managing our apps this way makes rollback
easy—all we have to do is Git revert!
Figure 14.34: Jenkins automated commit of the new version from a pipeline run
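A sketch of those two parts for a release candidate is shown below; the chart directory,
the Nexus upload, the pet_battle_test values key (assumed by analogy with the
pet_battle_stage entry shown earlier), and the yq usage are illustrative assumptions
rather than the exact pipeline code:
# 1. patch and publish the chart with the new image and chart versions
$ helm package chart --version 1.0.1 --app-version 1.0.1
$ curl -u <user>:<password> "${NEXUS_URL}/repository/labs-helm-charts/" \
    --upload-file pet-battle-1.0.1.tgz

# 2. on main/master, commit the new versions to the Argo CD config repo instead of applying them
$ git clone https://github.com/<your-fork>/ubiquitous-journey.git && cd ubiquitous-journey
$ yq eval -i '.pet_battle_test.source_ref = "1.0.1"' \
    applications/deployment/values-applications-test.yaml
$ git commit -am "ci: bump pet-battle chart to 1.0.1" && git push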
Branching
Our pipeline is designed to be multibranch, creating a new pipeline instance
for every branch that is committed to Git. It is intended to have slightly different
behavior on each branch. In our world, anything that gets merged to master or main is
deemed to be a release candidate. This means that when a developer is ready to merge
their code, they amend the package.json version (or pom.xml version for Java
projects) with the new release they want to try to get all the way through the pipeline
to production. We could automate the version management, but in our workflow it has
always been easier for a developer to do it, as they are best placed to
decide whether it's a patch, a minor, or a major release.
This means that anything not on the main or master branch is deemed to be a sandbox
execution of the pipeline. If something is a sandbox build, it is there to provide fast
feedback to the developers of the current state of development in that feature. It can
also act as a warning to other engineers that something is not ready to be merged if
it's failing. The sandbox builds should be thought of as ephemeral—we're not interested
in keeping them hanging around—hence we make some key changes to the pipeline to
accommodate this:
1. Internal registry: If our built image is pushed to our external repository, it will
become clogged up and messy with unnecessary images. Every time a developer
commits to any branch it would create new images, so it can introduce a cleanup
headache; hence we use the internal registry, which automatically prunes old
images for us. We only use the external registry when we know a release could go
all the way to production.
2. Helm install: For our deployments, we're not interested in bringing in a
heavyweight tool like Argo CD to manage the development/sandbox deployments.
It's unnecessary, so we just use Jenkins to execute a Helm install instead. This will
verify that our app can deploy as we expect. We use Argo CD and GitOps to manage
the deployments in the test and staging environments, but we should treat any lower
environments as ephemeral (as we should test and staging too).
This approach allows us to support many different types of Git workflow. We can
support GitHub Flow, Gitflow, and Trunk, all via the same consistent approach to the
pipelines.
Webhooks
Before we actually trigger Jenkins to build things for us, it's important to add a few
webhooks to make our development faster. We need two, one for the Argo CD config
repo and one for Jenkins in our source code repository.
When we commit a new change to the Git repositories that Argo CD is watching,
Argo CD picks it up by polling. The poll interval is configurable, but who can be bothered
to wait at all? Argo CD allows you to configure a webhook to tell it to initiate a sync as
soon as a change has been made.
This is particularly important if we want things to happen after Argo CD has worked its
magic, such as in a system test. Our pipeline in Jenkins runs synchronously, but Argo
CD is asynchronous, and therefore anything we can do to reduce the wait between
these behaviors is critical.
On GitHub, we can configure the webhook for Ubiquitous Journey to trigger Argo CD
whenever the repository updates. On GitHub, add the webhook with the address of our
Argo CD server followed by /api/webhook.
Jenkins
Every time we commit to our source code repository, we want Jenkins to run a build.
We're using the multibranch plugin for Jenkins, so this means that when we commit to
the repository, the webhook will trigger a branch scan, which should bring back any
new feature branches to build pipelines or create builds for any new code commits on
any branch.
Configuring the Jenkins webhook for the pet-battle frontend is simple. On GitHub's
Hooks page, add the URL to our Jenkins instance in the following form, where the
trigger token is the name of our GitHub project. As a convention, I tend to use the name
of the Git project as the token, so the same would apply for the backend if you were
building it using Jenkins too:
JENKINS_URL/multibranch-webhook-trigger/invoke?token=[Trigger token]
For example, the frontend application's webhook URL would look something like this:
https://jenkins-labs-ci-cd.apps.petbattle.com/multibranch-webhook-trigger/
invoke?token=pet-battle
If you’re following along with a fork of the Ubiquitous Journey and want to see the
pipeline run end to end, update both ARGOCD_CONFIG_REPO to point to your fork
and QUAY_ACCOUNT to resolve to your user on Quay.io.
These are used by the seed-multibranch-pipelines job that is baked into the Jenkins
image to scan the organization for repositories that contain a Jenkinsfile and are not
archived. If it finds any, it will automatically scaffold out multibranch Jenkins jobs for us.
In our case, we have a Jenkinsfile for both the Cats API and the PetBattle frontend,
so jobs are created for us without having to configure anything! If you're following
along and not using GitHub but GitLab, you can set GITLAB_* environment variables to
achieve the same effect.
If you open Jenkins and drill down into the pet-battle folder for the frontend code
base, you should see builds; for example, a Git branch called cool-new-cat and the
master with pipeline executions for each of them. Opening the Blue Ocean view, we get
a much better understanding of the flow control we built, as previously discussed.
For the master branch, which we deem to be a release candidate, the artifacts that are
built could go all the way. If we are updating our application, we bump the manifest
version along with any changes we're bringing in and Git commit, which should trigger
the build. From this point, our build environment is configured, and the pipeline
should execute. We target an external repository and the image that's built will be
pushed to Quay.io for portability across multiple clusters. Our Helm chart's values are
patched and pushed to Nexus for storage. If we need to update our Helm chart itself,
for example, to add some new configuration to the chart or add a new Kubernetes
resource, we should of course bump the chart version too. For our deployment,
we patch the Argo CD config repository (Ubiquitous Journey) with the new release
information, and it should sync automatically for us, deploying our application to the
labs-test namespace! We then run a verify step to check that the version being rolled
out matches the new version (based on the labels) and has been successful.
For our feature branches, the idea is much the same, but without the need for an
external repository. Our charts are also manipulated to override the name to include
the branch. This means that on each commit to a feature branch, we get a new
application deployed containing the branch name in the route. So, for our cool-new-
cat branch, the application is deployed as cool-new-cat-pet-battle and is available in
the developmental environment.
The remaining stages that were added to the Big Picture, System Test and Promote, will
be covered in the next chapter, when we look in more detail at the testing for PetBattle.
Jenkins does tackle these problems with the use of shared libraries and, more recently,
the Jenkins Templating Engine (JTE). The JTE tackles the problem by enforcing
pipeline approaches from a governance point of view. While this might seem like a great
way to standardize across an enterprise, here be dragons!
Applying a standard pipeline without justification, or without the ability for teams to
raise pull requests and make changes for their own specific use case, is the same as
having Dev and Ops in separate rooms. We've worked with plenty of customers who
have tried approaches like this, and ultimately it makes them go slower, rather than
faster. The teams putting the pipelines in place think they're helping and providing a
great service, but when things go wrong, they are the bottleneck to fixing them. For some
teams, the one-size-fits-all approach is simply not applicable to their use case, and the
pipeline becomes something standing in the way of going faster.
Tekton is another way for us to get greater pipeline reusability and also honor more of
our GitOps landscape. Let's explore it now for our Java microservices.
Tekton–The Backend
Tekton8 is an open source cloud-native CI/CD tool that forms the basis for OpenShift
Pipelines.9
Tekton Basics
There are many similarities between what Jenkins does and what Tekton does. For
example, both can be used to store pipeline definitions as code in a Git repository.
Tekton is deployed as an operator in our cluster and allows users to define Pipeline
and Task definitions in YAML. Tekton Hub10 is a repository for sharing these YAML
resources among the community, giving great reusability to standard workflows.
8 https://tekton.dev
9 https://docs.openshift.com/container-platform/4.7/cicd/pipelines/understanding-openshift-pipelines.html
10 https://hub.tekton.dev
OpenShift also makes these available globally as ClusterTasks. To write a pipeline you
can wire together these task definitions. OpenShift provides a guided Pipeline builder
UI for just this task. You link various tasks together and define parameters and outputs
as specified in each task definition.
There are numerous task activities in our pipeline definitions that require persistent
storage. When building our backend PetBattle API and Tournament applications using
Maven, we pull our Java dependencies via our Nexus repository manager. To speed
up this process, we can perform the same caching we might do on our laptops and
store these locally between builds in a .m2/repository folder and share this between
builds. We also use persistent storage for built artifacts so they can be shared between
different steps in our pipeline. Another use case is to mount Kubernetes secrets into
our pipelines:
# maven pipeline
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: maven-pipeline
  labels:
    petbattle.app/uj: ubiquitous-journey
spec:
  workspaces:
    - name: shared-workspace
    - name: maven-settings
    - name: argocd-env-secret
    - name: maven-m2
    - name: git-auth-secret
In Tekton, we link these Kubernetes objects with the named workspaces when we
create what is called the PipelineRun, a piece of code that represents one run of a
pipeline. Similarly, the execution of a single task is a TaskRun. Each workspace is then
made available for the tasks in that PipelineRun as shown.
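A minimal sketch of those bindings in a PipelineRun is shown below; the claimName:
maven-m2 and secretName: git-auth pairings mirror the Pipeline above, while the other
PVC, ConfigMap, and secret names are assumptions:
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: maven-pipeline-run-
spec:
  pipelineRef:
    name: maven-pipeline
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: build-images       # assumed PVC for sharing artifacts between tasks
    - name: maven-m2
      persistentVolumeClaim:
        claimName: maven-m2           # cached .m2/repository shared between builds
    - name: maven-settings
      configMap:
        name: maven-settings          # assumed ConfigMap holding settings.xml
    - name: argocd-env-secret
      secret:
        secretName: argocd-env        # assumed secret with Argo CD credentials
    - name: git-auth-secret
      secret:
        secretName: git-auth          # Git credentials mounted into the pipeline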
Reusable Pipelines
There are some choices to be made before you start writing and designing your Tekton
pipeline. The first is to choose whether you write a pipeline for each application, or
whether you write reusable pipelines that can be used for applications that are similar.
In PetBattle, we started with one pipeline per application; this is similar to having
a Jenkinsfile in each application Git repository. Both the API and Tournament
PetBattle applications are built using Java, Quarkus, and Maven, so it makes sense to
consolidate the pipeline code and write a reusable parameterized pipeline for these two
applications because they will always have similar tasks. We use our maven-pipeline in
PetBattle to do this.
Of course, you could keep the reuse at the Task level only; we share common tasks
across the PetBattle UI, API, and Tournament applications. Ultimately, the development
team has to balance the benefits of maintaining one pipeline against application pipeline
autonomy. There is no one-size-fits-all answer.
Figure 14.43: The list of task definitions used by PetBattle's Tekton Pipelines
The maven-pipeline starts by cloning the application and CI/CD (Ubiquitous Journey)
repositories into the shared workspace. We check the code quality by invoking Maven
to build and test the application, with quality reports being uploaded to our SonarQube
instance.
We check that the quality gate in SonarQube has passed and then invoke Maven to
package our application. Tekton offers us some useful constructs here: we can retry a
task if it fails by specifying the number of retries, and we can control the ordering of
tasks using the runAfter list of task names.
    - name: quality-gate-check
      retries: 1
      taskRef:
        name: sonarqube-quality-gate-check
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: WORK_DIRECTORY
          value: "$(params.APPLICATION_NAME)/$(params.GIT_BRANCH)"
      runAfter:
        - save-test-results
In Java Quarkus, the packaging format could be a fat JAR, an exploded fast JAR, or a
native GraalVM-based image. There are various trade-offs with each of these formats.11
However, we are using the exploded fast JAR in PetBattle, which allows us to trade off
between faster build times and faster startup times. This is the end of the Build stage.
We have moved the unit testing left in our pipeline, so we get fast feedback on any code
quality issues before we move on to the Bake and Deploy pipeline phases.
Figure 14.44: The view of a PipelineRun in OpenShift showing the tasks being executed
11 https://quarkus.io/guides/maven-tooling
The Bake stage is next. We use a standard OpenShift BuildConfig object, which is loaded
using Kustomize, as we do not package it with our Helm chart. We perform a binary
build using the oc start-build command on the packaged application. We decided
not to upload the built application package to Nexus because we want to work with
container images as our unit of deployment. If we were building libraries that needed
to support our services, they should be captured in Nexus at this stage. It is worth
pointing out that we could also push the image to an external registry at this point so
it can be easily shared between OpenShift clusters.
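As a sketch, the binary build step boils down to something like the following; the
BuildConfig name and the Quarkus fast JAR output directory are illustrative rather
than the exact pipeline values:
$ oc start-build pet-battle-api \
  --from-dir=target/quarkus-app \
  -n labs-ci-cd --follow --wait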
The next step is to lint and package the application Helm chart. The versioned chart
is then uploaded to Nexus. If we were on an application branch, the next pipeline
step would be a helm install into the labs-dev project. We can make use of when
statements in our Tekton pipeline to configure such behavior:
- name: helm-install-apps-dev # branches only deploy to dev
  when:
    - Input: "$(params.GIT_BRANCH)"
      Operator: notin
      Values: ["master", "trunk", "main"]
  taskRef:
    name: helm-install-from-chartrepo
When on trunk/HEAD, the ImageStream is versioned and tagged into the namespaces
we are going to deploy our application to (labs-test, labs-staging). Because we
are practicing GitOps, the applications are deployed using Argo CD and Git. The Argo
CD app-of-apps values files are updated with the new chart and image versions, and
the pipeline commits this change back into source control. Argo CD is configured to
automatically sync our applications in labs-test and labs-staging, and the last step
of the pipeline is to make sure the sync task was successful.
There is a lot of pipeline information available to the developer in the OpenShift web
console and all of the pipeline task logs can be easily seen.
Tekton also has a great command-line tool called tkn, which can be used to perform all
of the pipeline actions available in the OpenShift console, such as viewing logs, starting
pipeline runs, and defining Tekton objects.
$ tkn pr list -n labs-ci-cd
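A few other tkn commands we find handy day to day, using the same namespace as
above:
$ tkn pipeline list -n labs-ci-cd
$ tkn pipelinerun logs --last -f -n labs-ci-cd
$ tkn task list -n labs-ci-cd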
12 https://docs.github.com/en/developers/webhooks-and-events/webhook-events-
and-payloads
13 https://github.com/google/cel-go
There are some convenience templates loaded into the labs-ci-cd project by
Ubiquitous Journey that can be used to manually trigger a PipelineRun—this is handy if
you have not configured the GitHub webhook yet.
$ oc -n labs-ci-cd process pet-battle-api | oc -n labs-ci-cd create -f-
$ oc -n labs-ci-cd process pet-battle | oc -n labs-ci-cd create -f-
$ oc -n labs-ci-cd process pet-battle-tournament | oc -n labs-ci-cd create -f-
You can manually add webhooks to your GitHub projects14 that point to the
EventListener route exposed in the labs-ci-cd project.
$ oc -n labs-ci-cd get route webhook \
-o custom-columns=ROUTE:.spec.host --no-headers
Otherwise, check out the PetBattle Ubiquitous Journey documentation for Tekton tasks
that can be run to automatically add these webhooks to your Git repositories.
14 https://docs.github.com/en/developers/webhooks-and-events/creating-
webhooks
15 https://landscape.cncf.io/card-mode?category=continuous-integration-
delivery&grouping=category
Which One Should I Use?
• Extensibility model: This is important for the ecosystem around the tooling.
Jenkins has a great plugin model that allows lots of different extensions to the
core. Tekton has a similar model, though it differs in that users can run any
container image in a task. It is important to weigh up these extensions as they
offer a lot of value on top of the core tool itself. A good example is that Tekton
does not manage test dashboards and results as well as Jenkins and its plugins
do, so we might lean on Allure to do this. Reporting and dashboarding extensions
are important to make the feedback loop as short as possible during CI/CD.
Once you have considered a few of these criteria, hopefully you will agree on the right
set of tools for your product and team. A measure of design and planning is required to
answer the questions of where the various steps in your continuous deployment happen
and which application packaging approach should be used (templated or not templated,
for example). By now, we have instilled an experiment-driven approach to answering
these types of questions: it is not about one tool or the other, but about choosing the
right tool for the job at hand!
Conclusion
In this chapter, we introduced how we use Git as the single source of truth. We covered
building our source code and packaging it into container images using pipelines written
in either Tekton or Jenkins. In the next chapter, we will focus on testing, introducing a
new component to our app using Knative, running A/B tests, and capturing user metrics
using some of the advanced deployment capabilities within OpenShift.
15. Run It
There is a saying that your code has no value until it runs in production. The sentiment
here is that until your customers use your software, it's of limited value for your
business or organization. It is certainly a broad generalization! However, it does speak
to the essential nature of software that its utility is directly related to being able to run
it for whatever purposes it was ultimately written for. To reach production with the
quality of service that our customers expect, all of the code must be put through its
paces.
In this chapter, we are going to explore how the PetBattle team tests their software so
they have greater confidence in its ability to run as expected in production. Testing is
multifaceted, as we discussed in Chapter 7, Open Technical Practices – The Midpoint,
and we are going to cover in some detail the types and scope of testing, from unit tests
to end-to-end testing, through to security checks and more.
When the hobbyist version of the application went live, the PetBattle founders soon
discovered that malicious content was being uploaded to the site. As part of this
chapter, we'll look at a modern-day solution to this problem using a trained AI-ML
model.
In the last section of this chapter, we explore some common cloud deployment patterns
and demonstrate A/B testing and experimentation, for gaining insight into how we can
safely measure and learn the impact of deploying new features in production.
Why Serverless?
When deploying a container on a Kubernetes-based platform, such as OpenShift,
Kubernetes takes on the responsibility of managing the running container and, by
default, restarting it if it terminates due to an error. Basically, there's always a container
running. This is all good and fine for containers that are constantly receiving and
processing traffic, but it's a waste of system resources constantly running a container
that receives traffic either occasionally or in bursts.
What we'd like to achieve is to deploy a container and have it start up only when
needed, that is, during incoming requests. Once active, we want it to process the
incoming requests and then, after a period of no traffic, shut down gracefully until
further incoming requests are received. We'd also like the container instances to scale
up in the event of a surge of incoming requests.
It is possible to automate the scaling up and down of the number of container instances
running on the platform using the Kubernetes Horizontal Pod Autoscaler; however,
this does not scale to zero. We could also use something like the oc scale command,
but this requires a fair amount of scripting and component integration. Thankfully, the
Kubernetes community thought about this and came up with a solution called Knative.1
Knative has two major components, Knative Serving and Knative Eventing. Serving
is used to spin up (and down) containers depending on HTTP traffic. Knative Eventing
is somewhat equivalent but is focused on spinning up containers based on events and
addresses broader use cases. For the purposes of this book, we are going to focus on
using Knative Serving. However, we will also give an example of how Knative Eventing
could be used.
1 https://knative.dev/
2 https://github.com/yahoo/open_nsfw
3 https://www.seldon.io/
4 https://min.io/
We loaded the trained data model into MinIO, our S3-compatible object store.
The saved model is something we can serve up using TensorFlow Serving,5 which
basically gives us an API endpoint to call our saved model with. There is an open source
TensorFlow serving image we can deploy and it's a matter of configuring that to find
our saved model in our S3 storage location.
We have glossed over the large amount of engineering that goes into making AI, ML,
and Ops pipelines, not because it is not an interesting subject, but mainly because it
would require a whole other book to do it justice! If this subject is close to your heart,
then take a look at the Open Data Hub project.6 This is an open source project based
on Kubeflow,7 providing tools and techniques for building and running AI and ML
workloads on OpenShift.
5 https://www.tensorflow.org/tfx/guide/serving
6 http://opendatahub.io/
7 https://www.kubeflow.org/
# Create the Knative Operator - this may have already been created for you,
# but here is how to do it on the command line
$ argocd app create knative \
  --repo https://github.com/rht-labs/refactored-adventure.git \
  --path knative/base \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace openshift-serverless \
  --revision master \
  --sync-policy automated
8 https://github.com/openshift-knative/serverless-operator
Once installed, you should be able to see that the operator has installed successfully in
the openshift-serverless namespace.
We can also put this in our PetBattle UJ bootstrap from Chapter 7, Open Technical
Practices – The Midpoint, so that we don't need to run these commands manually. Add
the following to our values-tooling.yaml and check it into Git:
# Knative stanza in values-tooling.yaml
- name: knative
  enabled: true
  destination: openshift-serverless
  source: https://github.com/rht-labs/refactored-adventure
  source_path: knative/base
  source_ref: master
  sync_policy: *sync_policy_true
  no_helm: true
The operator is now ready for us to deploy our Knative service.
It may take a minute or so for the containers to start up and load the model data into
MinIO; they may restart a few times while doing this. Once successful, the output of
the oc get pods command should show the MinIO S3 pod, a completed data load job,
and the TensorFlow Knative service pod:
$ oc get pods --namespace petbattle
After a couple of minutes, the Knative Serving TensorFlow pod will terminate because it
is not yet being called. This is what's called Serverless scale to zero, that is, when there
are no calling workloads there is no need to run the service. An equivalent service can
also be created using the Knative command-line tool kn, which can be downloaded and
installed from the OpenShift9 console. This is useful if you want to create a new service
or are developing a service from scratch:
$ kn service create tensorflowserving-pb-nsff --namespace petbattle \
--image=docker.io/tensorflow/serving:latest \
--cmd "tensorflow_model_server" \
--arg "--model_config_file=s3://models/models.config" \
--arg "--monitoring_config_file=s3://models/prometheus_config.config" \
--arg "--rest_api_port=8501" \
--env S3_LOCATION=minio-pet-battle-nsff:9000 \
--env AWS_ACCESS_KEY_ID=minio \
--env AWS_SECRET_ACCESS_KEY=minio123 \
--env AWS_REGION=us-east-1 \
--env S3_REGION=us-east-1 \
--env S3_ENDPOINT=minio-pet-battle-nsff:9000 \
--env S3_USE_HTTPS="0" \
--env S3_VERIFY_SSL="0" \
--env AWS_LOG_LEVEL="3" \
--port 8501 \
--autoscale-window "120s"
9 https://docs.openshift.com/container-platform/4.7/serverless/serverless-getting-
started.html
10 https://knative.dev/docs/serving/
$ curl http://tensorflowserving-pet-battle-nsff-labs-dev.apps.hivec.sandbox882.opentlc.com/v1/models/test_model
{
  "model_version_status": [
    {
      "version": "1",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    }
  ]
}
11 https://developers.redhat.com/blog/2020/06/30/kourier-a-lightweight-knative-
serving-ingress/
12 https://medium.com/@kamesh_sampath/serverless-blue-green-and-canary-
with-knative-kn-ad49e8b6aa54
Now submit these to our NSFF service using a simple curl command:
$ HOST=$(kn service describe tensorflowserving-pb-nsff -o url)/v1/models/test_model:predict
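The exact request body depends on the model's signature, but TensorFlow Serving's
REST predict endpoint expects a JSON document with an instances array, so the call
is along these lines (payload.json is a placeholder for the encoded test images):
$ curl -s -X POST "${HOST}" \
  -H 'Content-Type: application/json' \
  -d @payload.json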
The response from our model is a predictions array containing two numbers. The first
is a measure of Safe for Families, the second is a measure of Not Safe for Families, and
they add up to 1.
So, we can see that Daisy Cat has a very high safe for families rating (0.993) compared
to our wrestlers (0.014), and we can use this in our PetBattle API to determine whether
any given image is safe to display. Through ad hoc testing, we have set a threshold of
>= 0.6 for images we think are safe to view in the PetBattle UI.
We can redeploy our PetBattle API service to call out to the NSFF service by setting the
nsff.enabled feature flag to true and using the hostname from the Knative service,
from a bash shell on the command line:
$ HOST=$(kn service describe tensorflowserving-pet-battle-nsff -o url)
$ helm upgrade --install pet-battle-api petbattle/pet-battle-api \
--version=1.0.8 \
--set nsff.enabled=true \
--set nsff.apiHost=${HOST##http://} \
--set nsff.apiPort=80 --namespace petbattle
If we now upload these test images to PetBattle via the UI and check the API server, we
can see that the boxing picture has a false value for the ISSFF (Is Safe for Families) flag
and Daisy Cat has a true value:
Figure 15.5: PetBattle API saved images with the ISSFF flag
The API code will not return any pictures to the PetBattle UI that are deemed NSFF.
For example, the API code to return all pets in the PetBattle database is filtered by the
ISSFF flag being set to true:
@GET
@Operation(operationId = "list",
    summary = "get all cats",
    description = "This operation retrieves all cats from the database that are safe for work",
    deprecated = false, hidden = false)
public Uni<List<Cat>> list() {
    return Cat.find(ISSFF, true).list();
}
Now that we have the API up and running it's time to test it and see if it performs as we
expect.
Generally speaking, we have chosen to use what are considered the default test tools
for each of the application technology stacks as these are the simplest to use, are the
best supported, have good user documentation, and are generally easy to adopt if
people are new to them:
@Test
void testCat() {
    PanacheMock.mock(Cat.class);
    Mockito.when(Cat.count())
            .thenReturn(Uni.createFrom().item(23l));
    Assertions.assertEquals(23, Cat.count().await().indefinitely());
}
13 https://quarkus.io/guides/getting-started-testing
14 https://junit.org/junit5/
15 https://github.com/petbattle/pet-battle-api/blob/master/src/test/java/app/
battle/CatResourceTest.java
In Java, we use annotations to turn our plain old Java objects (POJOs) into tests. We use
the @QuarkusTest annotation to bring in the JUnit framework for this class, and we can
think of the class as a test suite that contains lots of individual tests. Each method
annotated with @Test is a single test. For this unit test, we don't have a database
running, so we use mocks16 for Cat.class. A mock is a fake object; it does not
connect to a real database, and we can use it to test the behavior of the Cat class. In
this case, we assert in our test that when we call the method Cat.count(),
which corresponds to the number of likes of our pet image in PetBattle, we receive back
the expected number (23). We use the Uni and await() functions because we are using
the reactive programming model in our Quarkus application.17
We run these unit tests as part of the automated continuous deployment pipeline and
visualize and report on the tests' success and history using our CI/CD tools, including
Jenkins, Tekton, and a test report tool such as Allure.18
In the next section, we'll continue with service and component testing with REST
Assured and Jest.
16 https://quarkus.io/guides/mongodb-panache
17 https://quarkus.io/guides/getting-started-reactive#mutiny
18 https://github.com/allure-framework
We are going to cover the basics of component testing the PetBattle user interface19
using Jest. The user interface is made of several components. The first one you see
when landing on the application is the home page. For the home page component, the
test class20 is called home.component.spec.ts:
describe('HomeComponent', () => {
  let component: HomeComponent;
  let fixture: ComponentFixture<HomeComponent>;
So in this case, the unit test will expect the HomeComponent to be created correctly
when the test is run.
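The spec file contains more setup than is shown above; a minimal sketch of the
creation check, assuming the standard Angular TestBed wiring populates component in
a beforeEach block, looks like this:
  it('should create', () => {
    expect(component).toBeTruthy();
  });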
19 https://angular.io/guide/testing
20 https://github.com/petbattle/pet-battle/blob/master/src/app/home/home.
component.spec.ts
Similarly, within the API application, REST Assured is a testing tool that allows us to
write tests using the familiar Given, When, Then syntax from Chapter 7, Open Technical
Practices – The Midpoint. Let's examine one of the service API tests in the test suite
CatResourceTest.java21:
@Test
@Story("Test pet create")
void testCatCreate() {
    CatInstance catInstance = new CatInstance();
    RestAssured.given()
            .contentType(ContentType.JSON)
            .body(catInstance.cat)
            .log().all()
            .when().post("/cats")
            .then()
            .log().all()
            .statusCode(201)
            .body(is(notNullValue()));
}
In this test, we are creating a Cat object. The Cat class is the data object in PetBattle
that contains the pet's uploaded image, along with its PetBattle vote count, and is
stored in MongoDB. In the test, given the Cat object, we use an HTTP POST to the /cats
endpoint and expect a return status code of 201, which is CREATED. We also test that
the HTTP response body is not empty; it should contain the ID of the newly created Cat:
@QuarkusTest
@QuarkusTestResource(MongoTestResource.class)
@Epic("PetBattle")
@Feature("PetEndpointTest")
class CatEndpointTest {
21 https://github.com/petbattle/pet-battle-api/blob/master/src/test/java/app/
battle/CatResourceTest.java
In this service test, we make use of the @QuarkusTestResource annotation to create and
start an embedded MongoDB for testing against. So, this test is a bit more sophisticated
than the basic unit test that was using mocks only. We also track the execution of these
service tests using our test report tool:
Now we have seen what unit tests look like, let's move up the test pyramid to have a
look at service-level testing.
package com.petbattle.integration;
...
22 https://www.testcontainers.org/
@QuarkusTest
@DisplayName("API Test Cases")
@QuarkusTestResource(MongoTestContainer.class)
@QuarkusTestResource(InfinispanTestContainer.class)
@QuarkusTestResource(KeycloakTestContainer.class)
public class ITPetBattleAPITest {
This is a great example of how containers can be used as part of a testing phase. The
containers are spun up, the tests are run, and the containers are removed. The only real
prerequisite is that the Docker daemon is running on the machine that runs the tests.
To run the integration tests, use the command mvn clean verify -Pintegration.
End-to-End Testing
Our application is made up of a frontend written in Angular, which makes calls for
data to two APIs. One is for tournaments and the other is for cats. We can think of the
interplay between these components as the system as a whole. Any time a change is
made to either of these individual applications, it should require revalidating the whole
system. The end-to-end automated testing is performed primarily in the user interface
but exercises the underlying services layer.
There are loads of tools for testing at the user interface level. Some of the
more popular ones are Selenium and Cypress, which are used to drive
a web application and simulate user behavior. There are pros and cons to each:
Selenium is just browser automation, so you need to bring your own test framework,
whereas Cypress is an all-in-one testing framework. Selenium Grid, when running
on Kubernetes, allows us to test against multiple browsers in parallel by dynamically
provisioning the browser on each test execution, meaning we don't have browsers
waiting idly for us to use them.
For our end-to-end testing, we're using Protractor from the Angular team. We already
deployed an instance of Selenium Grid built for Kubernetes by the Zalando team (called
Zalenium https://opensource.zalando.com/zalenium/) when we deployed our tooling.
Zalenium is pretty handy as it allows us to play back previous test runs and watch them
live. In your cluster, if you get the route for Zalenium (oc get routes -n labs-ci-cd)
and append /grid/admin/live, you can follow the tests as they execute, or go to
/dashboard to watch the historical test executions.
Figure 15.8: Zalenium dashboard showing test history and video playback
Figure 15.9: Example of BDD written as acceptance criteria on a Sprint board for PetBattle
Here's a test for a tournament feature written in the Given, When, Then syntax:
Feature: Tournament features
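The scenario bodies are elided here; the full feature files live in the system-tests
repository. A hypothetical scenario in this style might read:
  Scenario: Viewing the tournament leaderboard
    Given I am on the PetBattle home page
    When I navigate to the tournament
    Then I should see the current leaderboard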
The system-test project has its own Jenkinsfile, so it's already connected to Jenkins
via our seed job. We won't go through the contents of this Jenkinsfile in detail; suffice
it to say, the pipeline has two stages, as per our Big Picture: one to run the tests and
the other to promote the app if the tests have passed. Explore the code for this in the
accompanying Git repo https://github.com/petbattle/system-tests. To extend the
pet-battle Jenkinsfile to trigger our system test job, we just need to add another stage
to trigger the job, as sketched after the parameter list below. We could use the Jenkins
post{} block, but we only want to trigger the system tests if we're on master or main
and producing a release candidate.
There are a few parameters that are passed between the jobs:
• APP_NAME: Passed to the job so if the tests are successful, the promote stage
knows what app to deploy.
• CHART_VERSION & VERSION: Any update to the chart or app needs to be patched
in Git so this information is passed by the job that triggers the system tests.
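Putting this together, a trigger stage might look like the following sketch; the job path,
branch expression, and parameter wiring are assumptions based on the description
above rather than the exact Jenkinsfile:
stage("trigger-system-tests") {
    when {
        expression { GIT_BRANCH ==~ /(.*master|.*main)/ }
    }
    steps {
        // hand off to the system-tests pipeline without blocking this build
        build job: "system-tests/main",
            parameters: [
                string(name: 'APP_NAME', value: "${APP_NAME}"),
                string(name: 'CHART_VERSION', value: "${CHART_VERSION}"),
                string(name: 'VERSION', value: "${VERSION}")
            ],
            wait: false
    }
}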
We can run the system tests job manually by supplying this information to the job, but
each service with a Jenkinsfile should be able to pass these to the system tests. This job
could also be triggered from Tekton if we were to mix the approach to the pipelines.
With the two pipelines wired together, we can trigger one if the webhook is set up by
running the following command:
$ git commit --allow-empty -m "🍌 kickoff jenkins 🦆" && git push
If we now check in the Jenkins Blue Ocean Web UI, we should see the following:
On Jenkins, we should see the system tests pipeline running and promoting if
successful. The Cucumber reports are also included for the job.
These provide insight into which cases were executed for what browser and report
any failures that may have occurred. Let's switch gear a little now and take a look at
non-functional testing.
Pipelines and Quality Gates (Non-functionals)
SonarQube
As part of the Ubiquitous Journey, we have automated the Helm chart deployment of
SonarQube, which we are using to test and measure code quality. In values-tooling.yaml,
the SonarQube stanza references the Helm chart and any extra plugins that are
required. Many of the common language profile plugins are already deployed with the
base version of SonarQube, for example, Java, JavaScript, and Typescript. We add in extra
plugin entries for Checkstyle, our Java formatting check tool, and a dependency checker
for detecting publicly disclosed vulnerabilities contained within project dependencies:
# Sonarqube
- name: sonarqube
  enabled: true
  source: https://github.com/redhat-cop/helm-charts.git
  source_path: "charts/sonarqube"
  source_ref: "sonarqube-0.0.14"
  sync_policy: *sync_policy_true
  destination: *ci_cd_ns
  values:
    initContainers: true
    plugins:
      install:
        - https://github.com/checkstyle/sonar-checkstyle/releases/download/8.38/checkstyle-sonar-plugin-8.38.jar
        - https://github.com/dependency-check/dependency-check-sonar-plugin/releases/download/2.0.7/sonar-dependency-check-plugin-2.0.7.jar
With the basic SonarQube pod deployed, there is one more piece of configuration we
need to automate – the creation of a code quality gate. The quality gate is the hurdle
our code must pass before it is deemed ready to release. This boils down to a set of
conditions defined in code that specify particular measurements, for example:
• Do we have new blocking issues with the code that was just added?
• Is the code test coverage higher than a given percentage?
• Are there any identifiable code vulnerabilities?
SonarQube lets us define these quality gates23 using its REST API. For PetBattle, we use
a Kubernetes job to define our quality gate AppDefault and package it as a Helm chart
for deployment. The chart is deployed using Ubiquitous Journey and ArgoCD.
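The Kubernetes job essentially scripts calls to SonarQube's Web API; a sketch of the
kind of calls involved follows, where the admin credentials, the coverage condition, and
the exact parameter names (which vary between SonarQube versions) are assumptions:
$ SONAR=http://sonarqube-sonarqube:9000
# create the AppDefault quality gate
$ curl -u admin:${SONAR_PASSWORD} -X POST \
  "${SONAR}/api/qualitygates/create?name=AppDefault"
# add a condition, e.g. fail if coverage drops below 80%
$ curl -u admin:${SONAR_PASSWORD} -X POST \
  "${SONAR}/api/qualitygates/create_condition?gateName=AppDefault&metric=coverage&op=LT&error=80"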
The SonarQube server can be queried via its REST API to check whether a recent
report against a particular project has passed or failed this quality gate. We have
configured a Tekton step and task in our pipelines to automatically check this each
time we run a build.
Our PetBattle Java applications are configured using Maven to talk to our SonarQube
server pod and generate the SonarQube formatted reports during each build, bake, and
deploy. In the reusable maven-pipeline.yaml, we call the following target to generate
these reports:
# code analysis step maven pipeline
- name: code-analysis
  taskRef:
    name: maven
  params:
    - name: MAVEN_MIRROR_URL
      value: "$(params.MAVEN_MIRROR_URL)"
    - name: MAVEN_OPTS
      value: "$(params.MAVEN_OPTS)"
    - name: WORK_DIRECTORY
      value: "$(params.APPLICATION_NAME)/$(params.GIT_BRANCH)"
    - name: GOALS
      value:
        - install
        - org.owasp:dependency-check-maven:check
        - sonar:sonar
    - name: MAVEN_BUILD_OPTS
      value:
        - '-Dsonar.host.url=http://sonarqube-sonarqube:9000'
        - '-Dsonar.userHome=/tmp/sonar'
- name: code-analysis
  taskRef:
    name: nodejs
  params:
    - name: NPM_MIRROR_URL
      value: "$(params.NPM_MIRROR_URL)"
    - name: GOALS
      value:
        - "run"
        - "sonar"
23 https://docs.sonarqube.org/latest/user-guide/quality-gates/
Similarly, for the PetBattle UI using nodejs, we can configure the client to call
SonarQube as part of its Tekton pipeline. Once these steps have successfully run, we
can explore the SonarQube Web UI and drill down into any areas to find out more
information.
In a bit of recent development for A/B testing support in the PetBattle UI, some code
bugs seem to have crept in! Developers can drill down and see exactly what the
issues are and remediate them in the code base. SonarQube ranks issues based on the
severity defined in the Language Quality Profile, which can be altered to suit your
development code quality needs.
SonarQube also reports on the last run's code test coverage. On the code base
side, you generate coverage reports using the LCOV24 format: in Java, this is done by
JaCoCo25 and in JavaScript, the coverage reports are produced by the mocha/jasmine
modules. These reports are uploaded into SonarQube and give the team visibility into
which parts of their code base need more testing. A nice way to view this information
is the heatmap, which visualizes the bits of code that have near 100% coverage
(green), down to areas that are not covered at all (0%, red). The statistics are also
reported: the percentage coverage overall, the number of lines covered, and so on.
24 https://github.com/linux-test-project/lcov
25 https://www.eclemma.org/jacoco/
The last plugin we use for our Java applications is the OWASP Dependency-Check
plugin.26 We move security checking "left" in our pipeline. In other words, we want to
discover early in the development process when security vulnerabilities or CVEs are
creeping into our applications' dependencies. By identifying which dependencies are
vulnerable to a CVE early as part of the build cycle, developers are in a much better
position to update them, rather than finding there are issues once our applications are
deployed.
The plugin sources data from multiple open source resources including the US National
Vulnerability Database27 and Sonatype OSS Index.28 In conjunction with security team
members, developers can verify known vulnerabilities and suppress any false positives
using a configuration file. The report is very detailed and includes links to these sites to
assist CVE identification and reporting.
26 https://github.com/dependency-check/dependency-check-sonar-plugin
27 https://nvd.nist.gov/
28 https://ossindex.sonatype.org/
29 https://github.com/rakyll/hey
30 https://httpd.apache.org/docs/2.4/programs/ab.html
We like hey because it is small, fast, written in Golang, and reports statistics in a format
we can easily understand. A very simple invocation of hey on the command line calls
the PetBattle API and lists all of the pets. We pass in some parameters that represent:
• -c: Number of workers to run concurrently
• -n: Number of requests to run
• -t: Timeout for each request in seconds
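For example, an illustrative invocation against a hypothetical PetBattle API route might
be (the hostname is a placeholder, the /cats endpoint is the one shown earlier):
$ hey -c 5 -n 100 -t 30 http://pet-battle-api-labs-test.apps.example.com/cats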
hey reports summary statistics and, the bit we love, a histogram of latency distribution,
the HTTP status code distribution, as well as DNS timing details. This is super rich
information. Histograms are graphs that display the distribution of the continuous
response latency data. A histogram reveals properties about the response times that
the summary statistics cannot. In statistics, summary data is used to describe the
complete dataset: minimum, maximum, and mean, for example. Hey gives us these
summary statistics at the top of the output.
The graph brings the data to life as we can start to understand the distribution of the
latency response over the time the test ran. Over the 4.2 seconds it took to send the
100 requests, we can see that most of the data is clustered around the 0.4-second
mark, which is nearly 50% of all traffic. Often, in service performance design, we are
interested in the 95th or 99th percentile. That is, for all of the sample data, what the
response latency is for 95% (or 99%) of the traffic. In this test run, it is measured at
0.57 seconds; in other words, 95% of the data was at or below this mark.
The shape of the histogram is also important. Where are the response latencies
grouped? We can easily see if the response times are distributed evenly around the
mean (Gaussian) or if they have a longer or shorter tail. This can help us characterize
the performance of the service under various loads. There are many types of load
profiles you could use, for example, burst loads where we throw a lot of instantaneous
traffic at our API, compared to more long-lived soak tests under a lower load. You might
even have known loads from similar applications in production already. A great open
source tool for designing these types of test loads, which can model threading and
ramping really well, is Apache JMeter31 and we highly recommend it as a tool to have in
your toolbox. To keep things simple, we won't cover that tool here.
31 https://jmeter.apache.org/
The two diagrams shown in Figure 15.19 display simple load tests. The one on the left is
a burst type of test: 300 concurrent users making 900 calls to our PetBattle API. We
can see the 95th percentile is 15.6 seconds; this is quite a long time for users to wait for
their cats! The one on the right is a soak test: 50 concurrent users making 10,000 calls
to our PetBattle API. A very different set of statistics: a test duration of 461 seconds,
and a 95th percentile of 2.8 seconds, which is much better from an end user's perspective.
At this point, it is important to think about what the test is actually doing and how it
relates to the PetBattle application suite in general. If we think about it, the test may
not be totally indicative of the current user interface behavior. For example, we do not
perform a call to return all of the images in our MongoDB at once but rather page the
results. And there are of course other API endpoints to test, for example, the topcats
API, which returns the top three most popular pets and is called every time you visit
the home page. We are returning the test dataset we have loaded into PetBattle, that is,
around 15 pet images, so it is not a massive amount of data. It's important to always step
back and understand this wider context when we run performance tests so we don't
end up testing the wrong thing!
Figure 15.19: Burst and soak tests against the PetBattle API
Nonetheless, this is good data to ponder. A good result is that both the soak and burst
tests only returned HTTP 200 response statuses – there were no error responses from
the API. That gives us confidence that we have not broken anything or reached any
internal system limits yet. We can also examine the details to make sure DNS resolution
is not causing issues from the client-calling perspective.
Now we are familiar with the client or calling side of performance testing, let's switch to
the PetBattle API application running on the server side. If we browse to the Developer
view and select the pet-battle-api pod in the labs-test namespace, we can see some
important server-side information:
• The PetBattle API is autoscaled to two pods.
• Monitoring metrics for the pods (check the appendix if you haven't enabled this
for CRC).
As developers, we have configured the PetBattle API application to use the Horizontal
Pod Autoscaler (HPA). This specifies how the OpenShift Container Platform can
automatically increase or decrease the scale of a replication controller or deployment
configuration (that is, the number of running pods) based on the metrics collected from
the pods that belong to our application.
In our PetBattle API Helm chart, we specified the HPA with configurable values for
minimum pods, maximum pods, as well as the average CPU and memory targets. Using
hey, we can now test out various scenarios to help us tune the PetBattle API application
under load:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "pet-battle-api.fullname" . }}
  labels:
    {{- include "pet-battle-api.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    {{- if .Values.deploymentConfig }}
    apiVersion: v1
    kind: DeploymentConfig
    {{- else }}
    apiVersion: apps/v1
    kind: Deployment
    {{- end }}
    name: {{ include "pet-battle-api.fullname" . }}
  minReplicas: {{ .Values.replicas.min }}
  maxReplicas: {{ .Values.replicas.max }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageValue
          averageValue: {{ .Values.hpa.cpuTarget }}
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: {{ .Values.hpa.memTarget }}
We initially took a rough guess at these settings in our HPA, for example, min replicas
= 2, max replicas = 6, CPU = 200m, mem = 300Mi, and set the resource limits and
requests in our Deployment appropriately. We always have a minimum of two pods for
high availability reasons. The HPA is configured to scale based on the average memory
and CPU loads. We don't yet understand whether the application is memory- or
CPU-intensive, so we choose to scale based on both of these measurements.
Figure 15.21: PetBattle API HPA in action, scaling pods under load
We use hey to start a burst workload, 400 concurrent requests, and watch the behavior
of the HPA as it starts more pods to keep to the specified memory and CPU averages.
Once the test concludes, the HPA scales our workload back down to the minimum
as the application recovers resources, in this case through Java garbage collection.
OpenShift supports custom metrics for the HPA as well as other types of pod scalers,
for example, the Vertical Pod Autoscaler.32
To conclude this section, we want to point out one more Kubernetes object that the
developer needs in their toolbelt – the Pod Disruption Budget (PDB). Again, using a
Helm chart template for the PDB, we can limit the number of concurrent disruptions
that the PetBattle API application experiences. By setting up a PDB, we can allow for
higher availability while permitting the cluster administrator to manage the life cycle of
the cluster nodes.
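A sketch of what such a PDB template looks like in a Helm chart follows; the helper
names (fullname, selectorLabels) are assumptions in the style of a standard chart rather
than the exact PetBattle templates:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: {{ include "pet-battle-api.fullname" . }}
spec:
  minAvailable: 1
  selector:
    matchLabels:
      {{- include "pet-battle-api.selectorLabels" . | nindent 6 }}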
32 https://docs.openshift.com/container-platform/4.7/nodes/pods/nodes-pods-
using.html
If the cluster is being updated and nodes are being restarted, we want a minimum of
one pet-battle-api pod available at all times:
$ oc get pdb
This ensures a high level of business service for our PetBattle API. We can see
ALLOWED_DISRUPTIONS is set to 1; this is because, at the time, the HPA had scaled the
number of available replicas to 3, and this will change as the number of available
pods changes.
One of the great things about performance testing applications on OpenShift is that all
of the tools are at a developer's fingertips to be able to configure, test, measure, and
tune their applications to achieve high availability and performance when under load.
Each application service is independently scalable, tunable, and deployable, which makes
for a faster and targeted feedback loop when dealing with scale and performance issues.
In the next section, we are going to take a look at what makes a good OpenShift
Kubernetes citizen, automating Kubernetes resource validation as part of our pipeline.
Resource Validation
One aspect of testing that doesn't yet get much thought is the quality of the Kubernetes
resources being deployed on the cluster. For applications to be considered good citizens
on Kubernetes, there are a number of deployment best practices to be followed,
including health checks, resource limits, labels, and so on, and we will go through a
number of these in Chapter 16, Own It. However, we need to validate the resource
definitions being applied to the cluster to ensure a high level of compliance to not only
industry recommendations but also any other resource recommendations that we
see fit to add. This is where Open Policy Agent (OPA)33 and associated tools can come
into play. This enables us to validate resource definitions during a CI pipeline and also
when applying resources to a cluster. OPA by itself is a policy validator and the policies
are written using a language called Rego. Additional OPA tools such as Conftest34
and Gatekeeper35 add a lot of value and governance from a usability and deployment
perspective. OPA is also embeddable into other third-party tools such as KubeLinter.36
33 https://www.openpolicyagent.org/
34 https://github.com/open-policy-agent/conftest
35 https://github.com/open-policy-agent/gatekeeper
36 https://github.com/stackrox/kube-linter
37 https://github.com/open-policy-agent/gatekeeper
38 https://github.com/redhat-cop/rego-policies
39 https://www.openshift.com/blog/automate-your-security-practices-and-
policies-on-openshift-with-open-policy-agent
40 https://play.openpolicyagent.org/
We can combine this rule with Conftest and a Helm template to create a way to
statically validate our resources. In the PetBattle frontend code, there is a policy folder
that has a few more policies to check whether all the standard Kubernetes labels41
are set on our generated resources after we run the helm template command. By
running a few commands, we can verify these are in place. First, we template our chart
to produce the Kubernetes resources we will apply in deploying our software, and
secondly, we tell Conftest to check each file generated against the rule:
# from the pet battle front end repository (https://github.com/petbattle/pet-battle.git)
$ for file in $(ls policy/helm-output/pet-battle/templates/); do \
    conftest test policy/helm-output/pet-battle/templates/$file; done
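The loop above assumes the chart has already been rendered into the
policy/helm-output folder. A sketch of that templating step, with the output path
inferred from the repository layout shown above:
$ helm template pet-battle chart --output-dir policy/helm-output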
When executing the rules from the command line, we get a good insight into what's
missing from our chart. Of course, we could just assume that we'd always make our
charts adhere to the best practices, but the jenkins-agent-helm image also has the
Conftest binary, so we can execute the preceding statements in our Jenkins pipeline
too. This example might seem simple but, hopefully, it gives you some idea of the things
that can be automated and tested that might seem less obvious.
41 https://kubernetes.io/docs/concepts/overview/working-with-objects/common-
labels/#labels
Image Scanning
Red Hat provides the Quay Container Security Operator in OpenShift to bring Quay
and Clair image scanning and vulnerability information into our OpenShift cluster. Any
container image that is hosted on Quay.io is scanned by Clair.
Any image vulnerability data is exposed back in the OpenShift Web UI so that users
and administrators can easily view which images are considered vulnerable and which
namespace they are deployed to.
With this operator deployed, the OpenShift overview status displays image vulnerability
data, which an operator can drill into to find out the status of container images running
on the platform. For PetBattle, we don't have any enforcement for image vulnerabilities
discovered in our cluster. If we wanted to move the security scanner "left" in our
deployment pipeline, there are some great open source scanning tools available on the
OpenSCAP website.42
Linting
A linter is a static code analysis tool that can check a code base for common pitfalls
in design or stylistic errors. This does not check the compiled application, but the
structure of the application. This is super important for languages that are not
compiled, such as JavaScript. Browsers can interpret JavaScript in different ways so
consistency is super critical.
If you think about a large enterprise application, there could be hundreds of developers
working on the one code base. These developers could even be globally distributed
with different teams looking after different parts of the application's life cycle.
Having consistency in the approach to writing the software can dramatically improve
maintenance costs. JavaScript is very flexible in how you can write it, whether this is
from a functional programming standpoint or object-oriented, so it is important to get
this consistency right.
The PetBattle frontend uses TSLint/ESLint43 to check that the style of the code adheres
to a standard set of rules. These rules can be manipulated by the team, but they are
checked into Git, so if someone were to disable or manipulate them, it would be
noticed. Our Jenkins pipeline is configured to automatically check the code base using
the npm run lint command, and our build will fail if a developer does not adhere to the
standard.
42 https://www.open-scap.org
43 https://eslint.org/
Figure 15.24: Linting PetBattle's frontend locally scans both the JavaScript and HTML
For Java Quarkus apps, Checkstyle44 is used to analyze the code base.
For Kubernetes resources, the aforementioned Open Policy Agent can assist, and Helm
also has the helm lint45 command to validate your charts.
Code Coverage
So, you've written a load of tests and you think things are going great – but how do
you know your tests are any good and covering all parts of the code base? Allow me to
introduce code coverage metrics! A code coverage reporter is a piece of software that
runs alongside your unit test suites to see what lines of code are executed by the tests
and how many times. Coverage reports can also highlight when if/else control flows
within an application are not being tested. This insight can provide valuable feedback as
to areas of a system that remain untested and ultimately reduce the number of bugs.
44 https://checkstyle.sourceforge.io/
45 https://helm.sh/docs/helm/helm_lint/
Our PetBattle frontend is configured to run a coverage report when our Jest tests
execute. Jest makes generating the report very simple as it has a flag that can be passed
to the test runner to collect the coverage for us. The coverage report is run on every
execution of the build and so should be reported through Jenkins.
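To try this locally, the coverage flag can be passed straight through to Jest; the script
name here is the typical default and may differ in the repository:
$ npm run test -- --coverage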
Figure 15.25: Code coverage report from the frontend unit tests locally
When executing our tests in the Jenkins pipeline, we have configured Jest to produce an
HTML report that can be reported by Jenkins on the jobs page. For any build execution,
the report is added to the jobs home page. The report will allow us to discover what
lines are being missed by our tests. Being able to drill into a report like this can give a
good insight into where our testing is lacking.
So, what should I do with these results? Historically, we have focused our attention
where coverage is low. It can serve as a great talking point to bring up in a
Retrospective. Printing out the reports and discussing them as a team is a great way to
assess why the team is struggling to write enough tests. Sometimes teams are
overwhelmed with pressure to churn out features, so testing can fall by the wayside.
Having a coverage reporter in your build can help keep a team honest. You could even
set thresholds so that if testing coverage falls below a certain percentage (some teams
aim for 80% and above), the build will fail, thus blocking the pipeline until the quality is
increased.
Let's look at some other ways we can visualize risks during continuous delivery.
46 https://www.zaproxy.org/
47 https://github.com/redhat-cop/containers-quickstarts/tree/master/jenkins-agents
In doing so, the web report that's created can be viewed in Jenkins, which gives great
details on the cause of the security vulnerability as well as any action that should be
taken to remedy it.
In the final non-functional testing section, let's have a look at deliberately breaking our
code using a technique called chaos engineering.
Chaos Engineering
Chaos engineering is the process of deliberately breaking, hobbling, or impacting a
system to see how it performs and whether it recovers in the ensuing "chaos." While
most testing is seen as an endeavor to understand how a system performs in a known,
stable state, chaos engineering is the computing equivalent of setting a bull free in a
fine-china shop—you know it's going to end badly but you just don't know exactly the
magnitude of how bad it's going to be.
The purpose of chaos engineering is to build confidence in the resiliency of the system.
It also allows you to better understand where breakage points occur and the blast
radius of any failures. There are many resilience features built into the Kubernetes
API specification. Pod replicas are probably the simplest mechanism, having more
than one of your applications running at any given time. It is also desirable to use
application‑specific mechanisms such as circuit breakers, which prevent failures from
spreading throughout your system. Chaos engineering takes these ideas one step
further and tests a system when one or more components fully or partially fail, such as
when CPU or memory resources are low.
The basic premise is that the system under test is observed in a stable working state,
then a fault is injected. The system is then observed to see if it recovers successfully
from the fault or not. Outcomes from such testing are a potential list of areas to tune/
fix as well as an understanding of the Mean Time to Recovery (MTTR) of a system. It's
important to note that chaos engineering is focused on the system as a whole—both
application and infrastructure performance need to be considered and tested.
One of the key mantras behind chaos engineering is contained in its defining
principles48 – The need to identify weaknesses before they manifest in system-wide,
aberrant behaviors.
This is one of the most important aspects to be considered when adopting
this approach. You don't want to be learning about weaknesses during a
production-impacting incident. It's similar to the rationale behind regularly testing
disaster recovery plans. To paraphrase a colleague of ours here at Red Hat:
"When the excrement hits the fan, the first thing to do is turn off the fan!" Not much time
for learning there.
There are a number of tools and frameworks that can help with setting up a chaos
engineering practice. Here's some to get started with (though there are others):
• Litmus Chaos49
• Kraken50
• Chaos Mesh51
48 https://principlesofchaos.org/
49 https://litmuschaos.io/
50 https://github.com/cloud-bulldozer/kraken
51 https://chaos-mesh.org/
In a world where practices such as everything-as-code and GitOps are our only way to
build software and the systems that support them, a great way to validate the ability
to respond to missing items is to redeploy everything, including your infrastructure,
from scratch every week or every night! This might seem extreme, but it's a great way
to validate that there is no hidden magic that someone has forgotten to write down or
codify.
So, what happens when some silly person accidentally deletes it? After
the initial shock and panic, the team pulled the Andon Cord. We quickly
stormed together to see what exactly had happened in order to plan how
we could recover not just Git but all the things we'd added to the cluster.
Luckily for us, everything we had done was stored in Git so we were able to
redeploy our tools and push our local, distributed copies of the software and
infrastructure back into the shared Git repository.
The team was cross-functional and had all the tools and access we needed
to be able to respond to this. Within 1 hour, we had fully restored all our
applications and tools with all of our automation running smoothly again.
I think the real power in this example is how, given the right equipment and
the right ownership, an empowered team can have it all. We acted as one
unit fixing things at lightning speed. We were not stuck waiting in a queue
or having to raise a ticket on another team to restore our infrastructure. We
could do it for ourselves within minutes – not days or weeks later.
Another thing I learned was not to keep Git in the same project as the
other tools in case another person like me comes along. I also learned to be
mindful of the permissions we have within a cluster. As an administrator,
I was able to remove things that perhaps I should not have been
playing with.
So we've written the code, tested it, quality-checked it, and even scanned it for
vulnerabilities. Now it's time to deploy it onto the cluster. Let's explore one of the key
areas of benefit of using Kubernetes: the different ways you can deploy applications
depending on your needs, and how to perform user-driven experiments to determine
which features your users prefer.
Advanced Deployments
The time between software being written and tested till it is deployed in production
should be as short as possible. That way your organization is able to realize value from
the software changes as quickly as possible. The modern approach to this problem is, of
course, through automation. There are simply too many details and configuration items
that need to be changed when deploying to production that even for a small application
suite like PetBattle, manual deployment becomes error-prone and tedious. This drive
to reduce manual toil is at the heart of many of the DevOps practices we have been
discovering in this book.
We can minimize the downtime (ideally to zero!) during software deployment changes
by adopting the right application architecture and combining that with the many
platform capabilities that OpenShift offers. Let's look at some common deployment
strategies that OpenShift supports:
• Rolling deployment:
  • Spin up a pod of the new version and then spin down a pod of the existing
    old version automatically. Very useful for a zero-downtime approach.
• Canary deployment:
  • Spin up a single pod of the new version, perform testing to ensure that
    everything is working correctly, and then replace all the old pods with new
    ones.
• Blue/Green deployment:
  • Create a parallel deployment and verify that everything is working correctly
    before switching traffic over.
  • Service Mesh traffic mirroring functionality can be useful with this approach
    to validate that the new version is working as expected.
• Recreate deployment:
  • Basically, scale the existing pods down to zero and then spin up the new
    version.
  • Use where an application must be restarted, for example, to migrate
    database schema or tables.
  • Think of this as a Ripley deployment: "take off and nuke the entire site from
    orbit. It's the only way to be sure."52
We can roll back to previous deployment versions using the Helm chart life cycle or
the out-of-the-box oc rollback support. Images and configuration are versioned and
cached in OpenShift to easily support rolling back to previous versions.
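For example, assuming a Helm release named pet-battle-api (illustrative), a rollback
might look like the following, where oc rollback applies only to DeploymentConfig-based
deployments:
$ helm history pet-battle-api
$ helm rollback pet-battle-api 1
$ oc rollback pet-battle-api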
52 https://en.wikiquote.org/wiki/Aliens_(film)
A/B Testing
A/B testing an application is an amazing way to test or validate a new feature in
production. The process is pretty simple: you deploy two (or more) different versions of
your application to production, measure some aspect, and see which version performs
better. Given that A/B testing is primarily a mechanism of gauging user experience,
better depends on what aspect/feature you're experimenting with. For example, you
could make a subtle change to a web page layout and measure how long it takes for
the user to navigate to some button or how long the user continues to interact with
specific items on the page.
It's a brilliant way to de-risk a new release or validate some new business or UI features
with a smaller audience before releasing to a wider group. User behavior can be
captured and experiments can be run to make informed decisions about what direction
a product should take.
The Experiment
Let's cast our minds back to the earlier chapters where we spoke about generating
options. There we spoke about the importance of experiments and our Value Slicing
board included an item for which we could do an A/B test. One experiment that came
up was to assess how users would vote for cats in the competition. Should they just
be able to upvote (with a 👍) or should they be able to downvote (👎) too? We can build
and deploy two versions of our application: one with the ability to both upvote and
downvote, and one with just the ability to upvote. Our experiment is simple: to track
how often people actually use the downvote button, so we can decide whether it's a
feature we need or whether we should focus on building different functionality.
Let's now look at how we could set up a simple experiment to deploy both variants of
the application and route traffic between each deployed instance to generate some data
to help inform our decision-making.
53 https://matomo.org/
So how do we install the Matomo platform? Here comes Helm to the rescue again. We
automated this installation as part of the PetBattle platform by just enabling it in our
Ubiquitous Journey project. It's deployed by default into our labs-ci-cd namespace
from this configuration in ubiquitous-journey/values-tooling.yaml:
# Matomo
- name: matomo
  enabled: true
  source: https://github.com/petbattle/pet-battle-analytics.git
  source_path: charts/matomo
  sync_policy: *sync_policy_true
  destination: labs-ci-cd
  source_ref: main
  ignore_differences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
        - /spec/template/spec/containers/0/image
However, if you want to install the tool without involving ArgoCD, you can simply
clone the repository and install it manually. This chart has been forked from an existing
chart54 to tweak it for easier installation on OpenShift. Specifically, the security
contexts in the MariaDB and Redis dependencies have been disabled so that the
deployment will automatically use the target namespace default service account and
associated Security Context Constraint (SCC) in OpenShift. For COTS software where
repackaging or running as a random UID is not always possible, there are other more
permissive, less secure SCCs such as anyuid. Also, an OpenShift route has been added
to the chart to allow ingress traffic to the application:
$ oc login ...
$ git clone https://github.com/petbattle/pet-battle-analytics.git \
&& cd pet-battle-analytics
$ helm install pba charts/matomo
With the Matomo analytics deployed, we just need to configure the frontend to connect
to it. To do this, just update the config map's matomoUrl in chart/values.yaml in
the frontend to have the tracking code automatically track the site. This will provide
basic site tracking such as the time spent on a page or the number of pages visited.
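As an illustration only, the change amounts to something like the following; the exact key layout in the frontend's chart/values.yaml may differ, and the URL shown is a hypothetical Matomo route:
# frontend chart/values.yaml - sketch only; key names are assumptions
config_map:
  matomoUrl: https://matomo-labs-ci-cd.apps.someinstance.com/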
54 https://gitlab.com/ideaplexus/helm/matomo
652 | Run It
For a more meaningful test, we might want to capture specific user behavior. The
application has been instrumented to report certain events back to the Matomo server,
such as mouse clicks. Whenever a user clicks the button to vote for a cat, it will capture
it and report it in Matomo for us. It's very simple to do this – we just add a one-liner to
the event we want to track:
this.matomoTracker.trackEvent('A/B Tests', 'Voting', voting)
Let's deploy our prod version of the pet-battle application and add the no-down-vote
app as one of the services we'll connect to. Our Helm chart is configured to accept the
name of the service and the weight we want to apply to the experiment feature via
a_b_deploy.svc_name and a_b_deploy.weight. The default is a 50/50 round-robin
split. Let's deploy it with this setup:
# install prod version
$ helm install prod --set image_version=latest chart \
--set a_b_deploy.svc_name=no-down-vote-pet-battle --namespace petbattle
# list pods
$ oc get pods
Navigate to the pet-battle UI and you should see on refreshing that there is a 50/50
chance that you will get the upvote-only version. If you open up incognito mode or
a different browser and try to hit the frontend, you should get the alternative one. A
different browser session is required, as the OpenShift router will by default return you
to the same pod, so you'll always land on the same site version.
654 | Run It
Running oc get routes should show one route with more than one service connected
to it and a 50/50 split: prod-pet-battle(50%),no-down-vote-pet-battle(50%). You
can see the underlying weights (100 each) by running oc get route prod-pet-battle -o
yaml:
# display the routes
$ oc get routes
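The relevant part of that route specification should look something like the following sketch (the host shown is illustrative); with both backends weighted at 100, the router splits traffic evenly between them:
# abridged output of 'oc get route prod-pet-battle -o yaml'
spec:
  host: prod-pet-battle-petbattle.apps.someinstance.com   # illustrative host
  to:
    kind: Service
    name: prod-pet-battle
    weight: 100
  alternateBackends:
    - kind: Service
      name: no-down-vote-pet-battle
      weight: 100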
The weights for the traffic routed to each application can be updated quite easily using
Helm:
# update route weights
$ helm upgrade prod --set image_version=latest chart \
--set a_b_deploy.svc_name=no-down-vote-pet-battle \
--set a_b_deploy.weight=10 --namespace petbattle
Advanced Deployments | 655
Figure 15.32: Matomo showing the number of clicks for UP_VOTE versus DOWN_VOTE
It might take a few minutes for Matomo to render the pie chart. Our simple experiment
shows that more people use the UP_VOTE feature than the DOWN_VOTE feature. By
connecting the A/B test to the data captured in Matomo, we can now make more
informed decisions about the next actions that need to be taken for our product.
This experiment proves how easy it is to set up an A/B test. We can use the OpenShift
platform to dynamically route users to multiple application versions concurrently
deployed while we collect data about what is working well and what is not. There is
some thinking that needs to be put into how we instrument the application to collect
specific data, but the open source tooling available to us makes this easy too!
656 | Run It
Blue/Green deployments
The Blue/Green deployment strategy is one of the fundamental deployment strategies
that every team deploying applications into production should know about. Using this
strategy minimizes the time it takes to perform a deployment cutover by ensuring
you have two versions of the application available during deployment. It is also
advantageous in that you can quickly switch back to the original version of the application
without having to perform a rollback deployment.
The trade-off here is that you need to have enough resources to be able to run two
versions of the application stack you are deploying. If your application has persistent
state, for example, a database or non-shared disk, then the application architecture and
constraints must be able to accommodate the two concurrent versions. This is normally
not an issue for smaller microservices and is one of the benefits of choosing that style
of deployment.
Let's run through Blue/Green deployment using the PetBattle API as the example
application stack. In this case, we are going to deploy two full stacks, that is, both the
application and MongoDB. Let's deploy the blue version of our application:
# install the blue app stack
$ helm upgrade --install pet-battle-api-blue \
petbattle/pet-battle-api --version=1.0.15 \
--namespace petbattle --create-namespace
Now deploy the green application stack. Note that we have a different tagged image
version for this:
# install the green app stack
$ helm upgrade --install pet-battle-api-green \
petbattle/pet-battle-api --version=1.0.15 \
--set image_version=green \
--namespace petbattle
Advanced Deployments | 657
Next, we expose our production URL endpoint as a route that points to the blue service:
# create the production route
$ oc expose service pet-battle-api-blue --name=bluegreen \
--namespace petbattle
Finally, we can switch between the two using the oc patch command:
# switch service to green
$ oc patch route/bluegreen --namespace petbattle -p \
'{"spec":{"to":{"name":"pet-battle-api-green"}}}'
If you browse to the bluegreen route endpoint, you should be able to easily determine
which application stack (blue or green) is serving the traffic.
Even though this is somewhat of a contrived example, you can see the power of
developers being allowed to manipulate the OpenShift routing tier in a self-service
manner. A similar approach could be used to deploy the NSFF feature as an example –
use the Helm chart parameters --set nsff.enabled=true to deploy an NSFF-enabled
version. You can also point both applications to the same database if you want to with
similar manipulation of the Helm chart values.
If you have more complex use cases where you need to worry about long-running
transactions in the original blue stack, that is, you need to drain them, or you have data
stores that need migrating alongside the green rollout, there are several other more
advanced ways of performing Blue/Green deployments. Check out the ArgoCD rollout
capability, which has a ton of advanced features,55 the Knative Blue/Green rollout
capability, or indeed Istio56 for more ideas.
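For instance, Argo Rollouts replaces a standard Deployment with a Rollout resource that manages the Blue/Green switch for you. The following is a minimal sketch (names and image are hypothetical), not something we wired into PetBattle:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: pet-battle-api              # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pet-battle-api
  template:
    metadata:
      labels:
        app: pet-battle-api
    spec:
      containers:
        - name: pet-battle-api
          image: quay.io/petbattle/pet-battle-api:latest   # hypothetical image
  strategy:
    blueGreen:
      activeService: pet-battle-api-blue     # receives production traffic
      previewService: pet-battle-api-green   # receives preview/test traffic
      autoPromotionEnabled: false            # require a manual promotion step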
Deployment previews
We should think of OpenShift as something of a playground that we can use to deploy
our applications at every level, from production all the way down to a developer preview.
Gone are the days when a development team needed to raise a ticket to provision a server and
manually configure it to show off their applications. Building applications in containers
allows us to make shippable applications that can be repeatedly deployed in many
environments. Our automation for PetBattle in Jenkins is configured to run on every
commit. For Jenkins, we're using the multi-branch plugin so anytime a developer
pushes a new feature to a branch, it will automatically scaffold out a new pipeline and
deploy the latest changes for that feature.
When this was discussed in the previous chapter, about sandbox builds, you may have
thought this was overkill and a bit of a waste. Why not just build on a pull request? It's a
valid question to ask and depending on the objective you're trying to achieve, building
on a pull request is probably sufficient. We have used the sandbox builds as another
way to introduce feedback loops.
55 https://argoproj.github.io/argo-rollouts
56 https://github.com/hub-kubernetes/istio-blue-green-deployment
Advanced Deployments | 659
Developers do not exist in isolation; they are surrounded by other members of the
team, including Product Owners and Designers. Our ability to dynamically spin up
a new deployment of a feature from our pipeline means we can connect the coding
efforts to the design team really easily. Developers can get very fast feedback by
sharing a link to the latest changes or the implementation of a new feature with the
design team. This feedback loop can quickly allow subtle changes and revisions to be
made before the engineer loses the context of the piece of work. Creating deployment
previews from every commit also allows a developer to very quickly share two versions
of what an app might look like with a Product Owner while they make their decision
about which to choose.
From our Jenkins pipeline, there is a branch called cool-new-cat. When this is built,
it will push a new version of the app to the dev environment. The change in the app
is subtle for illustrative purposes, but we can see the banner has been changed. With
this new version of the app in the dev environment, we can get some feedback prior to
merging it to master and generating a release candidate.
Figure 15.35: New feature deployed to the sandbox generating a deploy preview to collect feedback
Figure 15.35 shows the sandbox version of the app being deployed along with its
associated route, service, and config map.
660 | Run It
Conclusion
Congratulations! You've just finished the most technologically focused chapter of
this book so far. Please don't go off and think that you have to use each and every
technology and technique that has been mentioned—that's not the point. Investigate,
evaluate, and choose which of these technologies applies to your own use cases and
environment.
Several of the testing practices are part of our technical foundation. Unit testing,
non-functional testing, and measuring code coverage are all critical practices for
helping build quality into our applications and products from the start. We covered
many small but invaluable techniques, such as resource validation, code linting, and
formatting, that help make our code base less of a burden to maintain.
We covered a number of different approaches for deployments, including A/B, Canary,
Blue/Green, and Serverless. These core techniques allow us to deliver applications
more reliably into different environments. We even briefly covered artificial intelligence
for reducing unwanted images uploaded into our PetBattle product. By focusing our
efforts on what happens when things go wrong, we can more easily embrace and
prepare for failures—big and small.
16 Own It
"Annnddd we're live! PetBattle is finally in production, we can crack open the Champagne
and toast our success." But now what? How do we know that the site is doing what we
expect it to do and, more importantly, how will we know when it isn't performing as we
intended? Do we just sit around waiting for customers to complain that the site is down
or that errors are happening? Not exactly a good user experience model—or a good use
of our time.
In this chapter, we will discuss the tools and techniques that can be utilized to monitor
the site and notify us when things start to go wrong so we can react before the entire
site goes down. We will also discuss advanced techniques, such as Operators, that can
help you automate a lot of the day-to-day operations.
Observability
Observability1 is the process of instrumenting software components to assist with
extracting data. This data can then be used to determine how well a system is
functioning and subsequently be used to notify administrators in the event of issues.
1 https://en.wikipedia.org/wiki/Observability
662 | Own It
When it comes to observing the state of our PetBattle applications, there are a number
of aspects to consider:
• How do we know if an application instance is initialized and ready to process
traffic?
• How do we know if the application has failed without crashing, such as by
becoming deadlocked or blocked?
• How do we access the application logs?
• How do we access the application metrics?
• How do we know what version of the application is running?
Let's start by exploring some application health checks.
Probes
Hello, hello?... Is this thing on? In Kubernetes, the health of an application is determined
by a set of software probes that are periodically invoked by the kubelet. A probe is
basically an action invoked by the platform on each Pod that either returns a success
value or a failure value.
Probes can be configured to perform one of the following types of actions:
• Connect to a specific TCP port that the container is listening on. If the port is
open, the probe is considered successful.
• Invoke an HTTP endpoint; the probe is successful if the HTTP response code is
200 or greater but less than 400.
• Shell into a container and execute a command—this may involve checking for a
specific file in the directory. This enables probes to be placed on applications
that don't natively provide health checks out of the box. If the command exits
with a status code of 0, then the probe is successful.
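Only one action is configured per probe; in a Pod specification, the three action types look like the following generic sketches (not PetBattle configuration):
# Option 1: TCP check - succeeds if the port accepts a connection
livenessProbe:
  tcpSocket:
    port: 8080
# Option 2: HTTP check - succeeds on a response code of 200-399
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
# Option 3: command check - succeeds if the command exits with status 0
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]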
If a probe fails a configured number of times, the kubelet managing the Pod will take a
pre-determined action, for example, by removing the Pod from the service or restarting
the Pod.
Kubernetes currently supports three different kinds of probes:
1. Readiness: This decides whether the Pod is ready to process incoming requests.
If the application needs some time to start up, this probe ensures that no traffic is
sent to the Pod until this probe passes. Also, if the probe fails while it's running, the
platform stops sending any traffic to the Pod until the probe once again succeeds.
Readiness probes are key to ensuring a zero-downtime experience for the user
when scaling up or upgrading Pods.
Observability | 663
2. Liveness: This checks to see whether a Pod has a process deadlock or has crashed
without exiting; if so, the platform will kill the Pod.
3. Startup: This is used to prevent the platform from killing a Pod that is initializing
but is slow in starting up. When the startup probe is configured, the readiness and
liveness probes are disabled until the startup probe passes. If the startup probe
never passes, the Pod is eventually killed and restarted.
Most of the time, you will probably only utilize the readiness and liveness probes, unless
you have a container that's very slow in starting up.
In the PetBattle Tournament Service component, the liveness and readiness probes are
configured as follows.
In DeploymentConfig (or the Deployment), the /health/live and /health/ready URLs
are automatically created by the Quarkus framework:
...
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health/live
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 0
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 0
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 10
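The configuration above does not define a startup probe; if the service were slow to initialize, one could be added alongside these, often reusing the liveness endpoint with a more generous failure threshold. A sketch, with assumed thresholds:
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
    scheme: HTTP
  failureThreshold: 30    # allow up to 30 x 10s = 5 minutes to start
  periodSeconds: 10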
Different probes can invoke the same action, but we consider this bad practice.
The semantics of a readiness probe are different from those of a liveness probe. It's
recommended that liveness and readiness probes invoke different endpoints or actions
on the container.
664 | Own It
For example, a readiness probe can invoke an action that verifies whether an
application can accept requests. If during the Pod's lifetime the readiness probe fails,
Kubernetes will stop sending requests to the Pod until the probe is successful again.
A liveness probe is one that verifies whether an application can process a request
successfully; for example, if an application were blocked or accepting a request but
waiting a long time for a database connection to become available, the probe would
fail and Kubernetes would restart the Pod. Think of liveness probes as the Kubernetes
equivalent of the IT Crowd2 way of working.
Domino Effect
One question that we get asked a lot is, should a health check reflect the state of the
application's downstream dependencies as well as the application itself? The absolute,
definitive answer is it depends. Most of the time, a health check should only focus on
the application, but there are always scenarios where this isn't the case.
If your health check functionality does a deep check of downstream systems, this
can be expensive and result in cascading failures, where a downstream system has an
issue and an upstream Pod is restarted due to this downstream issue. Some legacy
downstream systems may not have health checks, and a more appropriate approach
in this scenario is to add resilience and fault tolerance to your application and
architecture.
Fault Tolerance
A key aspect of this is to utilize a circuit breaker pattern when invoking dependencies.
Circuit breakers can short circuit the invocation of downstream systems when they
detect that previous calls have failed. This can give the downstream system time to
recover or restart without having to process incoming traffic.
The basic premise of a circuit breaker is that in the case of the failure of a downstream
system, the upstream system should just assume that the next request will fail and not
send it. It potentially also takes appropriate actions for recovery by, say, returning a
default value.
After a given period of time, known as the backoff period, the upstream system should
try sending a request to the downstream system, and if that succeeds, it reverts to
normal processing. The rationale behind the backoff period is to avoid a situation
where the upstream system overwhelms the downstream system with requests as
soon as it starts back up.
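Circuit breaking can be implemented inside the application, for example with the SmallRye Fault Tolerance extension in Quarkus referenced below, or declaratively at the platform level. As a purely illustrative sketch of the latter, an Istio DestinationRule (covered later in the Service Mesh section) can eject a failing destination and only retry it after a backoff period; the values below are assumptions, not PetBattle configuration:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: pet-battle-api-circuit-breaker   # hypothetical name
spec:
  host: pet-battle-api
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # open the circuit after 5 consecutive errors
      interval: 30s              # how often destinations are evaluated
      baseEjectionTime: 60s      # backoff period before the destination is tried again
      maxEjectionPercent: 100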
2 https://www.quotes.net/mquote/901983
Observability | 665
Logging
Ahh, logging! No true developer4 has earned their stripes until they've spent countless
hours of their existence trawling through production logs trying to figure out exactly
what went wrong when a user clicked "Confirm". If you have managed to do this across
multiple log files, all hosted on separate systems via multiple terminal windows, then
you are truly righteous in the eyes of the IDE-bound masses.
The good news is that application logging on Kubernetes is a first-class citizen on
the platform—just configure your application to write its logs to STDOUT and the
platform will pick them up so you can view/trawl through them. OpenShift goes one level
deeper by shipping an aggregated logging stack with EFK (Elasticsearch, Fluentd, and
Kibana) out of the box. This allows developers to search and view logs across multiple
containers running on multiple nodes across the cluster. If you want to give this a try,
follow the documentation at https://docs.openshift.com/container-platform/4.7/
logging/cluster-logging-deploying.html.
3 https://quarkus.io/guides/smallrye-fault-tolerance
4 https://en.wikipedia.org/wiki/No_true_Scotsman
666 | Own It
Tracing
So, first things first: no, tracing is not application logging running at the trace log level.
When it comes to OpenShift, tracing is the functionality added to the Kubernetes
platform that enables developers to trace a request across a distributed set of
application components running in different containers on different nodes of a cluster.
Tracing is an exceptionally useful tool for determining and visualizing inter-service/
component dependencies and performance/latency black holes throughout a
distributed system.
Tracing is provided as part of the OpenShift service mesh component. The underlying
tracing functionality is provided by the Jaeger5 distributed tracing platform. To support
tracing, applications must include a client library that sends instrumented request
metadata to a Jaeger collector, which in turn processes it and stores the data. This data
can then be queried to help visualize the end-to-end request workflow. The Jaeger
client libraries are language-specific and utilize the vendor-neutral OpenTracing
specification.
If you're thinking, "Woah! Collecting metadata for every request would be very expensive
to store and process," you'd be right. Jaeger can do this, but for scale purposes it's better
to record and process a sample of requests, rather than each and every one of them.
Metrics
Probes are useful for telling when an application is ready to accept traffic, or whether
it is stuck. Tracing is great at providing a measure of latency throughout a distributed
system, while logging is a great tool to retrospectively understand exactly what
occurred and when it occurred.
However, to comprehend the deep state (no, not that deep state!) of a system and
potentially predict its future state after a period of time, you need to measure some of
the key quantitative characteristics of the system and visualize/compare them over a
period of time.
The good news is that metrics are relatively easy to obtain; you can get them from
infrastructure components and software components such as the JVM, and you can also
add domain-specific/custom metrics to your application.
5 https://www.jaegertracing.io/
Observability | 667
Given the multitude of metrics available, the hard bit is figuring out which metrics are
valuable to your role and need to be retained. For example, for an application operator,
connection pool counts and JVM garbage collection pause times are invaluable. For a
Kubernetes platform operator, JVM garbage collection pause times are less critical, but
metrics from platform components, such as etcd-related metrics, are crucial.
The good news is that OpenShift provides metrics for both the cluster and the
applications running on it. In this section, we're going to focus on the application-level
perspective. In the Kubernetes community, the de facto approach is to use Prometheus6
for gathering and Grafana7 for the visualization of metrics. This doesn't mean that you
can't use other metrics solutions, and there are some very good ones out there with
additional features.
OpenShift ships with both Prometheus and Grafana as the default metrics stack.
Additionally, it also ships with the Prometheus Alertmanager. The Alertmanager
facilitates the sending of notifications to operators when metric values indicate that
something is going or has gone wrong and la merde has or is about to hit the fan.
Examples of this include a high number of threads or large JVM garbage collection
pause times.
Great, so how do we enable this for PetBattle? It is relatively straightforward:
1. Use a metrics framework in your application that records metrics and exposes the
metrics to Prometheus.
2. Configure Prometheus to retrieve the metrics from the application.
3. Visualize the metrics in OpenShift.
Once the metrics are being retrieved, the final step is to configure an alert using the
Prometheus Alertmanager.
6 https://prometheus.io/
7 https://grafana.com/oss/
8 https://micrometer.io/
668 | Own It
To add this to the Tournament service, we simply need to add the Micrometer dependency
to the Maven POM along with the Prometheus registry dependency. For example:
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-micrometer</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Then, we configure a Prometheus registry, which is used to store the metrics locally in
the application before being retrieved by the Prometheus collector. This is done in the
src/main/resources/application.properties file.
# Metrics
quarkus.micrometer.enabled=true
quarkus.micrometer.registry-enabled-default=true
quarkus.micrometer.binder-enabled-default=true
quarkus.micrometer.binder.jvm=true
quarkus.micrometer.binder.system=true
quarkus.micrometer.export.prometheus.path=/metrics
With this configuration, the Prometheus endpoint is exposed by the application Pod.
Let's go ahead and test it:
# grab the pod name for the running tournament service
$ oc get pod -n petbattle | grep tournament
$ oc exec YOUR_TOURNAMENT_PODNAME -- curl localhost:8080/metrics
...
# HELP mongodb_driver_pool_size the current size of the connection pool, including idle and and in-use members
# TYPE mongodb_driver_pool_size gauge
mongodb_driver_pool_size{cluster_id="5fce8815a685d63c216022d5",server_address="my-mongodb:27017",} 0.0
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{method="GET",outcome="SUCCESS",status="200",uri="/openapi",} 1.0
http_server_requests_seconds_sum{method="GET",outcome="SUCCESS",status="200",uri="/openapi",} 0.176731581
http_server_requests_seconds_count{method="GET",outcome="CLIENT_ERROR",status="404",uri="NOT_FOUND",} 3.0
http_server_requests_seconds_sum{method="GET",outcome="CLIENT_ERROR",status="404",uri="NOT_FOUND",} 0.089066563
Observability | 669
http_server_requests_seconds_count{method="GET",outcome="SUCCESS",status="200",uri="/metrics",} 100.0
# HELP http_server_requests_seconds_max
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{method="GET",outcome="SUCCESS",status="200",uri="/openapi",} 0.176731581
http_server_requests_seconds_max{method="GET",outcome="CLIENT_ERROR",status="404",uri="NOT_FOUND",} 0.0
...
If successful, you should get an output similar to the above. Notice that you're not just
getting the application-level metrics—the MongoDB connection pool metrics are also
there. These are automatically added by the Quarkus framework once configured in the
application.properties file.
9 https://github.com/prometheus-operator/prometheus-operator
670 | Own It
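The configuration being discussed here is a Prometheus ServiceMonitor, which tells the Prometheus Operator which Services to scrape. A minimal sketch consistent with the values mentioned below (the resource name is illustrative) looks like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pet-battle-tournament-monitor   # illustrative name
spec:
  endpoints:
    - interval: 30s      # scrape every 30 seconds
      port: tcp-8080     # must match the port *name* in the Service
      scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/component: pet-battle-tournament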
There are a few things to note that might save you some time when trying to
understand this configuration: basically, this configuration will scrape associated
components every 30 seconds using the default HTTP path /metrics. Now, port:
tcp-8080 is mapped to the port name in the service—see below, highlighted in bold. If
the service had a port name of web, then the configuration would be port: web.
$ oc describe svc my-pet-battle-tournament
Name: my-pet-battle-tournament
Namespace: pet-battle-tournament
Labels: app.kubernetes.io/component=pet-battle-tournament
app.kubernetes.io/instance=my
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=pet-battle-tournament
app.kubernetes.io/version=1.0.0
deploymentconfig=my-pet-battle-tournament
helm.sh/chart=pet-battle-tournament-1.0.0
Annotations: Selector: app.kubernetes.io/component=pet-battle-
tournament,app.kubernetes.io/instance=my,app.kubernetes.io/name=pet-battle-
tournament,deploymentconfig=my-pet-battle-tournament
Type: ClusterIP
IP: 172.30.228.67
Port: tcp-8080 8080/TCP
TargetPort: 8080/TCP
Endpoints: 10.131.0.28:8080
Port: tcp-8443 8443/TCP
TargetPort: 8443/TCP
Endpoints: 10.131.0.28:8443
Session Affinity: None
Events: <none>
10 https://docs.openshift.com/container-platform/4.7/monitoring/enabling-moni-
toring-for-user-defined-projects.html
Observability | 671
When using a label selector like the one below, you need to make sure that the approach to
scraping the metrics is consistent across all of the selected components; that they are
all using the tcp-8080 port, for example.
selector:
  matchLabels:
    app.kubernetes.io/component: pet-battle-tournament
We're big fans of labels and their associated selectors. They're very powerful as a
method of grouping components: Pods, Services, and so on. It's one of those hidden
gems that you wish you knew of earlier.
Let's explore some of the built-in dashboards OpenShift provides for monitoring.
672 | Own It
The good news is that all of this is done via Helm charts, so you just have to do the
following:
$ oc get routes
...
grafana-route grafana-route-pb-noc.apps.someinstance.com
Open grafana-route in a browser, log in, et voilà! It should look something like that
shown in Figure 16.2. If there is an error with no data, check that the BEARER_TOKEN is in
place. This can be fixed manually by running the commands at https://github.com/
petbattle/pet-battle-infra/blob/main/templates/insert-bearer-token-hook.yaml#L80
We will now take a look at some of the tools that can help us further with observability.
Metadata and Traceability | 673
Labels
As mentioned earlier, labels and label selectors are among the more powerful metadata
management features of Kubernetes. At its core, labels are a collection of text-
based key/value pairs that can be attached to one or more objects: Pods, services,
Deployments, and so on. Labels are intended to add information/semantics to objects
that are relevant to the user and not the core Kubernetes system. A label selector is a
method by which a user can group items together that have the same labels.
One of the most common uses of labels and label selectors in Kubernetes is the way
that services use label selectors to group related Pods as endpoints for the service.
It's probably better shown by way of an example.
So, let's start with our three Infinispan Pods. Given that the Infinispan operator deploys
its Pods via StatefulSets, the Pod names are pretty straightforward: infinispan-0,
infinispan-1, infinispan-2. Take note of the labels attached to the Pods (highlighted
in bold).
$ oc get pods --show-labels=true
When the Tournament service wants to connect to one of these Infinispan pods, it uses
the Infinispan service that is also created and managed by the operator.
$ oc get svc
If we go into the definition of the service, we'll see the selector (highlighted in bold):
$ oc describe service infinispan
Name: infinispan
Namespace: pet-battle-tournament
Labels: app=infinispan-service
clusterName=infinispan
infinispan_cr=infinispan
Annotations: service.alpha.openshift.io/serving-cert-signed-by:
openshift-service-serving-signer@1607294893
service.beta.openshift.io/serving-cert-secret-name:
infinispan-cert-secret
service.beta.openshift.io/serving-cert-signed-by:
openshift-service-serving-signer@1607294893
Selector: app=infinispan-pod,clusterName=infinispan
Type: ClusterIP
IP: 172.30.154.122
Port: infinispan 11222/TCP
TargetPort: 11222/TCP
Endpoints: 10.128.2.158:11222,10.129.3.145:11222,10.131.0.25:11222
Session Affinity: None
Events: <none>
11 https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
Metadata and Traceability | 675
Great, so now what? What information could you use to label a resource? It depends on
what your needs are. As demonstrated previously in the monitoring section, labels and
selectors can be useful in configuring Prometheus. Labels can also be useful in grouping
components together, as in the components that comprise a distributed application.
Kubernetes has a set of recommended labels12 that we've used when building and
deploying the PetBattle application:
• app.kubernetes.io/name: The name of the application (for example, pet-battle-tournament-service)
• app.kubernetes.io/instance: A unique name identifying the instance of an application (for example, petbattle)
12 https://kubernetes.io/docs/concepts/overview/working-with-objects/com-
mon-labels/
676 | Own It
With these labels in place, it is possible to retrieve and view the components of the
application using selectors, such as to show the component parts of the PetBattle
application without the supporting application infrastructure, that is, Infinispan or
Keycloak. The following command demonstrates this:
$ oc get all -l app.kubernetes.io/part-of=petbattleworld \
--server-print=false
NAME AGE
replicationcontroller/dabook-mongodb-1 2d18h
replicationcontroller/dabook-pet-battle-tournament-1 26m
NAME AGE
service/dabook-mongodb 2d18h
service/dabook-pet-battle-tournament 2d18h
NAME AGE
deploymentconfig.apps.openshift.io/dabook-mongodb 2d18h
deploymentconfig.apps.openshift.io/dabook-pet-battle-tournament 26m
NAME AGE
imagestream.image.openshift.io/dabook-pet-battle-tournament 26m
NAME AGE
route.route.openshift.io/dabook-pet-battle-tournament 2d18h
Let's look at other mechanisms we can use to enhance traceability.
Software Traceability
One of the issues that we've observed from customers over the years is the reliance
that people have on the name of the software artifact that they're putting into
production, such as super-important-app-1.2.99.0.bin or critical-service-
1.2.jar. While this works 99.9% of the time, occasionally we've noticed issues where
an incorrect version has been deployed with interesting outcomes.
In the land of containers, your deployment is a versioned artifact that contains a
version of your software, and this in turn may be deployed using a versioned Helm
chart via a GitOps approach. A good build and deployment pipeline will ensure that
these levels of artifact versioning will always be consistent and provide traceability. As
a backup, we also add additional traceability to the deployed artifacts as annotations on
the resources and build info logging in the application binary.
Metadata and Traceability | 677
Annotations
Annotations are similar to Kubernetes labels—that is, string-based key/value pairs—
except that they're not used to group or identify objects via selectors. Annotations
can be used to store different types of information; in our case, we're going to use
annotations to store Git information to help with software traceability.
apiVersion: v1
kind: Service
metadata:
  annotations:
    app.openshift.io/vcs-url: https://github.com/petbattle/tournamentservice.git
    app.quarkus.io/commit-id: a01a310aadd46911bc4c66b3a063ddb090a3feba
    app.quarkus.io/vcs-url: https://github.com/petbattle/tournamentservice.git
    app.quarkus.io/build-timestamp: 2020-12-23 - 16:43:07 +0000
    prometheus.io/scrape: "true"
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
The annotations are automatically added as part of the Maven build process using
the Quarkus Maven plugin. Also notice the annotations are used to provide scrape
information for Prometheus, as can be seen highlighted in the preceding code.
Build Information
An approach that has nothing to do with Kubernetes per se, but that we strongly
recommend in general, is to output source control and build information as
part of the application startup. An example of this is embedded into the Tournament
service.
$ java -jar tournament-1.0.0-SNAPSHOT-runner.jar
We use the Maven plugin git-commit-id-plugin to generate a file containing the Git
information and package that file as part of the Java archive (jar). On startup, we simply
read this file and output its contents to the console. Very simple stuff, but very effective
and a lifesaver when needed. When running on OpenShift, this information will be
picked up by the OpenShift logging components.
Alerting
So we have all the metrics to provide us with some insight into how the system is
performing. We've got spectacular graphs and gauges in Grafana but we're hardly going
to sit watching them all day to see if something happens. It's time to add alerting to the
solution.
What Is an Alert?
An alert is an event that is generated when some measurement threshold (observed or
calculated) is about to be or has been breached. The following are some examples of
alerts:
• The average system response time in the last five minutes goes above 100
milliseconds.
• The number of currently active users on the site falls below a certain threshold.
• Application memory usage is approaching its maximum limit.
Alerts usually result in notifications being sent to human operators, whether that is
through an email or instant message, say. Notifications can also be sent to trigger
automation scripts/processes to deal with the alert. Service owners can analyze their
existing alerts to help improve the reliability of their services and systems and reduce
the manual work associated with remediating problems.
Why Alert?
Alerts call for human action when a situation has arisen within the system that cannot
be automatically handled. This may include scenarios where automatic resolution of
the problem is deemed too risky and human intervention is required to help triage,
mitigate, and resolve the issue. Alerting can also become a problem in itself, causing
concern for the site reliability engineers who manage and operate the system, particularly
when alerts are numerous, misleading, or don't really help with root cause analysis, or
when they are benign alerts that don't prompt any action.
Alerting | 679
There are certain qualities that make up a good alert. Alerts should be actionable by the
human beings who respond to them. To be actionable, the alert must also have arrived
in time for something to be done about it and it should be delivered to the correct team
or location for triaging. Alerts can also include helpful metadata such as documentation
links to assist in making triage faster.
Alert Types
We can think of alerts as falling into three broad categories.13 The first are proactive
alerts, meaning that your business service or system is not in danger yet but may be
in trouble after some period of time. A good example of this is where your system
response time is degrading but it is not at a stage where external users would be aware
of the issue yet. Another example is where your disk quota is filling up: it is not 100%
full yet, but it may be in a few days' time.
A reactive alert means your business service or system is in immediate danger. You are
about to breach a service level and immediate action is needed to prevent the breach.
An investigative alert is one where your business service or system is in an unknown
state. For example, it may be suffering a form of partial failure or there may be unusual
errors being generated. Another example may be where an application is restarting too
many times, which is indicative of an unusual crash situation.
Each of these alerts may also be directed to different teams, depending on their
severity. Not all alerts need to be managed with the same level of urgency. For example,
some alerts must be handled by an on-call human resource immediately, while for
others it may be fine to handle them during business hours by an application business
support team the following day. Let's explore how we can easily configure and add
alerting to our applications using the OpenShift platform features to help us out.
13 https://www.oreilly.com/content/reduce-toil-through-better-alerting/
680 | Own It
Managing Alerts
OpenShift has platform monitoring and alerting that supports both built-in platform
components and user workloads. The product documentation is the best place to
start when looking to configure these.14 As we outlined earlier, monitoring and alerting
make use of the Prometheus monitoring stack. This is combined with an open-source
tool called Thanos15 that aggregates and provides access to multiple instances of
Prometheus in our cluster.
A basic configuration for the PetBattle application suite consists of creating two
ConfigMaps for user workload monitoring and alerting. We use ArgoCD and a simple
kustomize YAML configuration to apply these ConfigMaps using GitOps. If we open
up the ubiquitous journey values-day2ops.yaml file, we can create an entry for user
workload monitoring.
# User Workload Monitoring
- name: user-workload-monitoring
  enabled: true
  destination: openshift-monitoring
  source: https://github.com/rht-labs/refactored-adventure.git
  source_path: user-workload-monitoring/base
  source_ref: master
  sync_policy: *sync_policy_true
  no_helm: true
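Under the hood, one of those ConfigMaps is likely just the standard switch for user workload monitoring described in the OpenShift documentation referenced above; a sketch of it looks like the following:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true    # turn on Prometheus for user-defined projects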
The next step is to make use of application metrics and a ServiceMonitor and configure
specific Prometheus alerts for our PetBattle suite.
User-Defined Alerts
In the Metrics section, we created ServiceMonitors for our API and Tournament
applications that allow us to collect the micrometer metrics from our Quarkus
applications. We want to use these metrics to configure our alerts. The simplest
approach is to browse to the Thanos query endpoint that aggregates all of our
Prometheus metrics. You can find this in the openshift-monitoring project.
$ oc get route thanos-querier -n openshift-monitoring
14 https://docs.openshift.com/container-platform/4.7/monitoring/configur-
ing-the-monitoring-stack.html#configuring-the-monitoring-stack
15 https://github.com/thanos-io/thanos
Alerting | 681
We want to create a simple reactive alert based on whether the PetBattle API,
Tournament, and UI Pods are running in a certain project. We can make use of
Kubernetes Pod labels and the Prometheus query language to test whether our Pods
are running.
16 https://github.com/petbattle/ubiquitous-journey/blob/main/applications/alert-
ing/chart/templates/application-alerts.yaml
682 | Own It
spec:
  groups:
    - name: petbattle.rules
      rules:
        - alert: PetBattleApiNotAvailable
          annotations:
            message: 'Pet Battle Api in namespace {{ .Release.Namespace }} is not available for the last 1 minutes.'
          expr: (1 - absent(kube_pod_status_ready{condition="true" ...
          for: 1m
          labels:
            severity: {{ .Values.petbattle.rules.severity }}
The firing alerts can be seen in the OpenShift web console as seen in Figure 16.4. In this
example, we have configured the labs-dev alerts to only have a severity of info because
deployments in that environment are not deemed crucial. The severity may
be set as info, warning, or critical, and we use warning for our labs-test and labs-
staging environments, for example. These are arbitrary but standard severity levels,
and we can use them for routing alerts, which we will cover in a moment.
We can use the same method to create an investigative or proactive alert. This time
we wish to measure the HTTP request time for our API application. During testing, we
found that if API calls took longer than ~1.5 sec, the user experience in the PetBattle
frontend was deemed too slow by end users and there was a chance they would
disengage from using the web application altogether.
In this alert, we use the Prometheus query language and the http_server_requests_
seconds_max metric for the PetBattle API application to test whether the maximum
request time over the last five-minute period exceeded our 1.5 sec threshold. If this
alert starts to fire, possible remediation actions might include manually scaling up the
number of API Pods or perhaps increasing the database resources if that is seen to be
slow for some reason. In future iterations, we may even try to automate the application
scale-up by using a Horizontal Pod Autoscaler, a Kubernetes construct that can scale
our applications automatically based on metrics.
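A hypothetical sketch of such a rule, using the metric named above, might look like this; the exact expression in the PetBattle repository may differ:
- alert: PetBattleApiSlowResponse
  annotations:
    message: 'Pet Battle API HTTP requests are taking longer than 1.5 seconds.'
  # the uri filter below is an assumption, used to exclude 404 noise
  expr: max_over_time(http_server_requests_seconds_max{uri!="NOT_FOUND"}[5m]) > 1.5
  for: 5m
  labels:
    severity: warning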
In this way, we can continue to build on our set of alerting rules for our PetBattle
application suite, modifying them as we run the applications in different environments,
and learn what conditions to look out for while automating as much of the remediation
as we can.
684 | Own It
OpenShift Alertmanager
As we have seen, OpenShift supports three severity levels of alerting: info, warning,
and critical. We can group and route alerts based on their severity as well as on custom
labels—that is, project or application labels. In the OpenShift administrator console,17
you can configure the Alertmanager under Cluster Settings.
Alerts may be grouped and filtered using labels and then routed to specific receivers,
such as PagerDuty, Webhook, Email, or Slack. We can fine-tune the routing rules so
that the correct teams receive the alerts in the correct channel, based on their urgency.
For example, all info and warning severity alerts for the PetBattle UI application may be
routed to the frontend developers Slack channel, whereas all critical alerts are routed to
the on-call PagerDuty endpoint as well as the Slack channel.
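Expressed as raw Alertmanager configuration, the routing scheme described above might look something like the following sketch; the receiver names, channel, and integration key are hypothetical placeholders:
route:
  group_by: ['alertname', 'severity']
  receiver: default
  routes:
    - match_re:
        severity: info|warning
      receiver: frontend-team-slack      # hypothetical receiver
    - match:
        severity: critical
      receiver: on-call-pagerduty        # hypothetical receiver
receivers:
  - name: default
  - name: frontend-team-slack
    slack_configs:
      - channel: '#petbattle-frontend'   # hypothetical channel
  - name: on-call-pagerduty
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'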
17 https://docs.openshift.com/container-platform/4.7/monitoring/managing-alerts.
html
Service Mesh | 685
Service Mesh
Service mesh functionality has been one of the largest additions/extensions to
Kubernetes in its short history. There's a lot of debate around the additional complexity
of using a service mesh and whether all the features are even required.
For the purposes of this book, we're going to focus on the service mesh provided out
of the box within OpenShift, which is based on the open-source Istio project. There
are other implementations, such as Linkerd, SuperGloo, and Traefik, out there that are
excellent and offer similar functionality to Istio.
The OpenShift service mesh provides the following features out of the box:
• Security: Authentication and authorization, mutual TLS (encryption), policies
• Traffic management: Resiliency features, virtual services, policies, fault injection
• Observability: Service metrics, call tracing, access logs
18 https://github.com/App-vNext/Polly
19 https://pypi.org/project/pybreaker/
20 https://nodeshift.dev/opossum/
686 | Own It
The same is true with Mutual TLS (mTLS), which is used for encrypting traffic between
services. Operators such as CertManager or CertUtil can assist with managing and
distributing certificates, but modification of the application code is still required to use
the feature. Service meshes simplify this as the inter-component traffic is sent via a
sidecar proxy and functionality such as mTLS is automagically added to this—once again,
without having to change the application code.
The Istio component of a service mesh also manages TLS certificate generation and
distribution so that it helps reduce the management overhead when using mTLS.
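For example, enforcing mTLS for all workloads in a namespace comes down to a single small resource. The following PeerAuthentication sketch is illustrative and not part of the PetBattle configuration:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: petbattle
spec:
  mtls:
    mode: STRICT    # only accept mutual-TLS traffic from sidecar-enabled workloads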
So how does a service mesh perform all of this magical functionality? Basically, the
service mesh operator adds a service proxy container (based on the Envoy project) to
the application Pod and configures the application traffic to be routed through this
proxy. The proxy registers with the Istio control plane; configuration settings,
certificates, and routing rules are then retrieved and applied to the proxy. The Istio
documentation goes into much more detail.21
21 https://istio.io/latest/docs/
Service Mesh | 687
$ helm upgrade --install pet-battle-tournament \
    petbattle/pet-battle-tournament \
    --namespace petbattle
$ oc get deployment pet-battle-tournament -o yaml \
--namespace petbattle
...
  template:
    metadata:
      annotations:
        ...
        sidecar.istio.io/inject: "true"
      labels:
        app.kubernetes.io/component: pet-battle-tournament
        app.kubernetes.io/instance: pet-battle-tournament
        app.kubernetes.io/name: pet-battle-tournament
It is possible to have more than one sidecar container if required. Each container can
bring different features to the application Pod: for example, one for Istio, another for
log forwarding, another for the retrieval of security credentials, and so on. It's easy
to know when a Pod is running more than a single container; for example, the READY
column indicates how many containers are available per Pod and how many are ready—
that is, its readiness probe has passed.
$ oc get pods
Here Be Dragons!
The adoption of a service mesh isn't a trivial exercise when it comes to complex
solutions with multiple components and development teams. One thing to understand
about a service mesh is that it crosses a lot of team boundaries and responsibilities. It
includes features that are focused on the developer, operations, and security teams; all
of these teams/personnel need to work together to understand and get the best out of
using the features provided by the mesh. If you're just starting out, our advice is to start
small and figure out what features are necessary in production and iterate from there.
688 | Own It
In the case of PetBattle, we decided that we were going to primarily focus on using
some of the features in the areas of traffic management and observability. The rationale
behind this was that Keycloak already addressed many of the security requirements,
and we also wanted to finish the book before the end of the decade.
The good news is that all of these components are installed and managed by Operators,
so installation is reasonably straightforward. These components are installed via Helm,
and if you want to know more about how they are installed, then the Red Hat OpenShift
documentation will have the relevant details.
One key thing to note is that at the time of writing this book, OpenShift Service Mesh
ships with a downstream version of Istio called Maistra. This is primarily due to the
out-of-the-box multi-tenancy nature of OpenShift, as well as limiting the scope of Istio
cluster-scoped resources. OpenShift Service Mesh also ships with an Istio OpenShift
Routing (IOR) component that maps the Istio gateway definitions onto OpenShift
routes. Note that Istio is still the upstream project and bugs/feature requests are fixed/
implemented, as necessary.
For traffic management, Istio has the following core set of resources:
• Gateways: Controls how traffic gets into the mesh from the outside, akin to
OpenShift routes.
• Virtual service: Controls how traffic is routed within the service mesh to a
destination service. This is where functionality such as timeouts, context-based
routing, retries, mirroring, and so on, is configured.
• Destination rule: Defines the service destination where traffic is routed once the
traffic rules have been applied. Destination rules can be configured to control traffic
aspects such as load balancing strategies, connection pools, TLS settings, and outlier
detection (circuit breakers).
There are other resources such as service entry, filters, and workloads, but we're not
going to cover them here.
Service Mesh | 689
Gateways
The gateway resource, as stated earlier, is used to create an ingress route for traffic
coming into the service mesh.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: petbattle-gateway-tls
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: "pb-ingressgateway-certs"
      hosts:
        - "*"
There are a few things to note with this definition. First, it will create an OpenShift route
in the istio-system namespace and not the local namespace. Secondly, the route itself will use
SSL, but it won't be able to utilize the OpenShift router certificates by default. Service
mesh routes have to provide their own certificates. As part of writing this book, we took
the pragmatic approach and copied the OpenShift router certificates into the istio-
system namespace and provided them to the gateway via the pb-ingressgateway-certs
secret. Note that this is for demonstration purposes only—do not try this in production.
The correct approach for production is to generate and manage the PKI using as-a-
service certificates.
690 | Own It
Virtual services
PetBattle contains a number of VirtualServices, such as pet-battle-cats-tls, pet-battle-
main-tls, and pet-battle-tournament-tls.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: pet-battle-cats-tls
spec:
  hosts:
    - "*"
  gateways:
    - petbattle-gateway-tls
  http:
    - match:
        - uri:
            prefix: /cats
          ignoreUriCase: true
      route:
        - destination:
            host: pet-battle-api
            port:
              number: 8080
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: gateway-error,connect-failure,refused-stream
The VirtualServices are all similar in function in that they are all configured to:
1. Match a specific URI; in the example above, /cats.
2. Once matched, route the traffic to a specific destination.
3. Handle specific errors by performing a fixed number of request retries.
Destination Rule
Finally, the traffic is sent to a destination or even distributed to a set of destinations
depending on the configuration. This is where DestinationRules come into play.
Service Mesh | 691
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: pet-battle-api-port
spec:
  host: pet-battle-api.prod.svc.cluster.local
  trafficPolicy: # Apply to all ports
    portLevelSettings:
      - port:
          number: 8080
        loadBalancer:
          simple: LEAST_CONN
In our example, the traffic sent to a specific port is load balanced based on a simple
strategy that selects the Pod with the least number of active requests. There are
many load balancing strategies that can be used here, depending on the needs of the
application—everything from simple round robin to advanced consistent hashing load
balancing strategies, which can be used for session affinity. As ever, the documentation
goes into far greater detail.22
We can visualize the flow of traffic from the above example, as seen in Figure 16.7:
22 https://istio.io/latest/docs/
692 | Own It
Note that Figure 16.7 shows an example of how destination rules can be used to
send traffic to an alternative version of the service. This can be useful for advanced
deployment strategies such as Canary, Blue/Green, and so on. We haven't discussed
how to do this with OpenShift Service Mesh in this book, but the reader is encouraged
to explore this area in more detail. A good place to start is the aforementioned Istio
documentation.
Managing all of these resources is reasonably simple when it's just a few services,
and PetBattle utilizes service mesh functionality in a very basic manner. However,
when there are many services and features, such as multiple destinations used in
advanced deployment models, the amount of settings and YAML to interpret can be
overwhelming. This is where mesh visualization functionality can be useful to visualize
how all of this works together. For this, we use the Kiali functionality, which is part of
OpenShift Service Mesh. Figure 16.8 shows how PetBattle is visualized using Kiali.
Kiali can be very useful for diagnosing the current state of the mesh, as it can
dynamically show where traffic is being sent as well as the state of any circuit breakers
being used. It also integrates with Jaeger for tracing requests across multiple systems.
Kiali can also help prevent configuration issues by semantically validating the deployed
service mesh resources.
Next we're going to explore one of the most powerful features of OpenShift 4 -
Operators.
Operators Everywhere | 693
Operators Everywhere
Fundamental to the OpenShift 4 platform is the concept of Operators. So far, we have
used them without talking about why we need them and what they actually represent
on a Kubernetes platform such as OpenShift. Let's cover this briefly without totally
rewriting the book on the subject.23
At its heart, the Operator is a software pattern that codifies knowledge about the
running and operation of a particular software application. That application could be
a distributed key value store, such as etcd. It might be a web application such as the
OpenShift web console. Fundamentally, the operator can represent any application
domain that could be codified. A good analogy for an operator is the expert system, a
rules-based bit of software that represents knowledge about a certain thing that is put
to work in a meaningful way. If we take a database as an example, the Operator might
codify what a real human database administrator does on a day-to-day basis, such as
the deployment, running, scaling, backup, patching, and upgrading of that database.
The physical runtime for an operator is nothing more than a Kubernetes Pod, that
is, a collection of containers that run on a Kubernetes platform such as OpenShift.
Operators work by extending or adding new APIs to the existing Kubernetes and
OpenShift platform APIs. This new endpoint is called a Custom Resource (CR). CRs are
one of the many extension mechanisms in Kubernetes.
23 https://www.redhat.com/en/resources/oreilly-kubernetes-operators-automa-
tion-ebook
694 | Own It
A Custom Resource Definition (CRD) defines what the CR is. Think of it as the
definition or schema for the CR. The Operator Pod watches for events on the platform
that are related to their custom resources and takes reconciliation actions to achieve
the desired state of the system. When an Operator Pod stops or is deleted from the
cluster, the application(s) that it manages should continue to function. Removing a CRD
from your cluster does affect the application(s) that it manages. In fact, deleting a CRD
will in turn delete its CR instances. This is the Operator pattern.
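To make this concrete, the Infinispan operator used elsewhere in this chapter watches for an Infinispan custom resource; creating one is how we ask the operator to build and manage a cache cluster on our behalf. A small sketch:
apiVersion: infinispan.org/v1
kind: Infinispan      # the CR; its schema is defined by the Infinispan CRD
metadata:
  name: infinispan
  namespace: pet-battle-tournament
spec:
  replicas: 3         # the operator reconciles the StatefulSet to three pods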
With Operators, all of the operational experience required to run/manage a piece
of software can be packaged up and delivered as a set of containers and associated
resources. In fact, the whole of the OpenShift 4 platform exists as a collection
of operators! So, as the platform owner, you are receiving the most advanced
administrator knowledge bundled up through Operators. Even better, Operators can
become more advanced over time as new features and capabilities are added to them. A
good understanding of how to configure Operators is required for OpenShift platform
administrators. This usually involves setting properties in the OpenShift cluster's global configuration (for example, via the web console), setting CR property values, using ConfigMaps, or similar
approaches. The product documentation24 is usually the best place to find out what
these settings are for each Operator.
24 https://docs.openshift.com/container-platform/4.7
Control Loops
In very basic terms, core Kubernetes is just a Key-Value (KV) store—an etcd datastore
with an API. Processes use this API to perform Create, Read, Update, and Delete
(CRUD) actions on keys within the KV store. Processes can also register with the KV
store to be notified when there are value changes to keys or sets of keys that they're
interested in.
When these processes get a change notification, they react to that notification by
performing some activity, such as configuring iptables rules, provisioning storage, and
so on. These processes understand the current state of the system and the desired
state and work toward achieving that desired state. In other words, these processes are
performing the role of a control loop, meaning they attempt to bring the state of the
system to a desired state from where it currently resides.
In this example, the process is a controller that observes the state of a resource or
set of resources and then makes changes to move the resource state closer to the
desired state. As consumers of Kubernetes, we constantly use controllers. For example,
when we instruct Kubernetes to deploy a Pod, the Pod controller works to make that
a reality. Control loops are key to the operation of Kubernetes, which takes a declarative, eventually consistent approach. For much more information, take a look at the
Kubernetes Controller docs27 and the OpenShift blog site for recommendations on how
to build your own Operator.28
Operator Scopes
Operators can either be cluster-scoped or namespace-scoped. A cluster-scoped
operator is installed once in a namespace and can create and manage resources in
other namespaces; that is, cluster-wide. The OpenShift service mesh operator and its
related operators such as Kiali and Jaeger are cluster-scoped. They are installed by
default into the openshift-operators or openshift-operators-redhat namespace
and create and manage resources when a related custom resource is created in another namespace, such as PetBattle.
A namespace-scoped operator is one that is deployed in a namespace and only manages
resources in that namespace. We use a number of these in PetBattle, such as Cert-Utils
and Keycloak.
All Operators are installed via a custom resource called a Subscription. Without going into too
much detail (see the official documentation for more), a Subscription describes how
to retrieve and install an instance of an operator. The following is an example of a
Subscription that we use to install the Grafana operator.
27 https://kubernetes.io/docs/concepts/architecture/controller/
28 https://www.openshift.com/blog/kubernetes-operators-best-practices
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: grafana-operator.v3.7.0
To see some of the namespace-scoped operators that PetBattle needs, run the
following command.
$ oc get subscriptions
Let us now take a look at how operators can be used by our PetBattle team.
Operators in PetBattle
We use operators to create and manage resources such as Infinispan cache and
Keycloak SSO instances. We simply install the Infinispan operator and deploy a relevant
custom resource to tell it to create and manage a replicated cache. We don't have to
know about spinning up Infinispan Pods or creating SSL certificates or provisioning
storage space. The operator will do all of this for us, and if something fails or is
accidentally deleted, the operator will look after the recreation of the resource. In the
Infinispan example, if we delete the Infinispan K8s service, the operator will be notified
about its deletion and recreate the service automatically. As developers, we don't have
to worry about managing it.
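As a rough sketch (field names may vary between operator versions, and this is not the exact definition used by PetBattle), the custom resource handed to the Infinispan operator looks something like this:

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: pet-battle-infinispan
spec:
  replicas: 2
  service:
    type: Cache

From this short declaration, the operator creates and watches the Pods, Services, and certificates that make up the cache cluster.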
It is simpler to think of Operators as looking after stuff so you don't have to. It is also
possible to use multiple operators in combination to automate complex workflows. For
example, we use Keycloak for its SSO gateway and user management functionality. The
Keycloak instance is deployed and managed via a Keycloak Operator. We just need to
build and send a custom resource to the API and the operator will do the rest. One of
the resources managed by the Operator is a Kubernetes Secret that contains the TLS
certificates and keys, which clients interacting with the Keycloak instance will need
to use. Given that Keycloak is the security gateway to our application, it is prudent
to ensure that all communications are encrypted. However, this causes issues for
Java-based applications; to use SSL, the JVM requires that it be provided with a Java
TrustStore containing the SSL/TLS certificates and keys so that the JVM can trust them.
So, how do we take the Secret with the TLS certificates and keys and convert that into a
TrustStore that the Java applications can use? We could do a whole heap of scripting with
Bash, the Java Keytool, and potentially other tools to extract the certs/keys, create the TrustStore, convert formats, and finally inject the certs/keys into said TrustStore. This is
manual, complex, and error-prone work. We will also have to recreate these TrustStores
for each environment and handle lifecycle events such as certificate expiry.
Alternatively, we could use an operator, in this case, the Cert-Utils operator. We first
install the Cert-Utils operator in the PetBattle namespace. This Operator was developed
by the Red Hat Consulting PAAS Community of Practice29 to help manage certificates
and JVM Keystores, along with TrustStores.
To use this Operator, we first create a ConfigMap containing a set of specific
annotations. The Cert-Utils operator will detect these annotations and create a
TrustStore containing the relevant certificates and keys; it will also add the TrustStore
to the ConfigMap. Finally, we can mount the ConfigMap into a Deployment and instruct
the JVM to use that TrustStore. The following resource definition will create the
TrustStore with the relevant certificates and keys.
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
    cert-utils-operator.redhat-cop.io/generate-java-truststore: "true"
    cert-utils-operator.redhat-cop.io/source-ca-key: "service-ca.crt"
    cert-utils-operator.redhat-cop.io/java-keystore-password: "jkspassword"
  name: java-truststore
This does the following: the service.beta.openshift.io/inject-cabundle annotation asks OpenShift to inject the service CA bundle into the ConfigMap (as service-ca.crt), and the cert-utils-operator annotations tell the operator to generate a Java TrustStore from that CA key, protect it with the given password, and add it back into the ConfigMap.
29 https://github.com/redhat-cop/cert-utils-operator
In the Tournament service, the following Quarkus configuration will mount the java-truststore ConfigMap and configure the JVM accordingly.

# Mount the configmap into the application pod in the /tmp/config/ directory
quarkus.kubernetes-config.enabled=true
quarkus.openshift.config-map-volumes.javatruststore.config-map-name=java-truststore
quarkus.openshift.mounts.javatruststore.path=/tmp/config/
We've only just scratched the surface of operators. OpenShift ships with a number
of supported operators and there are many community operators available as well.
We used many community-based operators in this book, such as the Infinispan
operator and Keycloak operator; there are productized versions of these operators
available as well. There are many more operators from multiple vendors available from
OperatorHub.30
It is also possible to write your own operators if required. The OperatorFramework31
is an open-source toolkit, including the Operator SDK, with which you can write your own operators using Go, Ansible, or Helm.
30 https://operatorhub.io/
31 https://operatorframework.io/
OpenShift itself also helps with certificate management. Its service serving certificate feature32 generates and rotates TLS certificates for services that request them via annotations. For example, the Keycloak Service is annotated as follows:

apiVersion: v1
kind: Service
metadata:
  annotations:
    description: The web server's https port.
    service.alpha.openshift.io/serving-cert-secret-name: sso-x509-https-secret
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1615684126
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1615684126

The platform responds by creating a Secret containing the signed certificate and key:

apiVersion: v1
data:
  tls.crt: ...
  tls.key: ...
kind: Secret
metadata:
  annotations:

32 https://docs.openshift.com/container-platform/4.7/security/certificates/service-serving-certificate.html
Decoding the certificate from this Secret shows that it was issued by the platform's service serving signer for the Keycloak service:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 1283774295358672234 (0x11d0e1eb7ea18d6a)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = openshift-service-serving-signer@1615684126
        Validity
            Not Before: Mar 15 08:34:54 2021 GMT
            Not After : Mar 15 08:34:55 2023 GMT
        Subject: CN = keycloak.labs-staging.svc
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
Conclusion
To be able to successfully run your software at scale in production, a good
understanding of the instrumentation that surrounds the software stack is required.
OpenShift is a modern platform that provides all of the capabilities required to observe
and, in a lot of cases, automatically heal your applications while they're running.
In this chapter, we have discussed many common technical patterns that allow
application developers to make use of these common platform capabilities. For
example, one of the simplest patterns is to always log to STDOUT so the platform
logging mechanisms can be leveraged. With containers, it becomes an antipattern to log to specific files mounted in a temporary filesystem within your container, because those logs are not easily visible to the platform.
More complex patterns are also important to keep your business service applications running, even during disruption and change: correctly configuring liveness, readiness, and startup probes so that your application can deploy without loss of service, and configuring Pod Disruption Budgets for when nodes are restarted. Using application features to expose Prometheus metric endpoints for alerting and monitoring on the platform is a great way to alert teams when human intervention is required.
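As a reminder of what one of these guardrails looks like (an illustrative sketch rather than PetBattle's actual configuration), a PodDisruptionBudget keeping at least one replica of a hypothetical my-app Deployment available during voluntary disruptions such as node drains might look like this:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

Note that older clusters expose this resource under the policy/v1beta1 API version.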
What should we do next? Should we go around the Delivery Loop again or should we
return to the Discovery Loop? What about revisiting our options? These will be the
questions we answer in Chapter 17, Improve It.
In Chapter 18, Sustain It, we will revisit the entire journey we've been on and explore
what it takes to sustain this way of working, including how to sustain the people,
processes, and technology improvements we have been making throughout this book.
We can then start to scale and spread these ways of working to other teams throughout
the organization.
17 Improve It
We did it! We made it all the way around the Mobius Loop.
First, we built a foundation of open culture, open leadership, and open technology. We
successfully navigated our way around the Discovery Loop using practices such as the
North Star and Impact Mapping to discover our Why, and practices such as Empathy
Mapping and other human-centered design tools to discover our Who. We even started
the Discovery of our How by commencing some Event Storming, Non-Functional
Mapping, and Metrics-Based Process Mapping. We did all of this to gather just
enough information and just enough collective, shared understanding to derive some
measurable Target Outcomes.
We used these Target Outcomes to guide our way through the Options Pivot. We
explored several prioritization techniques and practices such as User Story Mapping,
Value Slicing, impact and effort prioritization, how/now/wow prioritization, design
sprints, and Weighted Shortest Job First to produce our initial Product Backlog. We
designed experiments that went into this backlog.
We then moved to the first iteration of the Delivery Loop, where features were coded,
applications were written, and we carried out experiments and conducted research.
We made use of established Agile practices to achieve this in a way where we could
measure and learn against the original outcomes as quickly as possible.
In the previous sections, we took a deeper dive into the technology that teams use to
build, run, and own their solution.
In each part of the Mobius Loop, we collected valuable information that was
summarized at the end of the section on canvases. If we piece these three canvases
together, we can see how everything connects:
• Discovery:
• Who were we doing this for and why?
• What was their problem, need, or opportunity?
• What were the customer and organizational outcomes we set out with?
• What was the impact on the outcomes?
• Options:
• What were the actions we agreed to deliver?
• What were the options that would help reach the outcomes?
• What was the relative priority?
• What did we learn?
• Delivery:
• What was done?
• What did we say we were going to research, experiment, and launch?
This chapter is called Improve It. Maybe the whole book should be called that because,
really, continuous improvement is what it's all about. Everything we do is focused
on how we can continuously improve; whether this is the technology, our users'
experience, our culture, or the metrics that we employ.
In this chapter, we are going to explore what we do when we reach the end of an
iteration of the Delivery Loop: What did we learn? Did we learn enough? Have we
moved toward our target measurable outcomes? And, most importantly, what should
we do next?
This learning is very important and should drive conversations, inferences, and
conclusions about what was learned. This is why visualizing these metrics is so
powerful. Instantly, we can all see what the current measure is, what the measure was
before the last iteration of the Delivery Loop, and what the target measure is to achieve
the desired outcome and impact.
If the conversation suggests we are not able to learn from these metrics, we need
to inspect why that is. Running a deep Retrospective to ask why we are not learning
enough from our Delivery Loops can be very helpful. Techniques such as the Five Whys
or the Ishikawa Diagram are excellent deep retrospective approaches for facilitating
these discussions and driving improvement actions that the team can put in place to
facilitate learning.
Ultimately, teams need to decide whether they are measuring what matters, whether
the measures are accurate and reflective of their work, and whether the data gives them confidence that they are moving toward their Target Outcomes. The most important question
teams should ask themselves is: Did we learn enough?
Over time, you will do more iterations of Discovery and Delivery Loops and spend more
time in the Options Pivot. The Mobius Loop provides a fantastic visualization of how
long you've spent on each loop, how fast you've traveled around each loop, and how
often you've pivoted from one loop to another. It will also inform what level of balance
you have between continuous discovery and continuous delivery.
Some warning signs to look out for include the following:
• We just keep going round and round in circles of the Delivery Loop. This
suggests that we're not taking the time to revisit and re-assess outcomes and are
moving toward being a feature factory and blindly building outputs.
• We spend too much time on the Discovery Loop. This suggests that we are in a
mode of analysis paralysis. We overthink and overanalyze our Why and Who and
never get to test our ideas or hypotheses. We may risk missing market windows
or not delivering anything.
• We jump from Discovery to Delivery. This suggests that we're not taking the
learning from Discovery and distilling, organizing, and making some important
decisions about what to deliver next or how to get knowledge and learning more
quickly (through, for example, research or experiments, as opposed to blindly
building features).
• We jump from Delivery to Discovery. This suggests we're not taking the time to
factor in learning back to our options and prioritization.
• We never move back to another part of the Mobius Loop. This suggests we're
not working in an iterative or incremental way and not building learning into
our system of work. This is really linear work and, as we saw in Chapter 12, Doing
Delivery, when we explored Cynefin, it is only a good approach when work is in the simple domain.
Let's look at a story where a major pivot and shift between loops was triggered by
learning.
1 https://www.redhat.com/en/blog/red-hat-welcomes-swiss-based-medical-company-easier-ag-waterford-its-emea-open-innovation-labs
When stakeholders saw the first increment of the app, there was a sudden
epiphany moment. In order to fully appreciate, understand, and learn from
this solution, we would need to have prototypes for not one but two apps.
The experience of doctors and nurses triaging requests would be key.
This triggered an immediate shift back into the Discovery Loop to focus on
these personas and processes as the team lacked insight into their needs
and problems.
This is one of many examples where the team gathered just enough information and
feedback to trigger a pivot and change in direction to initial thinking.
Think back to PetBattle's target outcome of increasing site engagement with our existing and new customer base by the end of the year. Well, we've built the tournament feature now. In fact, we created an A/B test in Chapter 11, The Options Pivot, and
designed an experiment about how and when the next tournament should begin. What
did the results tell us? Did we achieve the impact of more site engagement? Did this
hypothesis prove true or false? Let's update the Impact Map.
We met Mary, one of our users. We ran the Empathy Mapping practice and other
human-centered design techniques from Chapter 8, Discovering the Why and Who. We
heard from her how much she likes the top three leaderboards. We first did some user
prototyping and testing with Mary. We could do another Empathy Map on the latest
working software of PetBattle to capture the latest information on what she thinks,
sees, hears, and says about the latest increment of the software. Let's update or create
new Empathy Maps with Mary and other users.
We used Event Storming to get us just enough understanding of the business
flow for the one where Mary enters the daily tournament and wins a prize. There
were a number of pink square sticky notes on the Event Storm, which represented
assumptions, questions, and unknowns. We know much more information now that we
have delivered some features, conducted some research, run some experiments, and
developed the conversation. We can update the Event Storm and perhaps even start to
generate a whole new part of the system or area of functionality.
Usually, we only ever have three or four Target Outcomes. Perhaps we've now met them
or are close to meeting them; our learnings and measures may warrant us rethinking or
rewriting Target Outcomes. Maybe we need to think about the next Target Outcomes
to take our application to the next level and stay ahead of the competition.
Our User Story Map, Value Slice Board, and prioritization practices were introduced in
Chapter 11, The Options Pivot. We have now delivered items in the top slice or slices of
value. Our learning may have triggered us to rethink existing priorities or re-slice and
re-plan value delivery. Let's update these artifacts to reflect our latest viewpoints.
Our Product Backlog is always ready for more Product Backlog Refinement. With all of
the updates to the Discovery and Options Pivot artifacts, it's sure to need another look
and update. Let's refine the Product Backlog.
So, all the artifacts we've produced from all the practices we have used are never done—
they are living breathing artifacts. They should always be visible and accessible by
team members and stakeholders. The more often we update these based on measures
and learning, the more valuable our top Product Backlog items will be, and the more
responsive and reactive our products can be to user, market, and stakeholder needs.
This is known as business agility.
The security controller realized this was not going to be one of those
projects putting him under severe pressure days before go-live. The team
was learning from him and he was injecting his needs and knowledge into
automation. He would still have a week set aside to run his penetration
tests but he felt much more confident. It turned out to be a very positive
experience and one of the quickest security sign-offs ever done.
This experience highlighted the importance of increasing confidence
in important stakeholders like security controllers and sharing learning
continuously between the team and stakeholders. And, of course, confidence
can be measured through practices such as confidence voting!
This story highlights the importance of always improving metrics and automation. If
someone is carrying out the same activities (such as security tests) repeatedly, it is a
candidate for automation.
walls" and see the way of working in action, but much of his leadership
had not seen it. He explained how his management only really understood
numbers and spreadsheets—they weren't really convinced by the use of
sticky notes and colorful walls. How could he prove to them that this was
truly a better way of working?
This made me think of a practice in our Open Practice Library we had not yet
used: Metrics-Based Process Mapping. This was exactly the tool to use for this
particular problem. So, we got some sticky notes and a huge movable board.
Figure 17.2: Using a Metrics-Based Process Map for a European automotive company
We captured all the things that used to happen between their business
or users requesting a feature and that feature running in production. We
captured the Lead Time, Process Time, and Completeness and Accuracy
metrics all as per the practice described in Chapter 9, Discovering the How.
We captured these for the old way they used to deliver increments of their
software and the new way. For the old way, they trawled through some
ticketing systems to help them get metrics. For the new way, they collected
metrics from Jenkins, GitHub, Ansible, and other tools used during the
engagement. They also visualized the old and new structures of teams. The
resulting board is shown in Figure 17.3:
Figure 17.3: Capturing different metrics from the legacy approach and the new approach
The actual numbers and information on the sticky notes are not important
for this book, but the process and resulting visualization show some big
learnings. Each pink sticky note represents a thing that happens between a
feature request and the feature running in production.
Everything below the red line represents the old legacy approach to
delivering features. The tasks were completed by people in different
departments and teams—each represented by the horizontal yellow lines.
As you read the pink sticky notes from left to right, every time you have
to cross a yellow line to read the next sticky note to the right represents a
handoff or a handover to another team. This often involved raising a ticket,
booking a resource, reserving some time to communicate, and waiting or
being in a queue.
Everything above the red line represents the new approach to delivering this
software product that had been kick-started by our engagement. It's all on
one horizontal line because it is a new, long-lived, cross-functional product
team doing all the work. Infrastructure, design, development, testing,
deployment, and operations were all performed by this new product team.
"Just Enough" Leads to Continuous Everything | 717
The green line represents where the pink task immediately above has
been automated. This is CI/CD in action! The pipeline has automated
code scanning, builds, containerization, deployments, unit tests, end-to-
end tests, contract tests, vulnerability scanning, security tests, and UI tests. As we
summed up the old ways and new ways and considered a few nuances
for different types of development work this team would undertake, the
measurable impact of continuous delivery was staggering. This was shown
to a group of senior leaders, which brought a few gasps.
(Table: comparing the legacy approach with the New Way of Working, the measured lead and process times for getting features to users were reduced by 31 weeks, 20 weeks, 155 hours, and 9.5 hours respectively.)
One stakeholder asked what the main contributor to this shift in numbers
was. The Product Owner explained that there were three things. Some of this
was down to the technology being used, including the OpenShift platform. Some
of this was down to the ways of working, including the Mobius Loop and
Foundation practices that had been employed. But, most of all, it was because
of this team. This long-lived product team had proven that they could not
only achieve these metrics but would go on to improve them further.
The three contributing factors in the preceding story are the things we
should strive to continuously improve by continuous learning: improve the
technology, improve the ways of working, and improve the team.
Most of this chapter has been about the ways of working and how we use the Mobius
Loop model to promote continuous improvement. Let's now consider the tech and then
the team.
Above all, listen and learn from the team members and reserve time in each iteration
to promote learning. Continuous investment in team learning is one of the strongest
success factors of a high-performing, long-lived team.
Figure 17.4: Visualizing the skills in the team using spider charts
As a long-lived team, this is a great practice to get into the habit of using.
The team can regularly inspect whether there is team learning occurring
and whether the team and individuals are becoming more cross-functional.
The cowboy hat was also used to highlight anti-patterns and things that
people shouldn't have done (such as applying -DskipTests=true to get a
build deployed without running automated tests). If such behaviors were
spotted, whoever was responsible would have to wear the cowboy's hat until
the anti-pattern had been shared with the whole team and they had been
educated.
It's a bit of fun and creates a safe space for better team culture. It's also educational and
promotes continuous learning.
Conclusion
In this chapter, we explored what to do when we come out of a Delivery Loop iteration
to ensure we improve as we either go round the Delivery Loop again, return to the
Options Pivot, or go back to the Discovery Loop. We looked at how we take metrics and
learning from Delivery, assessing what we learned and whether this is enough to decide
the next steps to be taken.
We are now operating in a continuous flow of innovation from Discovery to Delivery
and back again. We started with just enough and just in time information to get going.
We're learning all the time.
We looked at how we can measure the improvements in our new system and ways
of working by returning to the Metrics-Based Process Mapping practice to quantify
improvements in the technology, ways of working, and the team. We recognized the
importance of continuous learning and continuous improvement in all of these.
In the final chapter of this book, we will look at ways to sustain everything we've
covered in this book. There has been a great focus on one team and one dream. This
has been intentional. As we look to sustain this, we will also see how we can re-use all
of the patterns, approaches, and practices used throughout this book to grow, mature,
and scale a product-centric mindset to applications and platforms—even to leadership
and strategy.
18 Sustain It
In the previous chapter, we transitioned from a single trip around the Mobius Loop,
covered from Chapter 8, Discovering the Why and Who, through Chapter 13, Measure
and Learn, into an ongoing, never-ending, continuous journey of learning and
improvement informed by metric-based outcomes.
This book has focused on how a team can travel round the Mobius Loop from Discovery
through Options and Delivery and back to Discovery again. The Open Practice Library
has helped instantiate Mobius into a true outcome-based delivery built on a foundation
of open culture and open technology.
The PetBattle case study was all about one team achieving one dream to be a high-
performing, cross-functional, long-lived unit, delivering an awesome product and
getting the very best out of the underlying technology, including OpenShift. Through
the use of people, processes, and technology, they are a seed of inspiration and a
trigger of infectious enthusiasm.
Mobius is more than just a map for a team to follow; it is the navigator for the whole
system of working.
In this chapter, we are going to ask, how can we use Mobius and open practices to drive
this mindset of continuous discovery, continuous delivery, and Options Pivoting on a
foundation of culture and technology throughout a much larger organization? In other
words, how can we get 50 teams working together like the PetBattle team? We will
explore how we can enable more Application Product Teams to work the same way,
what it means for the platform, and what it means for the leadership and organizational
strategy.
Let's start by visualizing the journey undertaken so far.
1 https://www.mobiusloop.com/kit/
The Journey So Far
We typically start on the left loop, although we don't have to. Sometimes legacy systems
and organizations will start amidst delivery on the right loop and will seek to pivot over
to the left as soon as is feasible. This happens because there are important, powerful
questions that need to be asked about Discovery, and the sooner the organizations can
get back to asking those questions, the better.
As we work our way around the Mobius Navigator from the left, we ask the powerful
Why questions. Why are we doing this? Who are we doing this for, and why will it help
them? We discover what's really going on to come up with outcomes. We then generate
ideas, refine and map out how to deliver options, and decide which options we're
going to work on next. We deliver by designing, creating, and launching, measuring the
impact as we learn by reflecting and improving. Finally, we adapt by discovering more,
delivering more, or pivoting completely.
The Mobius Loop allows us to take the Mobius Navigator mindset and drill down into,
for example, a product team and how they use Discover Maps, Options Maps, and
Deliver Maps to continuously evolve products, delivering outcomes that matter with
continuous measurements and learning built in. We saw this in action in PetBattle and
with the many stories dotted throughout this book.
We used a Discover Map at the end of Chapter 10, Setting Outcomes, to summarize
Section 3, Discover It, and capture all the learning harnessed during the Discovery Loop.
We used an Options Map at the end of Chapter 11, The Options Pivot, to summarize
Section 4, Prioritize It, and capture all the learning and decisions made during the
Options Pivot. We used a Delivery Map at the end of Chapter 13, Measure and Learn,
to summarize Section 5, Deliver It, and capture all the delivered impacts and learnings
made during the Delivery Loop. If we connect these maps together, we see how the
Mobius Loop is used to connect these three sections together and summarize the
learning of product evolution.
There is a lot of detail in Figure 18.2, and you can examine individual maps either at
the end of the chapters outlined above or access PDF versions in this book's GitHub
repository.
Now, what if we had 10 different product teams in an organization working this way,
each with their own set of maps for a product or product area they were responsible
for?
This idea becomes plausible through infectious enthusiasm, increased platform
adoption, and an enhanced product mindset at different levels throughout the
organization.
A side note, but the majority of this book was written during the COVID-19 pandemic
when infection and spreading were very negative phenomena. It's nice to be writing
about a much more positive impact of something being infectious!
Infectious Enthusiasm
One of the reasons we advocate starting with one team is to test, learn, inspect, and
adapt with a small slice of the organization or department. Starting the right way with
a single team and showing rapid success can be incredibly powerful before any kind of
spread starts. This team is our Patient Zero.2 By taking this one-team approach first, we
are actually taking an Agile approach to moving toward a more Agile way of working.
2 https://en.wikipedia.org/wiki/Index_case
With that first team, we want to enable them to learn as fast as possible by doing
everything described in this book and by adopting practices in the Open Practice
Library. No training course, YouTube video, or book will ever, on its own, enable a team
to instantly work in the way envisioned and achieve the outcomes anticipated by this
book. We hope you've got a lot of value from this book and are armed with inspiration
and information, but we don't for one moment think your team will instantly transition
to this mode of working, even if they read the whole thing cover to cover.
People and teams can be enabled by going through the actual journey for themselves.
They have to build their own foundation of culture and technology. They have to
travel around the Discovery Loop, Options Pivot, and Delivery Loop to experience
the Mobius Navigator mindset themselves. In doing this, they need an application, a
system, something that delivers real business value to real end users, so they have the
opportunity to Build It, Run It, Own It themselves. We advise starting with something
that has tangible business value but is also relatively small, so that you can experiment
and learn with one small team to start with.
Getting one application into production using all the practices and principles described
in this book is a challenge and your team will almost certainly identify bottlenecks
and blockers throughout the process. The great news is that the Navigator and mental model are built to address this.
Being able to improve in the way described in Chapter 17, Improve It, enables these
bottlenecks and blockers to be addressed, one by one, and validated through metrics.
Many of the underlying blockers come from silos in the organization that need to be
broken down. The product team, through autonomy, mastery, and purpose, need to be
able to silo-bust their way all the way to production!
The challenges being surfaced by such a team on a mission to show this new mode
of operation need timely resolution, either through platform evolution or leadership
enablement. For example, if there is poor or no Product Ownership in place, this
blocker needs to be raised through a Retrospective event (as explained in Chapter
12, Doing Delivery) or the use of the Real-Time Retrospective or even a Stop-the-
World event (practices described in Chapter 4, Open Culture). It's best to identify
these problems early on and address them with one team, rather than accepting
what we know is sub-optimal and ending up scaling inferior ways of working across the
organization.
Right at the beginning of this book, we explained how we have a saying that we like to
show, not tell. Show, not tell, is how we create infectious enthusiasm. We have to show
the world the people, the process, and the technology working together in full glory.
Let's look at a few ways we do this.
Demo Day
In Red Hat, when we run Open Innovation Labs residencies (4-12 week immersive
engagements where our customers experience culture and practices on a real business
use case), we always finish the engagement with a Demo Day event. Of course, we run
demos throughout the engagement. We typically use Scrum (as described in Chapter 12,
Doing Delivery), which means we run a Sprint Review showcase event every week (we
tend to run one-week sprints).
While we encourage the idea of inviting the world to these weekly events, practically
it is not possible for every developer, operator, leader, engineer, and anyone else
connected in the organization to attend. We do record them and try other mediums to
get the word out about these weekly showcases, but we find we need an extra-special
Demo Day event.
Demo Day is a showcase of the entire way of working—the people, the processes,
and the technology. In our residency program, it's the people from our customer's
organization (also known as the residents) who deliver nearly all of the Demo Day
showcase. By this stage, the infectious enthusiasm has already spread to them and this
is their opportunity to show (not tell) everything the residents from our customer's
organization have done to build, run, and own the product that's been the focus of their
residency engagement. This is an opportunity to show their peers and colleagues what
they've built, run, and owned; but more importantly, it is an opportunity to show the
how of these stages.
This is a showcase of all of the practices the residents used in this time-boxed period of
working together, including:
• Their cultural foundation and how they've used practices such as social
contracts, Stop-the-World events and Real-Time Retrospectives. They show
the team identity they've achieved, how they use team sentiment tools, and
how they have created an awesome, thriving team workspace. They explain
and show how they used tools such as priority sliders, Definitions of Ready and
Done, and acceptance criteria to drive easy conversation, aligned and shared
understanding, and a strong open culture.
• Their technical foundation and how they use containers and the OpenShift
platform as the root of their technical foundation. Adding technical practices
such as Everything as Code, test-driven development, test automation, CI/
CD, and continuous deployment is what really provides a strong platform and
foundation for continuous delivery. They show the value they get from pair
programming and mob programming to achieve team collective ownership of the
product. They show how they have achieved autonomy, mastery, and purpose
from this foundation.
• They walk the walls and explain the Impact Maps, Empathy Maps, Event
Storms, Non-Functional Maps, and Metrics-Based Process Maps and how they
collectively drove the target measurable outcomes to take the organization
toward its North Star.
• They show how prioritization worked during sprints using User Story Maps,
Value Slice Boards, and other prioritization practices they chose to use.
• They show how the product has evolved, how the architecture has emerged, and
the delivery practices used to iteratively and incrementally deliver.
• They demo the app they've built. They demo the platform capabilities being
utilized and show some of the platform features they've utilized to improve
operations, drive a more experimental approach to feature development, and
radiate real-time metrics.
• They give some of the mind-blowing metrics that this technology, these
practices, and this team have managed to achieve and how they compare to the
legacy, status quo way of working.
• They share directly with their leadership what they need to sustain this way of
working and what their hopes and fears are when they return to their normal
place of work.
• They celebrate with pride what they've achieved and what, as a team, they've
become.
The infectious enthusiasm is hard to ignore when people are sharing with such passion
and energy what they've managed to achieve with the technology and practices they've
adopted. Even the most introverted people have delivered incredible showcases at
Demo Day. It's something you cannot write down or explain. You need to be there to
feel and experience the energy and excitement.
Another aspect we often try to build into Demo Day is to show a typical day in the
life or sprint in the life of the team. To do this, we create lots of videos during our
engagements.
3 https://youtube.com/playlist?list=PLnqlDDtSH2A4e3dTIGHGyhCYmwIoaYxV5
4 https://www.schabell.org/2018/01/inside-my-open-innovation-labs-residency-part-1.html
We have also incrementally built up the pictures of these people, processes, and
technology in action over the course of several weeks.
These sketches have hung proudly in public areas of the organization's headquarters.
They stimulate interest and provide inspiration and another source of infectious
enthusiasm. These visuals often encompass everything you might see if you came on a
walk-the-walls tour of the team space.
Walk-the-walls tours are not rehearsed and do not follow a script. We simply talk about
what we see and tell stories about how we built what we're seeing: the Big Picture, the
Practice Corner, Impact Maps, Event Storms, the North Star, Metrics-Based Process
Maps, Value Slice Boards, Product Backlogs, Sprint Boards, Retrospective results, build
monitors, test metrics, deployment statistics, operational dashboards, and many, many
more. A great walk-the-walls experience is more than simply Show, Not Tell—it's Feel
and Show, not Tell. A high-performing team has a sense of autonomy, mastery, and
purpose, uses great open cultural practices, and adopts the best technology available on
a world-class platform. When you see such a team practicing all of this, it's hard not to
feel the culture and energy in the room. That is what is truly infectious. Any teams‑to-be
walking the walls and feeling this energy tend to want in—and to be as this team is.
We have tried to bring to life a couple of walk-the-walls experiences for both an
in-person product team5 and a virtual product team.6
Where we can't have every team and every stakeholder visit the team and walk the walls
with them, we look for other mediums to help show the world the team and technology
in action through, for example, what we call Written Showcases.
Written Showcases
Written Showcases are exactly what they say they are—Showcase events but written
down! We started doing these on a weekly basis where we would be frustrated that
the weekly Sprint Review Showcase event would not be as well attended as we'd hoped
for. We'd invited senior leaders, stakeholders, and other teams to come and walk the
walls but, in a busy world, it's hard to get everyone to visit—especially in a time-boxed
engagement of perhaps just a few weeks to enable and kick-start this team.
So, we write down all the things we want to tell our stakeholders in Sprint Review
showcase events and walk the walls. And we show as much as possible through colorful
photos, links, and montage videos.
These written Showcases often take the form of an HTML email or a PDF (with all the photos and videos) that includes links to artifacts and information radiators and the
latest metrics collected from the evolving application, software delivery, and platform.
It's difficult to give a written document the same level of visual appeal as an engaging
and immersive walk-the-walls tour. But we can try, and the great advantage is that such
documents can be shared far and wide in the organization and word of mouth becomes
a trigger of infectious enthusiasm.
5 https://youtu.be/70y6SEz6bas
6 https://youtu.be/TtvDJIT6RYo
Word of Mouth
Word of mouth in the field has proven to us time and time again to be a key ingredient
of sustaining and growing the culture that we are seeing with one team. After a few
weeks of being invited to Sprint Review Showcase events and walk-the-walls tours,
receiving written showcase emails, and seeing photos and videos of what looks like
some awesome team and techie stuff happening, the team in question gets talked
about.
We start to hear chats in the corridor such as, "Have you seen those folks up on the 5th
floor? Sticky notes and monitors everywhere…not sure what they're doing but it sure
does look interesting" or "We need to go and see that product team who have been in the
basement the last few weeks; they seem to be doing some awesome techie shizzles and I'm
sure we can learn from them" or even, "I've just done this walk the walls thing with that
team downstairs—they're doing all the stuff you read and learn about on training courses
but they're doing it so much better and for real!"
And the conversation goes on when you see some of the metrics around software
delivery and operability. Word of mouth gets around about some of the amazing
outcomes that one team is achieving. How can others not want to learn more?
Figure 18.5: Moving boards and information radiators to show and inspire others
One more thing. He gave me an updated version of the slide he'd previously
used at Demo Day. The process improvement of getting new functionality
done into production was no longer 34 times better than the legacy
practices and legacy technology. It was now 105 times faster.
Table 18.1: Metrics-based benefits of using new DevOps Culture and Practices (the measured lead and process times for delivery to users were reduced by 31 weeks, 20 weeks, 158 hours, and 9.5 hours respectively)
What is great about this story is that the culture and practices were subject to
immediate scaling. We saw the journey of ensuring the product team continued to
perform and improve underpinned by the metrics. I believe several impediments and
blockers had to be fixed by leadership to enable this. The proof point was scaling
the team's application operationally across 89 markets while interest and infectious
enthusiasm grew across the business.
And eventually, some of the key people involved in making that one team, one dream
a success progressed to enabling seven teams to work, behave, and think in the same
way.
So, we've established that every new team needs to go round the Mobius Loop at least
three times to be engaged and immersed in using the platform and DevOps Culture and
Practices. What about the foundation?
Bolster the Foundations
As we explained in Chapter 12, Doing Delivery, we also need to look for when to bring
the guardrails down that some of these practices provide. Just because a product team
started using practices associated with the Scrum framework and they worked really well for a few months, doesn't mean that these practices should be used forever by that team.
Teams seeking to continuously improve need to inspect themselves and recognize
when a practice they've been using successfully for some time might actually be
counterproductive and a bottleneck.
Having a common underpinning foundation of culture and technology is what allows
many teams to successfully travel around the Mobius Loop. This means creating a
culture where teams learn from each other, and infectious enthusiasm continues
to evolve—not just from the first team but by subsequent teams and organizations.
Much of our experience comes from open source communities where sometimes
many thousands of people connect to serve a common purpose and iteratively and
incrementally deliver on this. We're seeing and helping organizations use the principles
of open source to form communities inside organizations. The practice of inner-
sourcing allows collaboration, contribution, and openness across the organization.
They achieve this by forming Communities of Practice and Communities of Interest.
Communities of Practice (CoPs) provide a space where people from different teams
who hold particular specialist skillsets can collaborate, share, ideate, and inspire.
Good examples include UI/UX CoPs, Product Ownership CoPs, or an OpenShift CoP.
Communities of Interest (CoIs) can be on any topic of interest to anyone and related to
something inside or outside the organization. Inner-sourcing and CoPs help to provide
and bolster the underpinning foundation of culture that multiple teams can leverage
sustainably and gain from.
Let's look at some ideas on how we might sustain technology.
7 https://en.wikipedia.org/wiki/Global_Internet_usage
Sustaining the Technology
It is with this backdrop in mind that we evolve our applications and services atop a
platform that has grown from its first commit in 2014, shown in Figure 18.7, to the most
prevalent container orchestration platform today.
For technologists, the challenge is keeping pace with this rate of change. To not change
is one option. However, it can be a fatal option; think Betamax and cassette tapes and
Perl (so sad)! We have said a few times in this book that we are optimizing to minimize the cost of change. We need to minimize the amount of time it
takes to modify or change a software deployment pipeline, to enhance it, or to give us
faster feedback. We need to minimize the time it takes to get ideas from our customers
and business into working software solutions in production. We need to minimize the
time it takes to update the platform. These drives to minimize time are based on the
realization that time is money in business and organizations, and that it is often better
to be fast than right. Gaining faster feedback allows teams to pivot and change faster
based on their business and customer needs.
So, how can technologists sustain this? We do this by continually learning, evolving the
long-lived team's shared knowledge and understanding of technology and solutions
that are continuously evolving and adapting. Practices such as pair programming and
mobbing bring to life this continual learning experience. It takes technical excellence
that can be shared and understood by the team, combined with the people and
processes practices described in this book, to win.
Businesses small and large are building their business platforms atop a myriad of
vendor platforms. This platform mindset is critical in today's hybrid cloud. One of the
main reasons for the success of the platform mindset is that platforms can become a
place where shared and common best practices are codified among different teams. If
the public cloud has taught us anything, it is that the right platform and infrastructure
is a real sustainable competitive advantage. OpenShift is positioned to be the platform
your organization and teams develop and continuously deploy business application
services upon, wherever your infrastructure is deployed.
Managing the life cycle of these platform architectures is a core skill that every organization has to develop. Many of the intricacies are managed as a service in the hybrid cloud;
this lowers the operational burden for your product teams. In OpenShift, with the
operator pattern, all of the expert knowledge can be codified, shared, and updated over
time to help automate and manage the individual pieces that make up the platform.
Perhaps most importantly, the operator pattern was used to unify the operating
system with the platform, making the OpenShift platform installation and life cycle far
easier from a user's perspective. Practices have also evolved to take on this life cycle
challenge. Site reliability engineering is a discipline born from traditional IT operations
and infrastructure that codifies best practices for teams managing these platforms in all
their possible guises.
To adopt a learning mentality is to accept that what you know today is not necessarily
going to be enough for the changes and challenges that lie ahead. For us, innovation,
research, and development happens in the open source communities of the world. You
can tap into this and become part of this through commercial open source, which in
turn becomes a competitive advantage to your organization. Commercial open source
software projects are owned by a single company that derives a direct and significant
revenue stream from the software through subscriptions, services, training, and
support.
Being adaptive to change is critical in any DevOps or transformation agenda. So many
times have we come across DevOps done wrong, and often the failures come down to these seven items. Watch out for these anti-patterns and know how to respond to them
appropriately as, if you don't, they can lead you down a dark path:
1. DevOps is not a one-person band: "Yeah, we do DevOps, Mary is here Tuesdays and
Thursdays and she manages our pipelines." So often do we come across phrases
like "The DevOps person," but this is an anti-pattern. DevOps is not just one team
either; it's a desire to want to work a certain way. To want to take end-to-end
responsibility for the software's journey and break down walls that impede feedback
loops. It's not some gun for hire that swans in and fixes everything before saddling
up and heading back out of town.
2. DevOps is not a one-time thing: Following on from this gun-for-hire analogy,
DevOps is not a one-time thing. It's not just a matter of configuring a pipeline
once and it's static forever. Like all technology, these things need to be nurtured
and evolved. As new components get added to your architecture or new software
becomes available, your pipeline and process need to evolve too.
3. DevOps is not the cloud: Using services from the cloud or just deploying a
Kubernetes cluster does not mean you're now doing DevOps. It's how you use
the platform as an enabler and view it as a product that can accelerate any
transformational journey.
4. DevOps is not giving root/admin access to everyone: This one should feel cut and dried, but doing DevOps does not just mean giving developers access to production. It's about developers and site reliability engineers working together,
listening to each other's concerns, and optimizing your delivery capabilities with
trusted and reliable pipelines.
5. DevOps is not just the tools: Just because you're using Jenkins or Tekton does
not mean you're doing DevOps. The tools are only a vehicle that can enable
teams to work together on a shared platform and space. The real goal is creating
ownership of the software stack and a feedback loop that connects users through
to development to increase speed to market.
6. DevOps is not a silver bullet: It's all about the people, the process, and the technology
combined and balanced. Doing DevOps is not going to solve all your problems; it's
just one aspect.
7. DevOps is not a distinct team: If your software development life cycle (SDLC)
consists of developers building code and then throwing it over the wall to the
DevOps team to deploy/manage, then you need to reevaluate your SDLC.
Now, if the idea of the DevOps Team is such an anti-pattern, the question needs to be
asked, "Is the same true for the idea of a Platform Team?"
Our first inclination may be that it is. If we truly want to bring down all silos and have
a fully cross-functional product team, surely that should mean platform expertise
is included in the team? Surely one or two of the T-shaped or M-shaped people
we defined in Chapter 1, Introduction – Start with Why, should be specialists in the
platform?
This has been a debate we've been having for several years, and perspectives do start
to change when we start to consider many product teams all working on a common
platform. This is a problem of scale. Of course, one of the goals of having a platform in
the first place is to solve this problem of scaling and reuse. It provides an underpinning
technical foundation for many teams to accelerate their development with. A single
product team like the PetBattle team will get many benefits from a platform. But, if
PetBattle were to get acquired by Pet Planet—a new Pet Tech firm with 50 product
teams—the economies-of-scale benefits offered by the platform grow exponentially.
Recently, we've been inspired by and resonate heavily with the work of Matthew Skelton
and Manuel Pais, who wrote the book Team Topologies. They introduce different team
topologies, which include:
1. The Enabling Team, who help train, coach, and guide a team to learn new practices
and acquire new skills, much like we do on our Open Innovation Labs residency
program.
2. The Stream Aligned Team, which is a team aligned to a single, valuable stream of
work. The Application Product Teams we have referred to throughout this book
align closely to this.
3. The Platform Team, whose purpose is to enable stream-aligned teams to deliver
work with substantial autonomy.
The reason we support the idea of a Platform Team and don't see it as an anti-pattern
like the DevOps team is that the platform is a product and should be treated as one.
Where Skelton and Pais differentiate between Stream Aligned Teams and Platform
Teams, we see both of these teams as Product Teams.
This book has been all about how Product Teams should operate and what is involved in
taking a product mentality to a team's subject. We visualize any product mental model
using the Mobius Loop and foundation.
Figure 18.9: Double Mobius Loop of application products and platform products
A Platform Team can use the same Discovery practices as we introduced in Section
3, Discover It. Rather than discovering a business application, they are using these
practices to discover the platform. For example:
• A Platform Team can have its own North Star metric, as described in Chapter
8, Discovering the Why and Who. This gives a single metric that best captures
the core value that the platform delivers to its customers or, more specifically,
Application Product Teams.
• A Platform Team can have its own Impact Map, as described in Chapter 8,
Discovering the Why and Who, capturing the deliverables that can be built into the
platform and the measurable impact we hypothesize these would have on actors
(such as developers) to help achieve the platform's goal.
• A Platform Team can use Empathy Mapping and other human-centered design
techniques to build empathy with, say, developers in Application Product Teams.
This is very powerful in avoiding the common anti-pattern of a platform being
mandated across the organization; instead, it creates a pull from the development
community, who want a platform that will help them and solve their pain points.
• A Platform Team can use Event Storming as described in Chapter 9, Discovering
the How, to map out the processes developers use or want to use to get the best
out of the platform. This enables the platform architecture to emerge. We can
also use practices such as Non-Functional Mapping and Metrics-Based Process
Mapping to trigger a continuous delivery approach with measures on the
platform.
• A Platform Team can set target measurable outcomes as outlined in Chapter 10,
Setting Outcomes.
• A Platform Team can organize and prioritize its work using practices such as
User Story Mapping, Value Slicing, Impact and Effort Prioritization, How-Now-
Wow Prioritization, and Weighted Shortest Job First to build a platform Product
Backlog prioritized by value delivered to Application Product Teams. Design of
experiments, design sprints, and even the advanced deployment concepts used to
roll out new platform capabilities can all be applied by Platform Teams.
• Platforms are complex systems as defined by Cynefin and explained in
Chapter 12, Doing Delivery. This means they benefit from being delivered
iteratively and incrementally with learning and feedback built in. An Agile
delivery approach such as Scrum or Kanban allows platforms to be showcased
to developers and operators iteratively and allows metrics and learning to be
collected.
• Platform Teams need to have their own foundation. Platform Teams need to have
autonomy, mastery, and purpose themselves. They need their own technical
foundation so they can apply practices such as Everything as Code, the Big
Picture, and CI/CD as explained in Section 2, Establishing the Foundation. The
technical foundation of the Platform Team is doubly important: not only does it
provide a foundation for emerging platform development, but it also morphs into the
technical foundation of all Application Product Teams.
• Platform Teams can work in spaces full of information radiation, collaboration,
and openness just like Application Product Teams. In fact, the most important
collaboration is with those Application Product Teams.
How do Platform and Application Product Teams interconnect? How do their respective
loops interconnect? Developers and other members of Application Product Teams
are the users and the stakeholders of the Platform Team. The decision and motivation
to use a platform have to come from the pull of those users, not from a top-down
management decision. Platform Teams evolve the platform and infrastructure to serve
needs and opportunities and address problems that developers and Application Product
Teams are having.
Two of the worst anti-patterns we see are where management makes a product
decision without talking to users, and where management makes a platform decision
without talking to developers and operators.
So, what can management take away from this, and what can Leadership Teams learn
from applying the double Mobius mental model to platform and application product
continuous discovery and continuous delivery? What if organizational leadership and
strategy were also treated as a complex product?
If we revisit once again the practices introduced in the different sections of this book,
we can see how they can be applied by a Leadership Team taking a product mindset
toward its strategy. For example:
• Every strategy should have a North Star.
• Impact Mapping has been used by many organizations to drive their company
strategy, articulated by a goal, actors, target impacts, and deliverables.
• Practices such as Event Storming, Empathy Mapping, and other human-centered
design techniques can be used to gain shared understanding and empathy with
shareholders, financial stakeholders, owners, competitors, and other market
forces.
• The Options Pivot practices in Section 5, Deliver It, can be used to prioritize
work, design experiments, and even consider advanced deployment of products
into the market.
• The Delivery practices in Section 6, Build It, Run It, Own It, can be used to deliver
increments of the strategy in an Agile way and run Showcases (to the company
and the market) and Retrospectives to continuously improve. Measure what matters,
capture and radiate organizational metrics, and use them to drive learning and
continuous improvement.
The outcome of Leadership Strategy as a Product is to set the intent for Application
Product Teams. Product Teams, in turn, set the intent for Platform Teams. All teams are
informed by measures, learning, and visualization—both to improve themselves and to
radiate information up from platform to product, and then to leadership.
The final practice from our toolbox to help connect all of these levels is
Objectives and Key Results, or OKRs.
OKRs comprise an objective—a clearly defined goal—and 3-5 key results that are
specific measures used to track the achievement of that goal. The goal of OKRs is to
define how to achieve objectives through concrete, specific, and measurable actions.
Key results can be measured on a scale of 0-100%. Objectives should also be supported
by initiatives, which are the plans and activities that help to achieve the objective and
move the key results forward. OKRs can be shared across the organization, with the
intention of giving teams visibility of the goals and of aligning and focusing effort.
OKRs are typically set at the strategic, team, and personal levels.
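To make the shape of an OKR concrete, here is a minimal, hypothetical sketch of how a
Platform Team might record one; the objective, key results, initiatives, and scores are
invented for illustration and are not taken from the PetBattle teams.
# Hypothetical Platform Team OKR (illustrative values only)
objective: "Make the platform the fastest path to production for product teams"
keyResults:                       # each scored on a 0-100% scale
  - result: "Median lead time from merge to production drops from 5 days to 1 day"
    score: 40
  - result: "8 of 10 Application Product Teams run their pipelines on the platform"
    score: 60
  - result: "Developer satisfaction with the platform rises from 6/10 to 8/10"
    score: 25
initiatives:
  - "Provide self-service pipeline templates"
  - "Automate namespace and access provisioning through GitOps"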
A nice way to start with OKRs is to link them to the Impact Mapping practice introduced
in Chapter 8, Discovering the Why and Who. The goal statement in the Impact Map
translates to an Objective (O). The measurable impact statements connected to the
goal translate to Key Results (KRs). So, the Impact Mapping practice being used in the
Discovery Loop of Strategy and Leadership helps set the intent for products, which can
each have their own Impact Maps and OKRs. A Platform Team can also use the Impact
Mapping practice to capture and radiate their Why and build OKRs to support the
application products and strategy above.
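As a hedged illustration of that translation, the sketch below takes an invented Impact
Map fragment and shows which parts become the Objective and Key Results; the goal,
actor, and impacts are made up for the example rather than taken from the book's
Impact Maps.
# Invented Impact Map fragment and its OKR translation (illustrative only)
impactMap:
  goal: "Double weekly active PetBattle players this quarter"        # Why
  actor: "Casual pet owners"                                         # Who
  impacts:                                                           # measurable behavior change
    - "Casual pet owners return to the app at least three times a week"
    - "Casual pet owners upload a new pet photo every week"
okr:
  objective: "Double weekly active PetBattle players this quarter"   # goal -> Objective (O)
  keyResults:                                                        # impacts -> Key Results (KRs)
    - "Average sessions per casual pet owner rise from 1 to 3 per week"
    - "Weekly photo uploads per casual pet owner rise from 0.2 to 1"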
Figure 18.11: Mobius Loops connected by OKRs and information, metrics, and learning
The double loop learning applied throughout this book using Mobius to evolve products
applies to platforms and leadership strategies. They are all connected through OKRs,
which flow down the organization and change based on information, metrics, and
learning flowing up from the platform and products.
Conclusion
In this chapter, we explored how we Sustain It—how we sustain everything we've
covered in this book and potentially even grow it!
A lot of this comes down to infectious enthusiasm—infectious enthusiasm for the
technology and platform, for the way of working and the practices, and for the team
and culture. We looked at different ways that infectious enthusiasm is generated and
shared some stories of how the one team, one dream mantra scaled to multiple
Application Product Teams and, eventually, to the need for a Platform Team.
Whether teams in your organization are the first, second, tenth, or hundredth
Application Product, Platform, or Leadership Team, treating the subject being delivered
as a product is the success factor.
Let's leave you with our top 10 tips for transformation towards DevOps Culture and
Practice with OpenShift:
1. Never forget the foundation—building a foundation of culture and technology
practices enables business agility.
2. Never stop improving—whether it is the foundation, the technology, or the
products.
3. Never stop learning—the entire system of work is focused on being able to learn
at every level. Whether it be a product, a team, a platform, or an organization's
strategy, continuous learning should be central to everything.
4. Never done. We call this #neverdone because we are working in infinite loops—
double loops of learning that are continuous and never end, right up until product,
platform, or organizational shutdown.
5. Show, not tell. This one might sound overstated, but one of the truest measures of
progress is working software. We need it to get feedback and to go around the loops again.
6. Culture is key. Culture can be hard to define, but it's the sense of camaraderie that a
team has. It's the in-jokes and the banter that lead to pulling together when things are
tough, or to being willing to do that rubbish task this sprint because you know that, in
the next one, someone else will take their turn.
7. Invest in your people. A culture is hard to build but easy to break. Demotivation can
creep in and erode trust.
8. Listen to the engineers; they, in turn, need to listen to and understand the needs of
the business and users.
9. Learn by doing—experiment, experiment, experiment. Fail safely, fail fast, and, most
importantly, learn from it. Learn by doing.
10. Have fun!
Appendix A –
OpenShift Sizing
Requirements for
Exercises
OpenShift clusters support many types of footprints –
cloud, datacenter, and local. You can discover the best way
to install OpenShift in these different environments by
browsing to this link: https://cloud.redhat.com/openshift/
create/.
There is a distinction between the Red Hat supported and community versions of
OpenShift. You may wish to get full enterprise support for the Red Hat versions from
Red Hat; this is highly recommended, especially if you are doing anything work related.
The community versions are supported by the open source community; you can read
about them at https://www.okd.io/.
We have detailed using CodeReady Containers (CRC) on your local machine as the way
to get started with OpenShift. There is both a supported version of CRC, https://cloud.
redhat.com/openshift/create/local, and a community version, https://www.okd.io/
crc.html, available. To run CRC with the default settings, execute the following:
crc start
To constrain the system resources used, CRC does not contain all of OpenShift's
features. Running the technical exercises in this book against a local CRC cluster will
take a lot longer; you may find frequent timeouts and have to retry some commands.
So, either be patient or use a cloud-hosted OpenShift cluster, which has fewer
resource constraints. If you have real money to spend on cloud hosting, or you have a
virtualized infrastructure available to you, you can install an OpenShift 4 cluster using
the binary installer:
openshift-install create cluster
This will give you a running cluster with multiple master and worker nodes and you
can configure different options prior to installing. Read the docs here: https://docs.
openshift.com/container-platform/4.7/installing/index.html.
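To give a feel for what the installer needs, here is a minimal sketch of an
install-config.yaml for an AWS target. Every value shown is a placeholder (domain,
cluster name, region, replica counts), and the pull secret and SSH key are elided;
consult the installation docs linked above for the full set of options.
apiVersion: v1
baseDomain: example.com        # replace with your DNS base domain
metadata:
  name: my-cluster             # cluster name
controlPlane:
  name: master
  replicas: 3
compute:
- name: worker
  replicas: 3
platform:
  aws:
    region: eu-west-1          # target region
pullSecret: '...'              # obtained from cloud.redhat.com
sshKey: '...'                  # public key used for node access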
CRC can be started with various options. Importantly, you can set the following:
• -c, --cpus int: The number of CPU cores to allocate to the OpenShift cluster
(the default is 4)
• -m, --memory int: The amount of memory (in MiB) to allocate to the OpenShift
cluster (the default is 9216)
• -d, --disk-size uint: The total size (in GiB) of the disk used by the OpenShift
cluster (the default is 31)
You can also configure the IP address – check out crc start --help for details.
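For example, the following starts CRC with more generous resources than the defaults
using the flags above; the exact numbers are only illustrative and should be sized to
what your machine can spare:
crc start --cpus 6 --memory 16384 --disk-size 50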
We have tested the following scenarios from the book using CRC and recommend the
following minimum sizing. If you have more resources available locally, we recommend
you use them! This will improve the performance and usability when running through
the code.
CRC is optimized to run on a single machine and so has metrics and monitoring
disabled by default. This can be enabled by running the following command:
crc config set enable-cluster-monitoring true
Be aware that you will require more memory than listed above to be able to run the
monitoring stack, CI/CD, and PetBattle. For the full documentation, options, and
troubleshooting for CRC, please see https://code-ready.github.io/crc/.
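If you do want to run the monitoring stack alongside CI/CD and PetBattle, one approach
(a sketch with an illustrative memory value, assuming your CRC version exposes these
configuration properties) is to persist the settings before starting the cluster:
crc config set memory 16384
crc config set enable-cluster-monitoring true
crc start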
CRC specifics
Currently, there are two Kustomize overlays for PersistentVolumeClaims used
in OpenShift. This is because there are different storage classes available in each
environment. In Chapter 7, Open Technical Practices – The Midpoint, we bootstrap the
tooling using ArgoCD. The Ubiquitous Journey file tekton/kustomization.yaml needs
to be set accordingly by changing the following commented line:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- persistent-volume-claims/aws
# - persistent-volume-claims/crc
- rolebindings
- tasks
- pipelines
- templates
- triggers
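When targeting CRC instead of AWS, the same file is adjusted by swapping which overlay
is commented out, along these lines:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
# - persistent-volume-claims/aws
- persistent-volume-claims/crc
- rolebindings
- tasks
- pipelines
- templates
- triggers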
Cloud specifics
In AWS, we make use of aws-efs (https://docs.openshift.com/container-platform/4.4/
storage/persistent_storage/persistent-storage-efs.html) for RWX storage. An updated
EFS driver is in the works. If RWX storage is not available in your environment, you will
still be able to run the examples, but not all pipeline jobs will function correctly if they
are run in parallel.
Edit the tekton/persistent-volume-claims files to use a storage class available in
your cluster (for example, gp2 is the default in AWS). You can find the storage class
name by running the following command:
oc get storageclass
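As a sketch of what that edit looks like, the fragment below shows a
PersistentVolumeClaim that pins the gp2 storage class; the claim name, access mode,
and size are illustrative, and the actual files under tekton/persistent-volume-claims
may differ:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: build-images            # illustrative name
spec:
  accessModes:
    - ReadWriteOnce             # use ReadWriteMany only where your storage class supports RWX
  resources:
    requests:
      storage: 5Gi              # illustrative size
  storageClassName: gp2         # set to a class returned by 'oc get storageclass'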
Appendix B –
Additional Learning
Resources
In this book, we've mentioned tools and techniques that we used while writing it, with
the intention that you will be able to apply them in your own domain. That list isn't
exhaustive, however, and there are other tools we've used that deserve a mention
as well:
• https://www.konveyor.io/: The Konveyor community is working on projects to
help users re-host, re-platform, and refactor their applications to Kubernetes.
• https://OpenShift.tv: Red Hat OpenShift streaming. Experience live, unfiltered
demos (with no safety nets).
• https://www.telepresence.io/: This is an open source tool that lets you run
a single service locally while connecting that service to a remote Kubernetes
cluster. Debug your Kubernetes service locally using your favorite debugging
tool.
• https://www.eclipse.org/jkube/: This is a collection of plugins and libraries that
are used for building container images using Docker, JIB, or S2I build strategies.
JKube also provides a set of tools, including watch, debug, and log, to improve
your developer experience.
• https://www.openpolicyagent.org/: This is an open source, general-purpose
policy engine that unifies policy enforcement across the stack. Open Policy
Agent (OPA) provides a high-level declarative language that lets you specify
policy as code and simple APIs to offload policy decision-making from your
software. You can use OPA to enforce policies in microservices, Kubernetes, CI/
CD pipelines, API gateways, and more.