# Startup CTO Handbook
## Essential Skills And Best Practices For High Performing Engineering Teams
## By Zach Goldberg
**Disclaimer**:
The publisher and the author make no representations or warranties of any kind with
respect to this book or its contents, and assume no responsibility for errors,
inaccuracies, omissions, or any other inconsistencies herein.
At the time of publication, the URLs displayed in this book refer to existing
websites owned by the author and/or the author's affiliates. WorldChangers Media is
not responsible for, nor should be deemed to endorse or recommend, these websites;
nor is it responsible for any website content other than its own, or any content
available on the Internet not created by WorldChangers Media.
Paperback: 978-1-955811-56-9
Ebook: 978-1-955811-57-6
https://ctohb.com https://startupctohandbook.com
# Dedications
To Max Mintz, for teaching me to learn and value the important things in life.
To every direct report I've ever had, thanks for your patience and looking past
what I'm sure were my many mistakes.
To my wife, for tolerating and supporting my many pursuits, this book included.
# Praise
> Zach Goldberg’s CTO Handbook delivers a compelling daily resource for all
engineering leaders. Whether it’s practical day-to-day frameworks or insightful
perspectives, Goldberg’s book will instantly help you tackle the most complex
issues in developing a high-performing engineering team.
>
> Michael Lopp, randsinrepose.com

> When I was stumbling around in the dark trying to figure this out for myself and
overwhelmed with several tech management books, this is the concise summary of all
the things I needed.
>
> Charlie von Metzradt, cofounder of MetricFire/Hosted Graphite
# Contents
- [Introduction](#introduction)
- [The Author](#the-author)
- [Using this Book](#using-this-book)
- [Business Processes](#business-processes)
- [People \& Culture](#people--culture)
- [Management Fundamentals](#management-fundamentals)
- [The Professional Skill Tree](#the-professional-skill-tree)
- [Kaizen: Continuous Improvement](#kaizen-continuous-improvement)
- [Coaching](#coaching)
- [Find a Management Mentor](#find-a-management-mentor)
- [1:1 Meetings](#11-meetings)
- [Skip-Level Meetings](#skip-level-meetings)
- [Coaching Managers](#coaching-managers)
- [Hiring and Interviewing](#hiring-and-interviewing)
- [Speed is Your Friend!](#speed-is-your-friend)
- [When To Hire: Headcount Planning](#when-to-hire-headcount-planning)
- [Sourcing Candidates](#sourcing-candidates)
- [Onboarding](#onboarding)
- [Performance Management](#performance-management)
- [Team Makeup](#team-makeup)
- [Leadership Responsibilities](#leadership-responsibilities)
- [Which Type of Startup CTO Are You?](#which-type-of-startup-cto-are-you)
- [The Tech-Focused CTO *AKA The Chief Architect*](#the-tech-focused-cto-aka-the-chief-architect)
- [The People-Focused CTO *AKA the VP of Engineering (VPE)*](#the-people-focused-cto-aka-the-vp-of-engineering-vpe)
- [The Externally Focused CTO *AKA The Head of Technical Sales/Marketing*](#the-externally-focused-cto-aka-the-head-of-technical-salesmarketing)
- [Technical Team Management](#technical-team-management)
- [Tech Culture and General Philosophy](#tech-culture-and-general-philosophy)
- [Tech Debt](#tech-debt)
- [Technology Roadmap](#technology-roadmap)
- [Tech Process](#tech-process)
- [Workflow](#workflow)
- [Developer Experience (DX)](#developer-experience-dx)
- [Tech Architecture](#tech-architecture)
- [Architecture](#architecture)
- [Tools](#tools)
- [Boring Technology](#boring-technology)
- [DevOps](#devops)
- [Testing](#testing)
- [Source Control](#source-control)
- [Production Escalations](#production-escalations)
- [Root Cause Analysis (RCA) Exercises](#root-cause-analysis-rca-exercises)
- [IT](#it)
- [Security and Compliance](#security-and-compliance)
- [Conclusion: Measuring Success](#conclusion-measuring-success)
- [Book References](#book-references)
- [Digital References](#digital-references)
- [Glossary](#glossary)
- [About the author](#about-the-author)
- [About the publisher](#about-the-publisher)
# Introduction
## Always Be Learning
Fast forward another few years to the summer before my freshman year at the
University of Pennsylvania. I knew for sure I wanted to study computer science as
an undergraduate, but I also had the idea in my head that I liked business. My
father had started his own business, my brother had just graduated business school,
so business seemed like a great idea. Penn is known for their dual degree programs
that let students graduate with degrees in multiple fields, like engineering and
business.
On the chosen day, having driven three hours from New York to Philadelphia, I sat
down across from Dr. Mintz, eager to hear how to game the system. I figured it was
a matter of choosing the right classes and getting sufficiently good grades to
qualify. Dr. Mintz, however, had other ideas.
We had several more coffees over the coming months, and any time I'd ask Max about
an application or a résumé, he'd steer me right back into real science. Max wanted
me to *learn*: not just to absorb whatever topic he was lecturing about at the
moment, but to get really good at learning, and learning hard things at that. Max
couldn't care less what piece of paper his students were given at the end of their
four years as long as each of them was prepared to continue learning for the rest
of their lives.
By the time I graduated college, Max had become a close friend and confidant, and
he had fundamentally shaped the path of my education. Rather than give me fish, Max
handed me a fishing pole and taught me how to attach bait and cast a line.
No single book can give you the experience Max gave me as an undergraduate student.
I make no such promises for the book you're reading now. Instead, I tell this story
to emphasize the value and impact of a focus on the fundamentals of learning
itself.
The most wonderful thing about tech is that our field is continuously evolving. The
people you work with will change. The tools you use will be updated or deprecated,
and new techniques for doing your work will come and go. As you embark on your
adventure in technical leadership, the only way to manage this change is to expect
it, accept it, and embrace the opportunity to learn and grow with your team and the
field itself.
> I want people to be serious about learning. I want them to dig in. I want them to
gain, most importantly. It's not RSA, it's not [nuanced algorithms]; those aren't
important. It's that confidence in themselves they can grow and learn outside of
academia: that means they don't need me. All they need is to be able to sit down
with books or perhaps today the internet and go off and learn things on their own.
>
> Dr. Max Mintz, 1942–2022
Most startups have a technical cofounder. This person writes the bulk of the
initial codebase, hires the first few engineers, and runs the technical show for
the startup at least through their first one or two rounds of funding.
Somewhere between hiring the third and tenth engineer, this person will stop being
hands-on-keyboard and start spending all their time managing the team. At this
point, problems often arise: the team begins shipping features more slowly, the
defect rate begins to climb, system stability may suffer, overall costs go up, and
the other founders start to worry.
Chances are the technical cofounder, or any technical leader, has spent their
entire career up to this point investing their time and effort in becoming a great
programmer, not into developing leadership skills. It should come as no surprise
then, with their leadership skills at level 1, that they're making mistakes and
costing the company time and money.
Regardless of your title and when you joined the company, if you've devoted most of
your career and experience to technology and you're now assuming responsibility for
people or a department, it's critical you realize that you're now in a leadership
role; your technical background and talents won't be enough on their own to be
successful. While some technical skills are table stakes for running a software
engineering team, the reality is, to do a good job as a leader you need to focus on
people leadership, management, architecture, and general decision-making skills.
People leadership isn't for everyone. I'm sure you've heard stories of technical
founders who stepped aside as their companies grew. Steve Wozniak, cofounder of
Apple, is perhaps the most famous example of this pattern. There is no shame in
stepping aside; Wozniak recognized that technical work was what he loved, and that
is where he wanted to spend his time. You would do well to at least consider the
same for yourself: decide if programming is your zone of genius and the work that
gives you the most joy. If it is, you'll have a great career ahead climbing to the
ranks of the most senior technical staff.
If, however, you or your circumstances have led you to conclude that managing or
leading a team is the role you aspire to, then this handbook will provide a good
starting point on how to broaden your skills on the journey to becoming a
successful technical leader.
## The Author
I had my very first startup experience during the summer after my freshman year of
college. I have no memory of why I sought out an internship at Eduware, or why they
accepted my application. What I do remember is commuting every morning to work in a
tiny room in the back of a first-floor office with four other young software
engineers. The oldest of us must have been twenty-five; I was nineteen. It was just
the five of us sitting around a horseshoe-shaped table, working shoulder to
shoulder on a .NET education application. I was probably useless as an engineer,
but I was fortunate that the oldest and most senior engineer of the group took the
time to teach me and help me understand the tools, and I gradually became more
productive.
Something about that experience in a stuffy room at the back of the office must
have left a good impression on me, as I've chosen to work at seven more startups
since: Invite Media, WiFast (now Adentro), SoChat, AutoLotto, Trellis Technologies,
GrowFlow, and Equi. At Invite Media, a display advertising and exchange bidding
company, I partnered with the CTO to lead a rapid growth phase that culminated in
its 2010 acquisition by Google for $81 million. At Google I took over site
reliability responsibilities for Invite's departing CTO and oversaw the company's
integration into Google's stack.
From there I went on to cofound WiFast, a tech company focused on democratizing and
monetizing Wi-Fi usage, serving as both CEO and CTO through our first two major
funding rounds. I've also served as Entrepreneur-in-Residence at Tencent in
Guangzhou, China, and cofounded SoChat, a cross-platform messaging app. Since then,
I've served as CTO at Lottery.com, Trellis Technologies, GrowFlow (acquired by Dama
Financial), and Equi.
I've approached each of these roles with a founder's mentality, working to
establish creative environments and advance the idea that engineering software
should be more science than art.
I've also been fortunate to learn from others throughout this journey, including
seven teams of fantastic engineers, countless consulting/coaching clients, and many
brilliant cofounders. I've also proactively sought to elevate my own leadership via
years of management coaching from one of Silicon Valley's top coaches, as well as
tapping countless mentors and reading hundreds of relevant books.
Through my reading, it's become clear to me that while there are hundreds of how-to
books for programmers and people working with specific tech or tools, and dozens of
helpful books for CEOs and CFOs on the financial side of entrepreneurship, one
thing our industry is missing is a thorough, practical resource for startup tech
leaders. We need a resource that covers all the topics in between the core skills,
and addresses the range of leadership challenges and skills so critical to our
role.
There are also plenty of blogs on how to write good code, how to run user
surveys, or how to find product-market fit. This is a book on technical team
building; it addresses the skills a leader needs to build a company, skills they
didn't learn through traditional technical education or experience.
It would seem that every technical leader faces these issues at one time or
another, and yet advice on how to handle them is inconveniently left out of nearly
every business or technical curriculum.
This book is written primarily for anybody who is presently or may in the future
find themselves managing a software engineering team, particularly as the driving
force of a venture-backed startup. It may also prove useful for individual
contributors (non-manager software engineers) as a means to gain perspective into the
kinds of tasks and demands placed on managers that may not be obvious at first
glance.
I've formatted this book as a collection of independent chapters covering a broad
spectrum of topics. It is intended to be used as a reference guide, for the reader
to pick up a chapter as it becomes helpful and not necessarily read sequentially
from start to finish. For this reason, some material is repeated in various
chapters to ensure that each chapter can stand on its own without the benefit of
the prior sections as context.
At the end of the day this book is a synthesis of my personal experience and the
resources I've found helpful, interspersed with advice and input from peers,
mentors, and advisors. If there are things in this book you disagree with or
believe are incorrect that you'd like to let me know about, or if you find this
book helpful and would like to communicate with me directly, feel free to reach out
at [email protected]. I'm also happy to discuss advising, coaching, and mentorship
opportunities at the same address.
# Business Processes
Throughout this book you'll find many descriptions of business processes. My goal
in outlining these processes is to provide a starting point for how you might
implement a solution to a problem you are facing.
Depending on the size of your team and company, what is described here might appear
overly burdensome, or it might seem too sparse and unsophisticated to
address your needs. The reality is that, as your company and team grow, you will
need to reinvent the ways you do business. Your company of five people will operate
very differently when it grows to twenty or fifty or a hundred or a thousand. I've
highlighted the core principles that matter and left it to you to adapt them to
your team as it's constituted now, and also to scale your approach as the needs and
constraints of your business change in the future.
## Management Fundamentals
The golden rule of management: do what it takes to get the best out of your team.
In technical leadership as in any other leadership role, the best measure of your
performance as a manager is the performance of the team itself. That means you
should be thinking about and spending time doing everything necessary to help
individual team members do their best work, both independently and collectively.
Helping your team succeed requires humility, as it entails consistently putting the
needs of your direct reports above your own. You will need to adjust and tweak your
style, behavior, thinking, and actions to suit the needs of members of your
engineering team. That will include being willing to be wrong, being open-minded,
and learning from your direct reports.
If you buy into this journey, know that you will make mistakes. Own those mistakes
with your team and they will trust you more for it. Also know that being a perfect
manager is not an achievable goal; the best you can hope for is to always be
improving in small ways. After a career spent managing people, you'll have learned
a lifetime of lessons about technology and human beings that will make you a more
competent manager.
*Every single person with whom you work has a vastly different set of needs.
Fulfilling these needs is one way to make them content and productive. It is your
full-time job to listen to these people and mentally document how they are built.
This is your most important job. I know the senior VP of engineering is telling you
that hitting the date for the project is job number one, but you are not going to
write the code, test the product, or document the features. The team is going to do
these things, and your job is to manage the team.*
In that one succinct paragraph, Lopp hits on all the key points of management.
First and foremost, you are a listener, a personal and career development coach,
and a shield against external forces in the world which might distract, stress, or
otherwise prevent your team from doing their best work.
### The Professional Skill Tree
Many video games include the concept of a skill tree. For those unfamiliar, a skill
tree is a sequence of skills or abilities that are unlocked as the player
progresses through the game. Each skill is unlocked by spending skill points.
Here's the rub: at any given time, there are more skills to unlock than you have
skill points to spend. The skill tree forces you to choose some skills before
others. The skill tree provides a reasonable model for your career as well. At any
given job, you're likely accumulating skill points toward some skills and not
others.
In your journey to tech leadership, you've already invested many skill points into
the technical/engineering branch of the skill tree. My key insight for you is that
the management branch of the skill tree is equally vast, and if you've not been
investing points in that area up to now, even if you're a Level 100 engineer you'll
start your new leadership position as a Level 1 manager staring at a mighty oak
tree of yet-to-be-unlocked crucial skills. Once your company has more than a small
handful of engineers, these skills will make the difference in your ability to
scale up with the team.
### Kaizen: Continuous Improvement
*Kaizen* is the Japanese word for improvement. The term was popularized as part
of the Toyota Production System. At Toyota, all personnel are given a (literal or
metaphorical) red handle to pull that stops the entire production line. If a worker
identifies a problem with production, the idea is for them to pull the red handle,
gather coworkers and resources to diagnose the issue, and then resolve it before
work can continue. By empowering everyone on the team to improve the process and to
be invested in its efficacy, Toyota can cost-effectively build higher-quality cars.
I'm not the first to suggest that software engineering has much in common with
traditional manufacturing (see *The Phoenix Project* by Gene Kim). In this case,
make the metaphor real: provide your team with a digital red handle and encourage
them to focus on continuously improving everything you do. Members of great teams
understand that, over time, the team will change, customer requirements will
change, tools will change, and the team will need to revisit past decisions and
make improvements.
*Kaizen* applies not only to your team's process but also to individuals. Your best
team members will embrace the idea of continuous education and continuous
improvement, and treat mistakes not as failures but as opportunities for
improvement.
### Coaching
Your principal role as a manager is to get the best out of the people on your team,
so in many scenarios it's more appropriate to describe your role as that of a coach
rather than a manager. A coach is somebody who is on your side, a source of wisdom
and guidance to everyone on their team. A coach is quick to provide critical
feedback, but also the first to celebrate and praise success.
Your goal in your interactions with your direct reports, whether they are
individual contributing engineers or managers themselves, is to be the best coach
they've ever had.
#### Find a Management Mentor
The story goes that one of our investors, First Round Capital, was hosting a
management summit in San Francisco, about a thirty-minute drive from our office.
Thus far I'd found First Round folks were high quality, so when I came across the
invitation, their support temporarily muted my ever-present fluff allergy and I
signed up.
When I drove up to the summit, I was encouraged that the audience was relatively
small, only about thirty people, enough to fit into what felt like a high-school
classroom. I sat down at the high-school folding-tray top desk, opened my notebook,
took out a pen, and wrote the date and "First Round Capital Management Summit" at the
top of the page. Sadly, that would be the only thing I wrote all morning.
The first half of the day had three or four speakers talk about various topics,
every one of them lacking in any actionable advice or insight. In other words,
fluff. As we broke for lunch, I contemplated driving back early and getting in half
a workday at the office. I checked the agenda and noticed that we did have an
entirely different roster of afternoon speakers, so I decided I'd at least hear out
the first one.
The first speaker after lunch was Jonathan. Unlike prior presenters, he had no
slides and he seemed a little rushed, perhaps a tiny bit unprepared, or maybe just
nervous, as he walked to the front of the class. The first words out of his mouth,
however, told a different story:
I hung on every word Jonathan had to say. As the half-hour session wrapped,
Jonathan said he had to catch a flight, and somewhat hurriedly ran out of the room.
I looked down at my notebook, processed that I had taken three pages of notes in
the last thirty minutes compared to none in the first four hours, then stood up
from my chair and ran after him.
I managed to catch him just as he was getting into a yellow cab. Somewhat
exasperated, Jonathan asked me what I wanted. I asked if he did private coaching.
He replied, "Ask the organizers to connect us," then got into the cab and took off.
Cleverly, he didn't commit one way or another to coaching on the spot; he left
himself the opportunity to do diligence on me via the organizers before deciding if
I was worth his time. Luckily for me, when I asked the organizers to put us in
touch, the contact said nice enough things about me that Jonathan agreed to an
introductory coaching session. Jonathan and I would go on to work together for many
years, across multiple companies, and I can confidently say that his mentorship has
been invaluable in my leadership journey.
#### 1:1 Meetings
A 1:1 meeting is a private meeting between you and a direct report. It's tempting
to treat 1:1s as status check-in meetings, and for the agenda to focus entirely on
business or technical topics immediately at hand. It's all right if the agenda
includes those topics, but this is your opportunity to establish a coaching
relationship with your direct report. You should use this time to really get to
know and understand how your report thinks, draw out and identify their strengths,
and recognize weaknesses you can address to help the person do their best work.
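#### Skip-Level Meetings
A skip-level meeting is a 1:1 between you and someone who reports to one of your
direct reports. A few tips for running them:
* Put the employee at ease by making sure they know the purpose of the meeting:
you're not there to problem-solve or make decisions that are better handled by
their direct manager.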
* Let them know that you want to build a relationship and hear their insights on
leadership, culture, strategy, and company direction.
* There are many good templates/agendas for skip-levels on the internet.
Here's one from managementcenter.org that I recommend:
[ctohb.com/skip](https://ctohb.com/skip).
#### Coaching Managers
As your organization grows, you'll likely get to the point where you no longer have
any individual contributor direct reports. Every individual contributor who actually
writes code is managed by a middle manager. It should be obvious, then, that
effective middle managers are critical to the performance of your organization.
It's your job to make sure that your managers have the support, resources,
training, and mentorship they need to enable them to do their best work coaching
the engineers on their team.
- Be clear with your managers about what management means to you, including your
expectations for coaching, 1:1s, performance management, etc.
- Codify your management expectations unambiguously in internal documentation and
make them part of management hiring and onboarding.
- Supply resources for your managers to pursue ongoing learning and professional
development. This might include purchasing company subscriptions to learning
programs, sponsoring employees to attend conferences, hiring management coaches, or
formalizing internal or external mentorship programs.
- Include the cost of these training resources for every member of your team in
your regular budgeting process.
- Encourage your managers to become thought leaders in your industry. This could
take the form of a company blog, participating as a guest on technical or
management podcasts, or speaking at conferences.
Your engineers should be venting to you regularly. If they are, don't panic: this
is totally normal and, in fact, highly desirable. You should have a 1:1 meeting with
every member of your team at least once every two weeks, if not weekly. Your goal
in these meetings is to create a safe space for your engineers to tell you what's
on their mind, and for you to actively listen and engage on these topics.
With strong engineers, that will mean they're aware of imperfections in the world
around them and they want to tell you about them. Your job is not to solve every
problem they bring up; your job is to listen, to ask questions to clarify your
understanding, and to convince them that you do understand, and then steer them
toward solutions. From time to time, there may be a direct ask, or something you
can directly help with, but that's not the norm.
The value you're providing here is making your direct reports feel heard and
coaching them to productively handle issues themselves.
The easiest way to bridge your agenda and theirs is to have a shared document,
perhaps with some structure/template, to elicit the kinds of discussion topics you
think are important. Having this document available prior to your meeting also
gives you and your employee a shared place to capture ideas between meetings and
to structure thoughts in advance, all of which helps make the meeting time
more productive and efficient.
There are several SaaS tools that help facilitate 1:1 conversations as well.
Notable examples include Culture Amp and 15Five. You don't need a tool though; a
simple document works equally well. The template I use is available at
[ctohb.com/templates](https://ctohb.com/templates); it includes prompts for
discussing liked/wished-for items at a personal, departmental, and company level,
as well as bidirectional feedback between manager and employee.
Establishing a playbook for these engineering 1:1s is another useful way to make
sure these meetings address a consistent set of topics and don't go off track. Your
playbook should ensure that your 1:1s touch on the following:
**Performance and Development:** Often it's your engineers seeking advice on how
they can improve something
**Clarity:** Engineers may have general thoughts about something and are looking
for your perspective, or to see if you have different information than they do
about something
**Context:** What's going on more broadly at the company, and how does a
contributor's work relate to those goals/objectives
The phrase *Radical Candor* was coined by Kim Scott in her book *Radical Candor*.
The book defines Radical Candor as communication that incorporates both praise and
criticism, and ensures that the delivery involves both caring personally while
challenging directly. I think the point is best made in contrast to three other
kinds of communication outlined in Scott's book:
**Obnoxious Aggression:** Sometimes referred to as brutal honesty or front-
stabbing, characterized by direct challenge but lack of individual caring, perhaps
demonstrated by insincere praise or unkind criticism
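**Ruinous Empathy:** Communication that comes from a place of caring personally but
lacks a direct challenge
**Manipulative Insincerity:** Communication that neither shows personal care nor
challenges directly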
I encourage you to read Scott's book, but if you don't, then at least be aware of
these terms and use them as a coaching tool to move your team toward regularly
practicing Radical Candor.
There's nothing worse for an employee than feeling like their manager doesn't
communicate enough with them. In the absence of information, it's a natural
instinct to assume the worst-case scenario; a lack of information can also be a
prime source of anxiety and confusion.
Overcommunication, by contrast, has very few consequences. The worst that can be
said of overcommunication is that it can prove a distraction or become redundant,
which are problems easily remedied with a bit of thoughtfulness as to the form of
overcommunication. It's no surprise, then, that most startups invest heavily in
building overcommunication into their culture, often including the phrase as a
company core value.
Pretty much anyone you interact with nowadays has either been using email for
twenty-five years, or since they were in early grade school, so of course this
means they know how to use it effectively, right? Unfortunately, effective use of
email at work is not necessarily common sense. So it falls to you to
encourage best practices. Here is some general advice for using email effectively:
* Rather than having email open all day or monitoring it continuously, check
email at fixed times each day.
* Disable email notifications on your phone. Though this one in particular may
seem blasphemous, I encourage you to try it. Not only does it significantly reduce
the number of notifications you receive, but you'll find yourself building a new
habit of proactively checking email when you're ready to engage. This makes email
an intentional activity instead of a continual background nuisance.
Chances are high that your company has already adopted some form of synchronous
chat platform; in the early 2000s it was commonly Google Chat or an MSN messenger
product, while in the 2020s it's more commonly Slack, Microsoft Teams, or Workspace
from Meta. If you're not presently on one of these platforms, it's worth
considering their adoption.
The vast majority of companies, from day-one startups to goliaths of 100,000-plus
employees, have adopted them with great success.
Achieving that success means being mindful and planning around some inherent flaws:
synchronous chat programs require both parties to stop what they're doing and
engage, and they result in conversations that are poorly organized and do not
produce lasting artifacts for your team to reference. You can and should recognize
these downsides and compensate for them by setting up basic etiquette and
expectations for your team in how to use these tools.
Slack's own blog includes a great article with some common best practices at
[ctohb.com/slack](https://ctohb.com/slack).
Here are a few recommendations for working with synchronous chat tools:
* Lean into notification schedules and do-not-disturb features. You should also
encourage members of your team to set up a do-not-disturb schedule in any
synchronous chat program to minimize interruptions in focus/flow time.
A classic example is the dreaded "the feature is broken" bug report. In nearly all
cases, bug reports should go to a ticketing system rather than a direct message. An
engineer receiving a bug report in a message does not have the context to know
which feature is broken or in what way it's failing to meet expectations. So, the
reply from the engineer will likely consist of a handful of questions, requiring
more round trips with the reporter, costing time and creating frustration.
Contrast that with a bug report that includes full written reproduction steps as
well as a video of a user trying to use the feature and demonstrating the failure.
More than likely, this approach will enable the engineer to produce a fix without
requiring any further follow-up.
The bottom line is this: any time you send a message to somebody in an asynchronous
format, give that person all the information they need so they can understand,
process, and reply in a way that advances the conversation.
#### Documentation
* Live the value yourself and set an example for the team. Once I moved a team from
writing zero internal wiki articles per week to writing several per day in the
course of about eight weeks. Literally the only thing I did to encourage this
cultural change was to start writing articles myself. Everything I did that made
sense to share with the team I wrote up as an article, and I'd make a point of
sharing links to those articles whenever appropriate. Very quickly, other managers
started doing the same, and within two months everyone on the team was contributing
every week.
* Encourage the team to practice the Boy Scout Rule (always leave the campsite
cleaner/code better than you found it). If they find documentation that is
inaccurate, they should either update it themselves or explicitly mark the document
as deprecated.
One key area of documentation you should pay special attention to is how a
developer gets started writing code within a particular project or repository. I
recommend every repository have a README.md file that explains a minimum of four
things:
Every organization has its own distinct culture, and its own style of internal and
external communications. One of a leader's key responsibilities is to make sure
that culture always supports the goals of the organization rather than impeding
them.
One element of the internal culture of a technical organization that tends to get
out of hand is the generation of made-up acronyms that can multiply over time and
obscure and over-complicate the communication they were intended to streamline. It
may seem like a minor annoyance, but it's symptomatic of a poor communications
strategy that can spiral out of control, particularly as it can place barriers
between those in the know and team members who have no clue what the acronyms stand
for. As an organization's technical leader, it's your job to set the tone and
define the culture, and although the proliferation of made-up acronyms most likely
won't start with you, it's your job to recognize when it's happening and shut it
down before it gets out of hand.
In a January 2018 memo to SpaceX employees, Elon Musk called for a No Acronyms
policy. I've put that same policy into practice ever since, and I wholeheartedly
endorse it. The excerpt below comes from that email, titled "Acronyms Seriously Suck"
(ctohb.com/acronyms):
> There is a creeping tendency to use made-up acronyms at SpaceX. Excessive use of
made-up acronyms is a significant impediment to communication and keeping
communication good as we grow is incredibly important. Individually, a few acronyms
here and there may not seem so bad, but if a thousand people are making these up,
over time the result will be a huge glossary that we have to issue to new
employees. [...] This is particularly tough on new employees. [...] The key test
for an acronym is to ask whether it helps or hurts communication. An acronym that
most engineers outside of SpaceX already know, such as GUI, is fine to use.

In practice, most acronyms act as a barrier and not a benefit to clear communication.
They make it harder for new employees to understand what's being discussed, they
require effort from the team to maintain a list of definitions someplace, and
overall, they save far less time in writing and speaking than they may seem to at
first glance.
This may seem blasphemous, or like an overbearing and silly rule to try to enforce in
a culture. I'm not proposing you punish people for using acronyms or write the rule
on the walls of your office. Quite the opposite; especially at a smaller
organization, it takes only a very light touch to make no-new-acronyms a part of
your culture. Get buy-in from your executive team to not create acronyms, and then
encourage them to issue a gentle reminder to their managers to do the same, and
you'll be amazed how quickly everyone kicks the habit. For new employees, a sentence
or two in your onboarding documentation is often a sufficient nudge: between that
gentle note and the absence of acronyms around them, they'll be far less likely to
create new ones themselves.
Regardless of the type of meeting, any meeting you set should have a clear
objective that is known to invitees in advance. Ideally, everyone will have enough
information before the meeting to know if it'll be valuable for them to attend.
Just as importantly, your culture should empower people to make the decision not to
attend if they judge it not a good use of their time.
As a leader at a startup, you'll quickly find your time is split between many kinds
of work, and potentially dozens of hours of meetings every week. If you don't yet
have a system in place that works for you, now is the time to invest in some good
habits and get organized. I recommend both Stephen Covey's *The 7 Habits of Highly
Effective People* and David Allen's *Getting Things Done* as places to start on
this journey.
One of the ways you can enable productivity in your team is by creating, or
allowing for, large blocks of free time for your engineers. Context switching (our
tendency to shift from one unrelated task to another) is expensive (see The
Multitasking Myth at ctohb.com/myth), so the more time you can create for engineers
to do the work of engineering without switching to other tasks (email, phone calls,
meetings), the less total context switching penalty you pay.
I'm a fan of declaring an informal meeting-hours window for the team. Encourage the
engineering team, and cross-functional teams, to schedule meetings during a
two- or three-hour window every day, and try not to schedule engineers outside that
window. That leaves a healthy amount of time *every single day* for your
engineering team to focus on the core of their work while also making space for
necessary informational and conflict resolution meetings. If your team is in more
than three hours of meetings per day (fifteen hours per week, well over a third of
their time!), you should take a close look at those meetings and ask yourself if
they can be consolidated and reduced.
Other teams have found success with no-meeting days, setting aside one or more days
each week when nobody schedules any recurring meetings. Just keep in mind that, in
a forty-hour work week, your goal is to reserve as many of those hours as possible
for your engineering team as contiguous blocks of focus time. A single no-meeting
day implies an eight-hour block of focus time, but there are still thirty-two other
hours to consider, so it doesn't solve the whole problem.
* Two hours: average time spent in other agile ceremonies (sprint planning,
retrospectives, etc.)
In total, that's about thirteen to seventeen hours of the week used up for meetings
and communication. If you add another few hours on top of that for time spent
context switching and unplanned miscellaneous interruptions, quickly you're looking
at only half of a forty-hour work week, at best, available for actual focus time. If
you're not careful about when meetings are scheduled, then not only will your
engineers have only twenty hours left for their core tasks, but also they won't
have them in contiguous blocks, further reducing productivity.
I present this contrived example to drive home the point that providing engineers
with large blocks of focus time to do engineering does not happen by accident. It's
up to you, as the leader who determines how their time is spent, to develop a culture
and process that consolidates and minimizes these distractions and maximizes time
available for individual contributors to do actual engineering.
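To make the point about contiguous blocks concrete, here is a minimal Python sketch, not taken from any particular tool, that lists the meeting-free gaps of at least two hours in a single workday. The 9:00 to 17:00 working hours, the two-hour threshold, and the meeting times in the usage example are all illustrative assumptions.

```python
from typing import List, Tuple

# A meeting is a (start, end) pair in hours on a 24-hour clock, e.g. (13.5, 14.0).
Meeting = Tuple[float, float]


def focus_blocks(meetings: List[Meeting], day_start: float = 9.0,
                 day_end: float = 17.0, min_block: float = 2.0) -> List[float]:
    """Return the lengths, in hours, of meeting-free gaps of at least min_block hours.

    The defaults (a 9:00-17:00 day, two-hour minimum) are assumptions for illustration.
    """
    blocks: List[float] = []
    cursor = day_start  # earliest time not yet taken up by a meeting
    for start, end in sorted(meetings):
        # Clip each meeting to the working day; skip any that fall entirely outside it.
        start, end = max(start, day_start), min(end, day_end)
        if end <= start:
            continue
        if start - cursor >= min_block:
            blocks.append(start - cursor)
        cursor = max(cursor, end)  # overlapping meetings simply extend the busy span
    if day_end - cursor >= min_block:
        blocks.append(day_end - cursor)
    return blocks


# The same two hours of meetings, scattered versus consolidated into one window.
scattered = [(10.0, 10.5), (12.0, 13.0), (14.5, 15.0)]
consolidated = [(13.0, 13.5), (13.5, 14.5), (14.5, 15.0)]
print(focus_blocks(scattered))     # [2.0]
print(focus_blocks(consolidated))  # [4.0, 2.0]
```

Both schedules contain the same two hours of meetings, but consolidating them into an afternoon window turns one usable focus block into two, including a four-hour morning stretch.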
HIPPO is a casually used industry acronym, short for Highest-Paid Person's Opinion.
Whether you're the highest-paid person or not, your title will imply that you are.
The thing about the HIPPO is that most employees are reluctant to challenge it.
I strongly encourage you to minimize this effect in discussions by regularly
opening the door for challenges, being overtly open to being wrong, and then acting
on and championing ideas other than your own. You'll know you're doing this often
enough when you feel like you're doing it too often. By the time you feel you're
overdoing it, you've probably reached the minimum that most employees need to
actually believe you.
In spite of your best efforts to come across as approachable and open to being
convinced of other approaches, your presence in a meeting will often still have a
subconscious effect on other attendees, especially if they're more than one level
below you in the organization chart. Be mindful of this effect and do your best to
attend meetings only when you truly add value and the team needs you there. For
everything else, you can get the notes or recording after the meeting is over.
In general, to-do lists are a very unsophisticated form of task management: they
lack structure, any sense of prioritization, and a time component. Instead, I
recommend using a calendar-based to-do list. Rather than putting work items in a
generic list, slot them in your actual calendar.
Using your calendar as a to-do list has several advantages. It blocks off dedicated
time for actually doing items on your to-do list and ensures you're not
overcommitting your time. It allows for prioritization by moving items around, and
it also makes it easier to predict when things will get done. Most calendaring
systems also have built-in reminder mechanisms that will notify you when you're
scheduled to do a particular task.
Every now and then (say, once a month), I encourage you to do a historical review of
your calendar and measure how you've spent your time. For example, Google Calendar
has built-in analytics and requires only very minimal adaptation of your
calendaring habits to provide accurate summaries of how time was spent. When
reviewing this data, ask yourself if the ratio of time spent on various types of
activities makes sense for the goals you're trying to achieve. It's also good to
check in and confirm that you're spending your time in ways that play to your
strengths and bring you personal satisfaction. Often just having this kind of data
presented matter-of-factly can provide good motivation for organization and
productive change.
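If your calendar tool doesn't summarize time for you, the same review can be done by hand. Below is a minimal Python sketch that assumes you've already exported a month of events and tagged each with a category of your choosing; the categories and hours in the example are hypothetical, and the export step itself is out of scope.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each event is (category, hours). The category labels are your own, e.g. taken
# from calendar colors or a naming convention; they are assumptions here.
Event = Tuple[str, float]


def time_by_category(events: Iterable[Event]) -> Dict[str, float]:
    """Return each category's share of total scheduled time, as a percentage.

    Categories are ordered from most to least time spent.
    """
    totals: Dict[str, float] = defaultdict(float)
    for category, hours in events:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {category: round(100 * hours / grand_total, 1) for category, hours in ranked}


# A hypothetical month of tagged calendar time.
month = [("1:1s", 16), ("hiring", 12), ("exec meetings", 8), ("focus work", 4)]
print(time_by_category(month))
# {'1:1s': 40.0, 'hiring': 30.0, 'exec meetings': 20.0, 'focus work': 10.0}
```

The point isn't precision; it's seeing at a glance whether, say, hiring is crowding out focus work in a way that doesn't match your goals for the month.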
When presented with a large, ambiguous challenge, such as taking over managing a
new team, or diagnosing and improving an individual's underperformance, I like to
use a three-step process. These steps should be done sequentially to come up with a
plan to address a particular problem.
1. Ingest
* You know you've ingested a sufficient amount of data when you start to see the
same thing multiple times and you stop seeing new patterns.
* Example: multiple people comment about a member of your team's performance, and
after a handful of sessions of active listening and getting curious, you've stopped
getting further new information on the performance issue.
2. Synthesize
* Once you've collected a sufficient body of data related to your problem, take a
step back from collecting information and give your brain time to process. I
recommend allocating at least a few days at this stage. Try to deliberately stop
taking in new information and spend this time looking at the problem from different
angles. Take notes, draw diagrams, play golf, take a shower, or whatever helps you
think through the problem and come up with an analysis that fits the data and is
actionable.
* To continue the above example, at this point you might try to come up with
various hypotheses for why that individual is underperforming: How are they
spending their time? Is it a skills mismatch, an expectations mismatch, or is
something going wrong in their personal life, etc.?
3. Act
* Once you've got a thesis, it's time to actually put a plan into place. When
you're taking action, it's important to validate that your plan is achieving the
desired results. Whenever possible, test, validate, and if necessary start the loop
over again.
There are three models for making decisions with your team. As the manager,
you can make the decisions entirely yourself and present the result as a *fait
accompli* to the team. There's also the opposite approach: you start from scratch
and entirely co-develop the material with some or all of the team. The third
approach is a compromise between the first two: develop a draft yourself and
present it as a straw man to the team as a starting point for collecting feedback
and iterating to get to a final version. The key differences between these
techniques are the amount of time they take, and how much buy-in you get from the
team. And I encourage you to optimize for buy-in: ensuring everyone on your team
understands decisions and can be a champion for those decisions is the only way to
ensure you're all marching in the same direction. As Marty Cagan of Silicon Valley
Product Group puts it, you want your team to behave like missionaries, not
mercenaries.
**The Straw Man Model:** Codeveloping decisions starting with a straw man takes a
medium amount of time, and depending on execution can produce healthy buy-in.
Which model you use for any given decision is up to you, and I encourage you to be
thoughtful and deliberate about that choice. Don't be afraid to revisit it if you
feel you've chosen the wrong model.
Type 2 decisions are the opposite. They can and should be made quickly by high-
judgment individuals or small groups. Which exact shade of gray a button is can be
a Type 2 decision, as it's easily changed down the line.
Bezos's advice, which I'll echo here, is that using Type 1 decision-making for Type
2 decisions leads to slowness and failure to experiment and innovate.
Most of your day-to-day technical decisions are Type 2 and are best made quickly
and revisited or confirmed after you've collected more data via a prototype or MVP
implementation. This is because the most expensive element of most startup
technical decisions is the engineering team's time invested in the solution. If you
deliberately constrain the time (and thus cost) invested into validating a
reversible decision, you lose only a little. Most Type 2 technical decisions
become irreversible only after you've invested considerable time and new
engineering on top of them, so be rigorous about evaluating progress early on. When
in doubt, make the decision to reverse early.
As the leader of your team, ultimately you are accountable for achieving your
team's objectives. If the team as a whole fails to meet milestones, that's on you.
So, when there is a conflict or disagreement within the team, you need to engage
thoughtfully, and then be prepared to make a decision by following one of these
three narratives:
1. We'll go with your way because you've made a clear and convincing argument that
it's superior.
1. We'll go with my way because it's superior, and I'll explain why using the
additional context I have as a result of being a manager with a broader scope of
responsibility.
1. We'll go with my way because we can't identify any objective reason why one way
is better than another. In other words, it's a tie, and since ultimately I'm the
accountable party for success here, we'll go with my approach. I will own the
success or failure of this decision.
In *The 7 Habits of Highly Effective People*, Dr. Stephen Covey offers the
Urgent/Important Matrix, adapted from a concept introduced by President Dwight
Eisenhower in a 1954 speech. In the Urgent/Important dichotomy, work is classified
by both its urgency (i.e., time sensitivity) and importance (i.e., impact). The
result is a four-quadrant chart:
I provide this framework here as a reminder to consider the value of various tasks
that arise. The tech leader is regularly bombarded by feature requests, debt to
prioritize, defects, etc., and having the perspective to ask whether any given item
is important and/or urgent is a very useful and quick triage mechanism.
There are two ways you join a team as a technical leader: either you start on a
team in a non-leadership position and grow or are promoted into the role, or you're
hired to lead a team. If you're promoted into a role, presumably you have good
business context and have demonstrated technical competency but have not yet proven
yourself in management and leadership. Conversely, if you're being hired into
leadership, you likely have a track record in management but lack context, history,
and background on the organization's business, technology, and people. It follows,
then, that your approach when starting in the role should differ to compensate for
the respective weaknesses.
If you're being promoted into a management role and have invested time in
developing management skills, or if you have experience and a track record as a
manager, then it's likely you won't find this transition very scary, and your goal
will be to continue leveling up as a manager. If this is your first management
role, however, you'll have a larger mountain to climb.
People management is an entirely new skill set from the technical skills that got
you your promotion. Technical skills are of course a prerequisite to being a good
technical manager, but they're far from sufficient. Understanding this, and
developing the additional skill set that your new role requires, will be key to
succeeding as a manager.
If you were promoted from a backend engineer writing code in C to a frontend engineer
writing code in TypeScript, what kind of things would you do? You might read a book
on TypeScript, do some TypeScript coding exercises, join a TypeScript user group,
read some TypeScript blogs, etc.
Keeping in mind German Chancellor Otto von Bismarck's dictum, **Fools learn from
experience; I prefer to learn from the experience of others**, here are some
actionable tips for new managers:
* Find a management mentor or coach. Talking through people's problems and gaining
additional perspective is invaluable. (See the sidebar Find a Management Mentor,
page 13.)
* Pay attention to burnout signs and take a break before it happens. Managing
people does not have to be an eighty-hour-a-week activity.
See the recommended reading list at the end of this book to find more resources for
developing your management skills.
If you've been hired to lead a team (as opposed to being promoted into a leadership
job or assuming a leadership role as a first-time startup cofounder), presumably
you have both technical and management experience. Your challenge as a hired leader
of an existing team is to integrate yourself into your new team as smoothly as
possible and build trust with your new peers. It should come as no surprise that my
advice is to focus more on the people than the technology when managing your new
team.
Below, I outline some short-term goals for an externally hired tech leader, plus
some questions you should try to answer very early.
Goals:
* Build trust with the technical team. Listen and be thoughtful about when/how
quickly you start adding value or changing things.
* Build trust with other teams/leaders, making reasonable commitments and following
through on them.
* Learn about the people you're working with and their history with the
company/product/technology.
* Diagnose the highest-impact people-specific challenges within the team. Are there
staff members who are inappropriately leveled, either underperforming or
overperforming their roles? Do any cultural challenges need course correction?
Taking decisive action early on to course-correct for culture is a great way to
build trust with team members who likely consciously or subconsciously were
suffering from the culture issue.
* Diagnose the highest-impact technical problem areas for the team as a whole and
put together a set of short-term, medium-term, and long-term objectives for the
team.
Questions:
* Who was running technology before? Is that person still on the team?
* It's common for tech leaders to discover that they don't want to grow into people
management. Where this has happened, you'll be stepping into a people management
void. The other common scenario is that either there was no prior tech leader or
they've left the job due to underperformance, and you are inheriting a large amount
of tech debt.
* What problem does the CEO say you were hired to solve? What problem do you think
you were hired to solve?
* What pain points exist between the technical team and the rest of the company
today?
* What pain points are the highest priority within the technical team?
After you've had some form of technical leadership on your résumé for a while,
you'll likely start to get friends, or friends of friends, reaching out for advice.
For the most part, I recommend taking these phone calls, not only because the
networking is valuable, but because the questions you are asked may force you to
think through and put words to ideas you're subconsciously working on. They say
teaching others is the best way to really learn something yourself.
**You can either be a shit funnel or a shit umbrella.** - Todd Jackson, Gmail
Product Manager (see ctohb.com/umbrella and ctohb.com/keytogmail)
Questions, concerns, and ideas about your product, absent any strict process for
directing them elsewhere, will find their way to management. That includes not just
you but everyone in management in your organization. Managers are the default
inbox, and the crux of Jackson's statement is that your team is the default outbox.
You hear, "Hey, there's a bug in X," and you think, "OK, engineer Y wrote that
feature; go send them the bug." That would be an example of funneling inbound
directly at your team.
A better strategy is instead to act as an umbrella for the team. Rather than
directing all the inbound in real time to the team, a good manager organizes,
prioritizes, and gives the team a structured queue to work with. Your goal is to
help the team focus, limit distraction, and provide a place where inbound requests
should go so they can be efficiently processed.
Management should be monitoring that bug queue process to ensure the queue stays at
a manageable length, and adjusting staffing or process if product quality isn't
meeting targets.
You should prioritize your queues based on importance and urgency. If something of
critical importance with extreme time pressure arises, it should be put into the
queue and escalated to the top. Then apply common sense as to how you handle it. If
you need to call somebody to ensure they know it's there, then so be it, as long as
this is the exception and not the rule.
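One way to picture the umbrella's structured queue is as a priority queue. The sketch below is a minimal Python illustration, not a recommendation of any particular ticketing system: items are ranked by importance first and urgency second (an assumption loosely following the Urgent/Important Matrix), with ties served in arrival order, so something both critical and time-sensitive jumps to the top without bypassing the queue.

```python
import heapq
import itertools
from typing import List, Tuple


class InboundQueue:
    """A minimal triage queue: important items first, then urgent, then FIFO."""

    def __init__(self) -> None:
        # Heap entries are (sort_key, title); heapq always pops the smallest key.
        self._heap: List[Tuple[Tuple[int, int, int], str]] = []
        self._arrival = itertools.count()  # tie-breaker keeps equal items in arrival order

    def add(self, title: str, important: bool, urgent: bool) -> None:
        # Negate the flags so that True (-1) sorts ahead of False (0).
        key = (-int(important), -int(urgent), next(self._arrival))
        heapq.heappush(self._heap, (key, title))

    def pop(self) -> str:
        """Remove and return the highest-priority item."""
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)


queue = InboundQueue()
queue.add("Typo on the settings page", important=False, urgent=False)
queue.add("Checkout is failing for all users", important=True, urgent=True)
queue.add("Monthly report export is slow", important=True, urgent=False)
queue.add("Demo account needed for a sales call today", important=False, urgent=True)
while queue:
    print(queue.pop())
```

Whether importance should outrank urgency, and how many priority levels you need, is a judgment call for your team; the point is that escalation happens inside the queue rather than around it.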
As a technical leader, your job is not just to manage the technical team; it also
includes serving as the technical representative in the C-suite. Your role is to
represent engineering and technology in all of the company's highest-level
strategic discussions and to ensure that engineering and technology are on the
right course for the business. At times, this means having difficult conversations
with other leaders that require vulnerability and humility, and conversations that
enable you to work through conflict and are grounded in mutual trust. These
conversations are key to the job, and for them to be possible you must be deeply
engaged with your leadership team, thinking of them as your first team.
You're likely a highly technical person. You probably enjoy your technical meetings
the most, or at least find solving technical problems with your team familiar and
highly satisfying. So, it's easy to fall into a pattern where you spend most of
your time with technical teams, and to adopt a "my team is the technical team"
mindset. This is a great way to build rapport within the tech team, and there are
times when that is the most critical relationship to invest in.
My guidance is simple: do not let the technical shiny objects distract you or limit
your investment in relationships with non-technical leaders. Your relationships
with other leaders and your trust with your non-technical peers will give you
credibility and enable you to guide the business to making good technical decisions
as a whole. Trust outside the technical team is built the same way any
other trust relationship is built: great communication, regular expectation setting
by making and meeting commitments, and owning up to mistakes/failures if and when
they occur.
The technical leader or CTO who spends all their time deep in code with engineering
and barely participates in the leadership team will have little credibility when
trying to convince the other C-levels to invest further in engineering. Or worse,
they won't even be asked for their input when the time comes to make a hard call.
Other leaders will not have the context to understand the value of what the tech
team is asking for or the perspective on how engineering is operating at that
moment, and they'll lack the shared vision for what it can be in the future. Only
by regularly engaging with the rest of the leadership team, sharing that context,
and being part of the conversation along the way can you, as the technical leader,
ensure that the leadership team has a shared understanding of how engineering helps
the organization and how it needs to grow over time.
Every CTO-CEO relationship is different, though there are a few elements that are
common to all such partnerships, as well as some key prerequisites for a healthy
relationship.
The CEO must have complete trust in your ability to lead the technical team to meet
business objectives. Building that trust means you need to be able to communicate
well with the CEO, both through proactive communication and making sure the CEO
always has enough context to ask good questions.
Communicating well means learning to speak business language and avoiding lapses
into tech jargon during leadership conversations. You want to empower the CEO to
communicate with you, and the more you speak their language (if they are not
technical), the more information you can get out of your interactions, and the
better the two of you will get along.
You and the CEO need to have a shared understanding of the direction of the
business and be able to engage in constructive and perhaps even contentious
conversations to ensure the depth of that understanding. Trust applies to overall
business direction as well as specific objectives.
There are many ways to build this, but you need to establish trust regardless of
the approach you take. Engage in shared non-work activities, find areas where you
share personal values, and use specific tools and exercises to build that trust
(see Brené Brown's BRAVING Inventory at ctohb.com/braving).
As with all C-levels, the CTO and CEO should have strong alignment on company
culture and values. It's particularly important that you focus on building a
positive culture within engineering, and between engineering and the rest of the
company. The technical team is often a very large if not the single largest line
item in a startup company's budget. Technical staff are also often the most
competitive roles to hire for, making recruiting and inevitable involuntary
employee turnover more expensive in engineering than in other departments.
Strong alignment between the CTO and other C-level executives on culture and values
is a key factor in ensuring the technical team feels respected and included in the
company, which should in turn help with retention.
One general piece of advice for working with other leaders or executives is to not
shy away from delivering bad news to your fellow leaders, especially the CEO. Since
you are accountable for the performance of the team, you may be tempted to
sugarcoat reality, to advertise that everything is fine. The problems with this
approach are numerous:
* If things are not fine, meaning deadlines are consistently missed or quality is
falling below expectations, your peers will know that and will wonder why you're
not owning those failures and explaining how you'll improve. This disparity between
reality and how you're representing it undermines trust in your leadership.
* From time to time, you'll need to make time for the software engineering team to
do non-user-facing engineering as an investment, either in tech debt or future
architecture. You'll need to have the trust of other leaders and the credibility so
that others believe you and understand the ROI on that time.
See the Principle section of Chapter 1 of Jocko Willink and Leif Babin's *Extreme
Ownership: How U.S. Navy SEALs Lead and Win* for more on the importance of owning
failure.
Technical topics are often highly nuanced; the details matter. Technical jargon
does a great job at helping to convey that nuance, so it's no surprise that when
engineers explain technical subjects they often use language that is unintelligible
to members of other departments. I'm sure you've witnessed the over-eager engineer
trying, with energy, passion, and excitement, to explain their project to a non-
engineer only to be met with a blank stare and no new shared understanding being
created. As a technical leader, you must do better.
If, for example, you have a major area of tech debt and you want to advocate within
the executive team for taking an entire month to re-architect that area of code,
you need to communicate your reasons in a comprehensible way. If you enter that
conversation discussing latencies, RPC calls, dependency injection, and acronyms
from your cloud service provider, chances are your CEO and CFO will tune you out
almost immediately.
If, on the other hand, you frame that conversation around developer productivity
and team morale, and explain the debt paydown in the context of team velocity over
the next six months, your argument will be much more convincing.
A few general tips for ensuring more successful discussions and presentations,
particularly with non-engineers, when talking tech:
* Establish a shared language/vocabulary upfront. If you need to use any words the
average high-school student wouldn't understand, make sure that your audience
already knows them, or clearly define them before launching into the explanation.
* Use relatable concepts. Technical challenges are often compared to other technical
challenges, but that does not work when talking to non-technical staff. Rather than
describing your slow data transfer in bytes per second, compare it to traffic on a
highway.
* Confirm understanding along the way. Ask questions of your audience during your
explanation. If they can say the punchline before you do, you know you're on the
right track.
* Don't assume you are in any way superior due to your mastery of technical language,
or worse, make your audience feel inferior for their lack of knowledge. Saying "I
don't want to get too technical for you" is a great way to turn off an audience.
In general, try to keep your explanations as simple and concise as you can. Avoid
going down rabbit holes that might be a distraction or otherwise disengage your
audience.
In that way, hiring is as much a sales activity (where candidates qualify you/your
company) as it is a filtering process (you/your company qualifies the candidates).
It's important to keep this in mind every step of the way as you define your team's
hiring processes.
This section of the book covers each stage of the hiring and interviewing
journey in order, from headcount planning to onboarding.
As a startup, you have several key advantages in hiring, and it's critical you
leverage those in the process to ensure you can attract top talent. A few features
that give you an edge:
* You're smaller, meaning you should also be more nimble, higher-touch, and faster
to hire than large competition.
* You can sell a highly compelling company and personal growth trajectory.
* You can sell the impact that successful candidates will make.
* You can offer meaningful equity ownership in the company and thus the ability to
share in the upside of the company's success.
* To move fast, train and enlist coworkers for interviews before posting the job
description. Before you even get started, make sure that everyone understands the
scheduling process, the interview scripts, the scoring criteria, and how to use the
Applicant Tracking System (ATS) to read and leave feedback (see Sourcing
Candidates, page 59).
* Schedule all the interviews with each candidate upfront. If your process includes
four interviews, get all four on the calendar at the start, ideally within five
business days. If your team is quick about leaving their interview feedback (and
you should insist that they are), then you can give any candidates that fail out
early notice and cancel any pending calendar invitations. The alternative,
scheduling subsequent interviews only after a candidate passes each round, adds
multiple days between each step, easily turning what could be a week-long
process into three weeks or more.
* A good rule of thumb, especially at a startup: nobody is too busy or too
important to make themselves available to meet with candidates if it makes sense
for them to do so, especially for more senior hires. If a strong candidate asks to
talk to your CEO and COO, then you should schedule meetings with them.
* Ensure that each interviewer has a unique script or guide that covers different
material, or material from a different angle than other interviews. (For more
detail on this, see the Ask Only New Questions section of Interviewing Best
Practices, page 63.)
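To make the scheduling math concrete, here is a minimal Python sketch comparing the two approaches. The four rounds, the five-business-day window, and the assumed four business days between sequential rounds (collecting feedback, emailing the candidate, finding the next open slot) are hypothetical placeholders; substitute numbers from your own process.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date that falls `days` business days (Mon-Fri) after `start`."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def upfront_finish(start: date, window_business_days: int = 5) -> date:
    """All rounds are booked on day one inside a single scheduling window."""
    return add_business_days(start, window_business_days)


def sequential_finish(start: date, rounds: int = 4, gap_business_days: int = 4) -> date:
    """Each round is booked only after the previous round passes.

    `gap_business_days` is an ASSUMED average for collecting feedback,
    emailing the candidate, and finding the next open calendar slot.
    """
    return add_business_days(start, rounds * gap_business_days)


if __name__ == "__main__":
    kickoff = date(2024, 3, 4)  # hypothetical Monday start
    print("Upfront:   ", upfront_finish(kickoff))     # 2024-03-11
    print("Sequential:", sequential_finish(kickoff))  # 2024-03-26
```

Under those assumptions, the upfront schedule wraps up a week after kickoff, while the sequential one stretches past three weeks.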
As ever, there is no free lunch, and the benefits of hiring as a startup also come
with tradeoffs, primarily in the form of risk. Candidates are almost certain to ask
you questions about your company's product-market fit, cash on hand or runway,
company culture, and work/life balance. I encourage you to be candid with
candidates about these factors, work with your executive team on the facts on the
ground, and have good answers to these questions when asked.
Remember, speed is your friend when recruiting top talent. If you can move somebody
through your full process in a week, you give your startup a major advantage
against larger companies whose processes often take months to arrive at a decision.
The first step towards deciding to hire is identifying a gap in the team. Gaps come
in several forms. Commonly, at an early stage in a company's development, it's
simply a skill gap. For example, your business decides that mobile apps are going
to be a key element of your go-to-market strategy, and your founding team has never
worked in mobile before. Certainly, they could learn and become effective over
time, but it would be far more efficient in both the short and long term to hire a
senior engineer who has experience with, and a desire to work on, mobile to build and
maintain that project.
Other kinds of gaps include seniority gaps (not enough senior experience to make
good decisions, or not enough junior talent to handle less complex tasks),
management gaps (one manager responsible for too many people), or subject matter
expertise gaps (no one on the team who understands an area of the industry well
enough to guide decision-making).
The other major justification for a hire is to increase total bandwidth on a team.
These kinds of hires should be aligned with some kind of business objective, or
product roadmap, that justifies bringing on a new permanent team member at a given
time.
Once you've identified a gap, the next question to ask is when that gap needs to be
filled. Taking into account the lead time required to get a great hire, when does
it make sense to start the hiring process?
There's generally a pressure to hire ASAP. Try to fend off that pressure, as it's
not always the right answer. Every new person you hire adds complexity and overhead
to your team. Assuming the pain of having the gap isn't severe, if you can get away
with a smaller team for another six months and delay the hire, that can be a good
idea as it both reduces cost and gives you more time to build a case for the hire.
Headcount or hiring requests from your team will often have to compete with
requests from other teams, so it's useful to develop a common language across your
company for discussing how urgent or important a hire is. This doesn't have to be
very sophisticated; it could be a 0-5 ranking system, where a 0 represents an
urgent need, and a 5 a hire that would be nice to have but can wait a few months or
quarters before becoming urgent.
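If you want the 0-5 ranking to live somewhere more structured than a spreadsheet, a minimal Python sketch might look like the following. The `HeadcountRequest` type and the example teams are hypothetical; the only rule encoded is the one described above (0 is most urgent, 5 can wait).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HeadcountRequest:
    team: str
    role: str
    urgency: int  # 0 = urgent need ... 5 = nice to have, can wait a few quarters

    def __post_init__(self) -> None:
        if not 0 <= self.urgency <= 5:
            raise ValueError(f"urgency must be between 0 and 5, got {self.urgency}")


def prioritize(requests: List[HeadcountRequest]) -> List[HeadcountRequest]:
    """Most urgent first. sorted() is stable, so ties keep their submission order."""
    return sorted(requests, key=lambda r: r.urgency)


if __name__ == "__main__":
    queue = prioritize([
        HeadcountRequest("Platform", "Senior backend engineer", 2),
        HeadcountRequest("Mobile", "Senior mobile engineer", 0),
        HeadcountRequest("Data", "Analytics engineer", 5),
    ])
    for req in queue:
        print(req.urgency, req.team, req.role)
```

The value here is less the code than the shared vocabulary: every team submits requests on the same scale, so comparing them across departments becomes a sort rather than a debate.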
It's essential that you maintain a clear picture of your department's
contribution to the company's financial model, which will primarily come in the form of headcount
expenses (current and future). This model should provide some level of constraint,
in the form of either an annualized budget or an expense run-rate, that will guide
the timing of your hires.
Just as with designing software systems, when you sit down to design your interview
process, you should begin by considering your requirements and goals. While every
company should and will have its own requirements and perspective, here are some of
the things I consider when designing an interview process:
**Efficiency:** How much time and cost does it take to hire a candidate?
**Success rate:** How successful on the job are hired candidates and how long do
they stay with the firm?
**Equitable opportunities:** Have you ensured that every person has a fair shot at
being hired, and avoided unconscious bias as much as possible?
**Scalability:** Can people other than you run the process and be as effective/have
a success rate similar to yours?
#### Efficiency
Hiring well is an expensive undertaking for your company. That cost comes in actual
dollars, be it for recruiters, job board listings, or job advertisements. It also
costs time, primarily in employee time spent conducting interviews. As you design
your interview process, consider what your intention is with each step, what you
are filtering for, and how to accomplish that filtering most efficiently.
One way to reduce or at least spread out the time investment is to include other
team members in the hiring process. Depending on the subject matter of a given
interview, you don't always need your most senior engineers in the room. A hiring
coordinator, with appropriate training, can do a phone screen, a culture interview,
or a reference check just as effectively as a senior engineer or executive.
Not every hire is going to be a home run for your company. Some will be leveled
incorrectly, some won't be a culture fit, and others will be fired or quit in the
first year. Especially as you scale an interview process, you need to measure how
many hires are successful. This is one of the few opportunities as a technical
leader where you can calculate clear, consistent, and indicative metrics, so take
advantage and ensure your process is top notch. Consider tracking time to hire
(from posting a job description until a new hire start date), overall employee
retention, new hire attrition (or down-leveling), as well as how many new hires are
promoted in their first two years.
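As a starting point for tracking these metrics, here is a minimal Python sketch. The `Hire` record and its fields are hypothetical, and the sketch only counts hires who have been at the company long enough to be measured (one year for attrition, two years for promotion), so recent hires don't skew the rates.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import List, Optional


@dataclass
class Hire:
    posted: date                     # job description went live
    started: date                    # new hire's first day
    left: Optional[date] = None      # departure date, if they left
    promoted: Optional[date] = None  # first promotion date, if any


def median_time_to_hire(hires: List[Hire]) -> float:
    """Median days from posting the job description to the start date."""
    return median([(h.started - h.posted).days for h in hires])


def _eligible(hires: List[Hire], as_of: date, min_days: int) -> List[Hire]:
    """Only hires who have been around long enough to be measured."""
    return [h for h in hires if (as_of - h.started).days >= min_days]


def first_year_attrition(hires: List[Hire], as_of: date) -> float:
    """Share of eligible hires who left within 365 days of starting (0.0 if none eligible)."""
    cohort = _eligible(hires, as_of, 365)
    if not cohort:
        return 0.0
    early = [h for h in cohort if h.left and (h.left - h.started).days < 365]
    return len(early) / len(cohort)


def two_year_promotion_rate(hires: List[Hire], as_of: date) -> float:
    """Share of eligible hires promoted within 730 days of starting (0.0 if none eligible)."""
    cohort = _eligible(hires, as_of, 730)
    if not cohort:
        return 0.0
    promoted = [h for h in cohort
                if h.promoted and (h.promoted - h.started).days <= 730]
    return len(promoted) / len(cohort)
```

Most ATS products can export posting and start dates; departure and promotion dates usually come from your HR system.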
As Andy Grove discusses in *High Output Management*, even a world-class interview
process is successful only about 70 percent of the time. Fundamentally, there are
many risks in hiring: you're trying to predict how someone will perform forty hours
a week, week in and week out, based on just a few conversations and data points
gathered in an interview process.
The best leaders track their success rate, are not afraid of admitting hiring
mistakes, and follow the adage "hire slow, fire fast."
There's no getting around it: firing a new employee who isn't working out shortly
after they were hired is socially awkward and uncomfortable for everyone. It is,
however, the responsible thing to do for your team. Some practices to help provide
transparency to new employees and assist managers in making good decisions include
implementing a formal ninety-day probationary or introductory period, requiring
new-employee/manager check-ins every fifteen or thirty days, or using a
contract-to-hire employment structure.
Candidate experience is how candidates feel about your company during and after
they go through your hiring process. Many candidates will do due diligence on your
company before applying or interviewing. They are likely to look at online forums
and social media and see what other candidates or employees who went through your
process have to say about you.
You can't always control what people say about you, but nonetheless, you want to
provide the sort of candidate experience that makes them more likely to read good
things online, have a great experience themselves, and thus be more inclined to
continue with your interviews and accept your offers.
There's a saying that people tend to hire people who look like themselves. This is
often the result of unsophisticated interview scoring methods that simply rely on
an interviewer's gut feeling, and gut feelings are often strongly influenced by
unconscious bias. This bias can disadvantage candidates of other races, genders,
ethnicities, etc.
As you design your interview process, you should focus on evaluations based on a
rubric that aligns with requirements from a job description, not just an
interviewer's gut feeling. See the Avoiding Bias section in Interviewing Best
Practices, page 63, for more about avoiding biases.
#### Scalability
It's all well and good if you, individually, are capable of hiring effectively. At
some point, there will be more open roles than you can hire yourself, and you'll
need to scale the process and bring in other people. To do so effectively, you must
build a repeatable system that others can leverage to identify top talent and hire
with the same efficacy and success rate as you.
That means that somebody else will need to be able to conduct the same interviews
and draw the same conclusions at the end of that process that you likely would have
if you'd conducted the interviews yourself.
The goal of the remainder of this section is to help you create a scalable system
for interviewing and hiring that fits your organization's goals and can work
seamlessly *without* your direct involvement (once you've taken the time to
calibrate it). By defining and deploying this kind of structure, creating
thoughtful documentation, and templating, you enable others to conduct interviews
and produce candidate ranking scores that would closely mirror your own.
Many companies underestimate the value of a great job description. A really good
job description does two main things for you: it helps you create clarity and
alignment internally with your company on what the role does and the value it can
offer, and it advertises your company and attracts the kind of applicants that
could be a good fit.
Job descriptions that other companies have posted for similar roles can often provide good inspiration and calibration, especially when
you're hiring for less common roles that may not be as well addressed by the
scalable system you've put in place.
The traditional job description has a brief description of what the role will do,
followed by a bulleted list of requirements for the candidate. I encourage you to
write more than that. Rather than focusing on what a successful candidate will do
in a particular role, think through the purpose the role serves. What outcomes does
the role drive? What kind of impact do you expect from this role in three, six, or
twelve months? You may or may not want to publish the answer to these questions as
part of the job description, but the exercise of going into detail on expectations
will prove valuable nonetheless.
Socialize the answers to these questions with other leaders at your organization
and ensure they agree with the answers. Don't be surprised if you get significant
feedback on the first version of the responsibilities and outcomes of a role. At
most startups, before a headcount is formally opened, there is a high-level,
unstructured conversation around a specific title: "Oh, we need to hire a senior
JavaScript backend engineer." The act of writing and socializing the job description
enables your team to get precise about what the company really needs, so it's
natural that you'll need to do a few revisions.
Everything your company posts publicly is a reflection of your culture and brand. A
job description is no exception. The job description targets the single most
valuable customer of that culture and brand: your employees, present and future.
Ideally, the right candidate (someone who not only meets your job requirements but
who is also a great culture fit) reads your job description and is excited about
both the role and the company itself.
* Include your company's core values, mission, or vision (whatever you have) front
and center.
* Include the impact that the role will have on your team, company, and customers.
* Include location, on-site requirements, and whether remote work is allowed and to
what extent.
Inbound recruiting is about marketing your job opening and collecting voluntary
candidate applications. Much like any other marketing exercise, a one-channel
approach may not be enough to drive results.
As such, posting a job description on a job board is the bare minimum. Depending on
the state of the market, how many roles you're hiring for, and the quality/clarity
of your job description, the posting alone may be sufficient. Often you'll need to
do more to draw in top talent, including actively promoting your roles in
specialized tech communities and/or marketing your brand via conference
attendance/sponsorship, a company blog, social media outreach, etc.
There is no universal best venue for placing classified ads that great employees
turn to. Keep your ear to the ground for whatever platform/job site seems to be
most common and post your job description accordingly. This is something a good
Applicant Tracking System (ATS) will help you with, as it can track a referral
source for every candidate and provide metrics around which job boards bring in
better/more candidates that make it deeper into your process than others. When
hiring designers in particular, it's important to talk to some working designers
about where the most popular portfolio hosting sites are and maintain a presence on
those job boards to find the best candidates.
You'll also want to monitor how many applications you're receiving for each role.
At a minimum, your hiring manager(s) should be looking at the state of the funnel
for their roles on a weekly basis and adjusting their approach accordingly. If a
role isn't getting enough applicants (or is attracting the wrong applicants), then
change something! Try tweaking the job title or posting the job description to new
channels. A key element of a strong hiring process is the same as any other process
you build for your team: a humble willingness to revisit past decisions and improve
over time.
Not all external recruiters are the same. You want somebody who meets all of these
criteria:
* Highly organized
* Able to effectively sell your role (it's your job to train and hold them
accountable to do this well)
* Inclined to value the relationship with both you and the candidate more than the
commission for a single placement
#### Referrals
The highest return on investment in hiring comes from internal referrals, i.e.,
referrals from your existing employees. People are much more likely to want to do
business with a company that is spoken highly of by a current team member, and it's
often easier to find a cultural fit when the candidate has already been vetted by
someone familiar with your culture. You can encourage internal referrals by
providing cash incentives (see sidebar) or by keeping referrers well informed
about the status of their referrals.
Given that referrals have such a high chance of success, you want to provide the
best possible candidate experience. You may also want to consider an abbreviated
(but fair) hiring process. Skipping or compressing any top-of-funnel coarse
filters, such as phone screens or qualification forms, may be appropriate. You may
also want to encourage the referrer to contribute a paragraph or two, in writing,
justifying their referral.
Based on the data you likely already have, it's relatively easy to approximate the
cost (in both time and actual dollars spent) to hire a new engineer for your
company. If you consider that referrals often have a substantially higher
conversion rate to hire, it becomes clear that referrals save thousands to tens of
thousands of dollars, which can help you justify a multi-thousand-dollar bonus to
any employee who refers a candidate who is ultimately hired and stays in their role
for more than a few months.
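The back-of-the-envelope math might look like this minimal Python sketch. Every number in it (candidates interviewed per hire, average interview hours per candidate, a loaded hourly rate of $100, and $5,000 of external sourcing cost per inbound hire) is a hypothetical placeholder; plug in figures from your own ATS and budget.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    candidates_per_hire: int             # candidates interviewed per accepted offer
    interview_hours_per_candidate: float # average across everyone who enters the funnel
    external_cost_per_hire: float        # recruiter fees, job ads, etc.


def cost_per_hire(channel: Channel, loaded_hourly_rate: float) -> float:
    """Interviewer time spent across the funnel plus external spend, per hire."""
    interview_cost = (channel.candidates_per_hire
                      * channel.interview_hours_per_candidate
                      * loaded_hourly_rate)
    return interview_cost + channel.external_cost_per_hire


if __name__ == "__main__":
    rate = 100.0  # hypothetical fully loaded cost of one interviewer-hour
    inbound = Channel("inbound", candidates_per_hire=40,
                      interview_hours_per_candidate=3.0,
                      external_cost_per_hire=5_000.0)
    referral = Channel("referral", candidates_per_hire=8,
                       interview_hours_per_candidate=3.0,
                       external_cost_per_hire=0.0)
    savings = cost_per_hire(inbound, rate) - cost_per_hire(referral, rate)
    print(f"Savings per referral hire: ${savings:,.0f}")  # $14,600
```

Under these made-up inputs, each referral hire saves roughly $14,600 before the bonus itself, which comfortably covers a multi-thousand-dollar referral bonus.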
The interview flow is where the rubber meets the road on your ability to determine
how well a candidate fits the role you're hiring for. Keep in mind that there is no
perfect interview. The amount of data an interviewer collects in a sparse few hours
with a candidate, of course, cannot perfectly predict how well somebody will do
full-time on the job for months and years to come.
In this section, I cover some high-level interviewing best practices, and then
provide some background and context on the various steps of interviewing, including
candidate/résumé intake forms, phone screens, culture interviews, technical
interviews, coding or take-home assignments, executive interviews, and
finally reference checks.
When designing your interview process, your candidate experience should be top of
mind and a top priority. Even if you choose not to hire a candidate, that person
will walk away with an impression (good or bad) of you and your company. That
impression may lead to them singing your praises to those in their professional
network who may someday apply for your roles. Or that impression could lead them to
rant negatively about you every chance they get.
Job boards and Google reviews are littered with the evidence of interviews running
amok, and it's very difficult to undo the damage to your reputation once it's been
done. While it's true that, for some candidates, no amount of respect and
consideration on your part will prevent the bitter sting of rejection from
poisoning their takeaway opinion of you, those people are in the minority. For most
candidates who get to the interview stage, a respectful and thoughtful interviewing
process will leave them with a neutral-to-positive feeling about your company and
help you avoid negative press online.
Ideally, you/your team will communicate the steps and scope of your hiring process
to candidates upfront and leverage an easy, reliable solution for scheduling those
steps in real time. For example, you can choose to (A) designate a hiring manager
to handle all of the scheduling during business hours, (B) schedule all of the
interviews in advance, or (C) provide an online tool that candidates can use to
schedule their interviews asynchronously on their own time.
Truly, anything is better than requiring each interviewer to email each candidate
before each interview to set up schedules sequentially, which can drag out an
interview process over weeks or months.
Every interview touchpoint should feel to the candidate like a continuation of the
conversation, rather than a rehashing of details that were discussed in prior
sessions. Avoiding the latter requires thoughtful structuring and careful planning
in advance of your interviews.
Ideally, subsequent interviews should be used to dive deeper and explore areas
specific to a candidate or role, where both parties are looking to fully understand
key strengths and weaknesses. Sharing suggested areas to focus on or new questions
to ask with subsequent interviewers via an Applicant Tracking System (ATS) is a
great way to ensure continuity, efficiency, and a great candidate experience that
can reveal whether or not your potential hires are a true fit for your team.
If you're unfamiliar with the phrase "unconscious bias," I encourage you to read
*Thinking, Fast and Slow* by Daniel Kahneman. It's my go-to book for understanding
many types of systematic errors our brains make.
Bias takes many forms. Most biases are unconscious and can surround gender, race,
alumni status, or socioeconomic background. But bias can also mean that the
conclusions drawn by an interviewer about a candidate ahead of an interview are
based solely on ranking scores from a prior interviewer. There's no system that
ensures eliminating all harmful biases, but there are certain steps you can take to
minimize unconscious bias, such as blanking out candidate names or photos (which
often hint at gender and ethnicity) during a résumé screen.
Most of the interview feedback should consist of detailed notes and scores against
the job-specific scoring guide, which has your interview questions planned out in
advance (for more, see Technical Interviews, page 76). This feedback should ideally
not be read by subsequent team members in advance of their interview to avoid bias.
For example, if you know the prior interviewer scored the candidate poorly, you may
experience confirmation bias and overvalue any areas where a candidate does poorly
in your interview.
When interviewing more than two or three candidates simultaneously, it can require
a substantial effort to manage the logistics of where candidates stand in the
process, coordinate notes from interviewers, and communicate consistently and
promptly with candidates as they move through the funnel. Without a finely tuned
system to manage all of these logistics, it's easy for candidate experience to
suffer and for hiring costs to rise. This is a universal problem, and several
high-quality, off-the-shelf Applicant Tracking System (ATS) solutions have been
developed at various price points and levels of sophistication to address this
problem.
The guidance here is simple: choose and onboard an ATS early. Don't wait until your
process is already underwater to take action. Train your team, require widespread
adoption of the system, and set expectations for its use with HR, hiring managers,
and interviewers.
* During the sales process with a customer, you're always focused on selling the
prospect on the product, even when you're qualifying the customer. A good sales
process regards qualifying prospects as a funnel, with light-touch qualification
at the top and progressively more nuanced/time-intensive qualification down-funnel,
along with a progressively more customized and tailored sales pitch.
* You should always be selling your candidates on the advantages and positive
benefits of joining your company and the role/opportunity you're offering. By the
time they jump through all your interviewing hoops, they should be eager to work at
your company and excited to take your job offer over others they have received (or
may yet receive).
* Always leave candidates some time (five or ten minutes) to ask questions at the
end of the interview. Most qualified candidates come to interviews armed with
questions, and you can learn a lot about what somebody cares about by what they
choose to ask. This is a good opportunity for your interviewer to sell the benefits
of your company in their responses.
* Along the way, ensure candidates feel respected and are progressively exposed to
more of your company. Your best candidates need to feel like they were
intelligently vetted *and* like they've learned enough about the company to get
excited. Ideally, you want even rejected candidates to be able to leave positive
reviews on Google and Glassdoor. You can accomplish that by selling your company's
benefits throughout the interview, respecting people's time as if it were your own,
having consistent and timely communication, and ensuring that everyone feels the
process was as fair and transparent as possible.
The beginning of the interview funnel is a form that achieves two goals: it
provides the candidate with some information about your company and its hiring
process, and thus a sample of its culture; and it takes in a bunch of information
from the candidate to act as an inexpensive, coarse-grained filter.
At the top of your intake form, you should outline several key pieces of
information for candidates:
* Reiterate the role they are applying for and its key requirements and impact.
* Reiterate your company's core values and provide a sample of your culture.
* Set expectations for the hiring process, how long it will take, how many steps
there are, and generally what the process looks like.
The questionnaire should include a request for the candidate's résumé (or LinkedIn
profile URL), ask some questions required by legal and HR with respect to
employment eligibility, and then ideally ask a few qualifying questions of the
candidate. The qualifying questions should be light-touch, generally freeform, and
possibly even technical questions to ensure the candidate is in the right ballpark
for the role. For example, for a role that requires experience in JavaScript, it's
not unreasonable to confirm that experience in the questionnaire with a question
like, "Rate your comfort level working with JavaScript on a scale from *not
comfortable* to *extremely comfortable*."
This may seem redundant with the requirements listed in the job description, and it
is, though you'd be surprised how many résumés will come through lacking basic
qualifications. These questions are quick/trivial for the candidate to answer and
just as quick for a hiring manager to use to filter out applicants.
If you're inundated with candidates and want to do a bit more filtering at this
stage, the questionnaire can also include one or two more interesting or difficult
questions. If you include these, be sure to still keep them brief; you don't want
to lose candidates in this form because the questions were too arduous. If you're
overwhelmed with applicants, then bias towards collecting more data to filter with here;
otherwise, it's probably best to save more nuanced qualifications for further down the
funnel.
Some example questions for an intake form covering broad compatibility and
self-identified technical familiarity (I've included a sample at
[ctohb.com/templates](https://ctohb.com/templates)):
* What are deal makers and deal breakers in your next move?
* What gives you energy in your work? What taxes your energy?
* Basic technical qualifiers: On a scale of 1-10, how familiar are you with
[relevant programming language or tool]?
The initial phone screen, like everything in the interview process, serves a dual
function: it's an opportunity to learn more about the candidate, and it's the
candidate's first interaction with (and evaluation of) a human at your company.
Given that this is the first person the candidate will have an interaction with at
your company, it's worth thinking carefully about who conducts the interview. The
questions at this point should not be very technical in nature, so it's not
necessary that the interview be conducted by a member of the technical team. Often
it is done by HR or a dedicated recruiting team.
Regardless of who runs the phone screen, ensure that person is a good cultural
representative for your team/company and is equipped with the information technical
candidates are likely to ask for at this stage, including:
1. What the software stack looks like: including key languages, tools, and target
clients (e.g., mobile, desktop, etc.). The interviewer should have a rudimentary
understanding of the words they are using here, rather than just reading off a list.
2. The size of the technical team, both across the company and within the group the
candidate would be working with. This should also include general hiring forecasts
and roughly how many people are being added over time.
3. Who the candidate would be reporting to. Provide some basic background on that
manager, including their tenure at the company, maybe what they did before working
at the company.
4. A great sense of the company's core values/culture and way of doing work.
The interviewer's goal should be to introduce the candidate to the company, its
culture, the role, and the hiring process. They will also ask some high-level
questions of the candidate to confirm their structural fit for the role. You want a
candidate to walk away from this interview motivated to do well in the rest of the
interview process and excited at the idea of working for your company.
The exact questions asked in a phone screen are thus not super important. Here is
an outline of some areas you may want to cover:
* Do they have anything constraining their hiring timeline (e.g., other job
offers)?
* Where is the candidate located, and are they willing to relocate if necessary?
* Roughly when can they start or are they looking to start?
* Confirm compensation expectations are aligned and explain benefits/perks.
In addition to collecting good answers to these questions, the interviewer should gauge the
candidate's general fit for the role. Does the candidate communicate clearly, do they seem like
a culture fit, does their claimed experience match what they have on their résumé,
and are they interested in the company and opportunity?
One of the major criteria you're looking for in your interview process is culture
fit. Culture fit is all the elements of a candidate's personality, beyond their
experience and skills, that will enable them to be successful in your organization.
In order to effectively screen candidates for culture fit, your company should have
a fairly clear idea of what its culture is. This can look like many things, for
example, a list of core values, a mission statement, a vision statement, guiding
principles. Whatever they are, they should be authentic and true to the company. If
you're struggling with this, I would refer you to *Team of Teams* by Stanley
McChrystal, *Work Rules!* by Laszlo Bock, and *Good Authority* by Jonathan Raymond.
Currently, there are few formally structured interview programs that are widely
used. The one that does come up fairly regularly is called *topgrading*, which
refers to at least two different things: the topgrading method and the topgrading
interview. The topgrading method
([ctohb.com/topgrading](https://ctohb.com/topgrading)) is a hiring methodology that
was purportedly developed by General Electric in the 1980s/90s and written about in
Verne Harnish's *Scaling Up*. The topgrading interview
([ctohb.com/interview](ctohb.com/interview)), which I call the culture interview, is a specific interview
agenda, style, and structure designed to learn about a candidate's background and
cultural fit.
For each role, topgrading has the interviewer ask the following questions:
* What do you think the supervisor's honest assessment of your strengths and
weaknesses would be?
Whether you're using one or two interviewers, taking notes is critically important.
To review candidates fairly, you will want to create a scorecard which evaluates a
candidate's answers, looking for alignment to your company culture. For example, if
"respectful challenge" is a company core value, ask the candidate whether they can
identify any instances of challenging respectfully. Did they speak
disrespectfully about any past coworkers? Using notes after the interview to
complete and justify scores on a scorecard is essential.
**Minimize Drop-Off:** Employers want candidates to actually complete coding
assignments rather than drop out of the funnel.
There are several styles of coding interview or assignment. Assignments range from
take-home projects with a prompt, to using an online platform for programming
exercises (also sometimes known as code katas), to live pair programming. Absent
any empirical data about the predictive capability of these styles, I encourage you
to design an exercise that looks as much like regular day-to-day work at your
company as possible. If you don't do any pair programming at your company, then
gauging how a candidate performs while pair programming in an interview setting
intuitively doesn't seem highly predictive of their on-the-job performance. At best,
it collects tangential signals.
As a manager, your aim is to get the best out of the people you work with. With
that in mind, try to recall the last time you were interviewed and exercise your
empathy muscle when designing your coding assignment. Being interviewed is, for
most, a very stressful process, and being asked to be creative or problem-solve in
that scenario doesn't always bring out the best performance. Some ways to help
candidates do their best work on a coding assignment are:
* Be explicit about what you're looking for in the candidate's output. For example,
if your scorecard measures how well they've documented their code, then ensure the
prompt the candidate is given tells them to include documentation. Or if you plan
to run the code, let the candidate know whether you'll just be evaluating
correctness, or if other elements matter, such as performance, negative cases, etc.
Candidates are more likely to complete your take-home coding assignment if they
find it interesting and easy to get started. The best assignments are topically
related to your business and ideally expose the candidate to the kind of problems
your company actually faces on a daily basis.
**Bad example:** You're a web SaaS platform, and you assign a candidate to do a
challenge related to mobile phone development.
**Good example:** Your company integrates with many legacy third-party APIs, and
your challenge is to build a limited integration with a Sandbox API with similar
domain nouns/verbs to the real business.
Providing candidates with existing code repositories that have working build
systems/tests to start with saves them the time of bootstrapping a build
themselves.
The classic technical interview, practiced by many of the largest tech companies,
involves some form of shared whiteboard experience where the candidate is asked to
solve a technical problem in real time. The problems range from the academic
(sort an array with some special conditions) to high-level, hand-wavy architecture
(design a system to handle 100 million users posting news feed updates). The classic
interview approach must work for the big companies, as they continue to use it year
after year, but I don't see how it works at a startup. These questions are often
overly broad or overly narrow, and thus difficult to score fairly. The academic questions are
rarely correlated with the types of problems one solves on a daily basis on the
job.
#### Methodology
To find out where a candidate's strengths and weaknesses are, and how much that
matters in the role you are hiring for, first you need to decide what topic areas
matter for your role. You do this by creating a technical focus interview guide,
which should include a list of anywhere from four to eight technical areas, and
within each area a set of sample questions, best practice answers, and a scoring
guide.
The sample answers and scoring guide are included to ensure fairness and uniformity
in scoring across multiple interviewers and across candidates. You're trying to
differentiate where any given candidate has gaps vs. true expertise, so your
questions should be designed to elicit one of three kinds of answers: bad, good,
and amazing. Thus, they should lend themselves to being scored as such. When it
comes to scoring a question, to make the difference between a knowledge gap and
true expertise obvious, I recommend that a bad answer gets a score of 0‒2, a good
answer gets a score of 3‒6, and only an amazing answer gets between 7‒10.
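If your interview guide lives in a spreadsheet or a small internal tool, you can encode these score bands so an interviewer can't record a score that contradicts the category they chose. Below is a minimal Python sketch, assuming an integer 0-10 scale. The names `SCORE_BANDS`, `band_for`, and `check_score` are hypothetical and not part of any particular tool.

```python
# Hypothetical encoding of the bad/good/amazing score bands on a 0-10 scale.
SCORE_BANDS = {
    "bad": range(0, 3),       # 0-2: little to no experience or expertise
    "good": range(3, 7),      # 3-6: competency, possibly a high level of it
    "amazing": range(7, 11),  # 7-10: true understanding and intellectual depth
}


def band_for(score: int) -> str:
    """Return the answer category that a 0-10 score falls into."""
    for category, band in SCORE_BANDS.items():
        if score in band:
            return category
    raise ValueError(f"score must be an integer from 0 to 10, got {score!r}")


def check_score(category: str, score: int) -> int:
    """Reject a score that falls outside the interviewer's chosen category."""
    band = SCORE_BANDS[category]
    if score not in band:
        raise ValueError(
            f"{score} is outside the {category!r} band ({band.start}-{band.stop - 1})"
        )
    return score


# Using the unit-test-suite example from the text:
print(band_for(0))                # never thought about it: "bad"
print(band_for(5))                # described suites they've designed: "good"
print(check_score("amazing", 8))  # full outline of design philosophies: 8
```

Keeping the bands non-overlapping means every score maps to exactly one category. This preserves the clear gap-versus-expertise signal the scoring guide is designed to produce.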
When I say a bad answer, I mean a response to the question that demonstrates
little to no experience or expertise with the topic at hand. A good answer
demonstrates competency, maybe even a very high level of competency, in the topic.
An amazing answer demonstrates not only competency but true understanding and
intellectual depth on the topic. For example, if the question concerns how the
candidate thinks about designing a unit test suite, and their answer is they've
never thought about it, that's a 0 and you've found a gap. If their answer includes
a description of some test suites they've designed and some justification for it,
that's good, perhaps a 5 or 6. If their answer includes a full outline of test
suite design philosophies and the pros and cons of each and how to apply them in
different scenarios, now you're looking at real expertise and a 7‒10 score.
In the spirit of giving candidates the best chance at success, I don't recommend
scoring every question. Instead, provide a score on a topic area. This way you can
try multiple questions within a topic, looking for areas of expertise with a
candidate and scoring the net result for that topic.
Make no mistake, writing these questions, sample answers, and scoring guides is a
lot of work. The good news is that any given question is useful across multiple
roles and can be reused over a long period of time. In fact, I encourage you to
maintain a central repository of questions (and associated sample answers/scoring
guides). When it comes time to write the next technical focus interview guide,
you'll find your job much easier by being able to reuse questions from the
repository as appropriate.
The qualities you're looking for in a junior hire, with say one to two years of
coding experience, should be very different from a senior hire with ten-plus years.
The ideal junior hire should be curious, eager to learn, and have solid programming
fundamentals and be prepared to work on incremental feature development. A senior
hire, by contrast, should come with not just programming fundamentals but deep
thinking on architecture, opinions, and best practices across a wide range of tools
and problems. You should be able to trust them not only to build incremental
features but also to own and make good architectural decisions for new greenfield
projects. Since the key value these two types of roles offer is so different, it
follows that your interviews for them should be different.
For a senior hire, the focus interview where you deeply explore the candidate's
decision-making skills, understanding of concepts, and architectural know-how is
critical and should be weighed heavily. For a junior role, that knowledge deep-dive
should be shorter, and weighed less heavily than a practical coding exercise.
Start the interview informally with some light conversation. After a minute or two,
begin describing the agenda/plan for the meeting. Let the candidate know you have a
document with an interview guide in it, and your goal is to get the candidate to
discuss the topics in that guide over the next sixty to seventy-five minutes,
leaving fifteen minutes at the end for them to ask you questions.
After the preamble you'll jump into the first section of the interview guide. Your
goal in every section of the guide is not to ask every single question. You're
looking first to determine which of the three categories the candidate falls into
for that subject area (bad, good, or amazing) and then to narrow down a score from
there. You should have a pretty good idea of where to categorize the candidate
after the first question or two, then use follow-up questions to probe further to
narrow in on a score.
If a candidate completely misses, or admits they are not familiar with a topic,
there is no need to keep going to every question; you've got your score and you can
move on.
On the other hand, if a candidate nails the first question, they may well be a true
expert in that area, but you likely won't be confident of their mastery until
they've provided insightful answers to multiple questions across the subject.
Typically, it takes more time and questioning to identify mastery than a lack of
qualification.
Don't hesitate to politely cut off a candidate's answer and move on to the next
category when you know you've heard enough. Your goal is to help the candidate
demonstrate their skill and knowledge across all the topics that you've decided are
important for this role and chosen to evaluate in this interview. Letting a
candidate rabbit-hole and consume time on a single topic when you already have all
the information you need for a score robs them of the opportunity to demonstrate
their capabilities in other topics if you run out of time in the interview. It is
your job, not the candidate's, to manage the pace of the interview.
By the time a candidate gets to an executive round interview, you should have
already confirmed that they have the skills required in your job description and
will be a suitable culture fit for your company. The executive interview, in most
scenarios, is less about an executive screening a candidate and more a chance for
the candidate to meet and ask questions of the executive.
If, however, the candidate is applying for a very senior role, or is going to be
reporting directly to the executive, then it may be appropriate for this last
interview to be longer or more thorough than simply candidate Q&A.
With reference checks, you need to strike a balance between scheduling them early
enough in the interview process that they don't create a bottleneck and avoiding
wasted time on reference checks for candidates who will not get offers. Keep
in mind that candidates, rightfully, may be hesitant to provide references until
they're at the end of a process to protect their own relationships with the
references.
#### Timing
It follows then that reference checks almost always happen last in an interview
process. To avoid delaying an offer while you wait on reference checks, here are
a few tips:
* Begin scheduling meetings, in parallel, with all references as soon as they are
provided. Given how brief these conversations are, the most efficient strategy may
be to simply call references without scheduling.
#### Content
* Qualify how credible the reference is: have you managed many other engineers in
your career?
* What were [name of candidate]'s biggest areas for improvement back then?
* How would you rate their performance in that role on a 1-10 scale?
* In what environment and under what management style would [name of candidate] be
most successful?
By the time you're ready to make somebody an offer you should have a strong
opinion, based on the job description and feedback from your focus interviews, on
the level at which you would be bringing in the candidate. From there it should be
relatively straightforward to identify a salary/bonus/equity amount using your
predefined leveling bands. (See 1.4.2 Compensation and Leveling, for more on this.)
Once you've calibrated your offer amounts, you should decide how to present the
offer. Especially if your offer includes equity compensation, you should seriously
consider providing a spreadsheet that puts the offer amounts in context. The
value of a number of shares on its own is impossible for a candidate to assess.
They need additional data points to value what you're offering, including numbers
like total shares outstanding, share strike price, latest company valuation, etc.
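As a sketch of what such a spreadsheet might compute, the Python below turns the raw numbers into figures a candidate can reason about. It shows the ownership percentage the grant represents, a rough implied price per share, and the grant's paper value at the current valuation and at a few hypothetical future valuations. All figures in the example are made up, and `EquityOffer` is a hypothetical name. The math is deliberately simplified: it ignores future dilution, liquidation preferences, vesting, and taxes. It also treats valuation divided by shares outstanding as the share price, which can overstate the value of common stock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EquityOffer:
    options_granted: int     # number of stock options in the offer
    strike_price: float      # exercise price per share, in dollars
    shares_outstanding: int  # total shares outstanding (fully diluted)
    latest_valuation: float  # most recent company valuation, in dollars

    def ownership_pct(self) -> float:
        """Percentage of the company the grant represents today."""
        return 100 * self.options_granted / self.shares_outstanding

    def implied_share_price(self, valuation: Optional[float] = None) -> float:
        """Valuation divided by shares outstanding: a rough per-share value."""
        v = self.latest_valuation if valuation is None else valuation
        return v / self.shares_outstanding

    def paper_value(self, valuation: Optional[float] = None) -> float:
        """(Implied share price - strike price) x options granted, floored at 0.

        Simplified: ignores dilution, liquidation preferences, vesting, and taxes.
        """
        spread = self.implied_share_price(valuation) - self.strike_price
        return max(spread, 0.0) * self.options_granted


# Hypothetical offer: 20,000 options at a $1.00 strike, 10M shares, $50M valuation.
offer = EquityOffer(20_000, 1.00, 10_000_000, 50_000_000)
print(f"Ownership: {offer.ownership_pct():.2f}%")
print(f"Implied share price: ${offer.implied_share_price():.2f}")
for valuation in (50_000_000, 100_000_000, 250_000_000):
    print(f"At ${valuation:,} valuation: ${offer.paper_value(valuation):,.0f}")
```

Showing the same grant at several valuations lets the candidate see both today's paper value and the potential upside. The floor at zero reflects that options are worth nothing on paper if the share price falls below the strike price.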
The moment when you present the offer is when you need to be in super sales mode.
Ideally, you've been selling candidates all along the way so they're already very
excited about the company and the opportunity for them. Regardless, this is a big
deal for the candidate, so make sure to give the occasion the respect it deserves.
Throughout the process of explaining the offer, remember to be especially upbeat,
congratulate the candidate, and emphasize the fun you'll have and the great things
you'll build together. It's also critical that you're transparent and outline all
the key points of the offer upfront, especially anything they may not be expecting
or used to, such as equity compensation or probation/trial periods.
I recommend making the offer in three parts: a phone call, an email, and a dinner.
For the phone call, I suggest calling the candidate without prior scheduling. At
this point you'll have already done a whole bunch of scheduling with the candidate,
so there's no need to build up their anxiety further by scheduling yet another
meeting. Alternatively, you could tell them in writing that you intend to extend an
offer and schedule from there, but you lose the impact of being on the line with
them when they get the news. I find it's just simpler to call the person and share
the news all at once.
On the call, you should express excitement, convey the key points of the offer, and
answer any initial questions. Explain that, subsequent to the call, you'll email
them written materials to help provide context on the equity and, of course, a
formal written offer letter will be coming from the company. And finally, if
logistically possible, schedule a meal with the candidate to have a more personal,
in-depth conversation.
## Onboarding
Onboarding new engineers to the team, in most cases, doesn't strictly require a
large investment from the team; a good engineer will figure it out eventually. That
said, doing nothing will lead to a poor experience for your newest hire. It will
slow down their time to productivity, and it may also make it harder to identify
how well you've hired. Stated another way, good onboarding optimizes for three
goals:
1. **It respects the employee:** A good onboarding experience helps a new hire to
feel integrated into your company and culture and become productive as quickly as
possible.
1. **It helps evaluate the quality of the hire:** Good onboarding provides
structure for both the new employee and their manager, including clear goals that,
when achieved, demonstrate that you've hired well for the role.
1. **It builds your culture:** Good onboarding emphasizes a culture of continuous
improvement, helping to streamline the process for future hires and enhance the
scalability of your overall processes.
There are many right ways to do this. What follows are some relatively simple and
inexpensive techniques and practices that I've used myself. Feel free to expand on
or deviate from these ideas.
I encourage you to emphasize to your managers, your new employees, and in your
onboarding documentation that successful onboarding is the shared responsibility of
all members of your team(s), recent hires included. Depending on how often you are
hiring, onboarding documentation has a tendency to get stale. If a new employee
encounters something that is unclear, incorrect, or missing entirely from their
materials, make it clear to them that you expect them to put in the effort to
clarify and improve the documentation for the next person.
Some elements of onboarding should be the same for every engineer who joins your
company. These include the high-level process, the emphasis on
organizational culture, the types of documentation that new hires receive, and the
structure of sharing documentation and setting onboarding milestones. You wouldn't
want your frontend teams to have a rockstar-smooth onboarding process but your
backend teams to be clueless. First impressions count, and onboarding is your
opportunity to ensure that all team members get a great first experience of your
organization and are introduced in a consistent way to your company's values and
your team's best practices.
That's not to say that the nuts and bolts of onboarding will be identical across
teams. You can and should have different materials for different teams when it
makes sense, and every team and individual hire should have a customized onboarding
plan and milestones.
There are two key elements of getting a new engineer onboarded: teaching them about
your culture and best practices, and also giving them something to do by way of
structure and instructions. I prefer to break these out into two written artifacts:
The Engineering Guidebook and the Welcome to [Your Company Name] Engineering, Day 1
Guide.
The Engineering Guidebook gathers in a single document all of the opinions, best
practices, structural elements, and business operations of your engineering team.
It should be the single source any engineer can rely on to learn about choices and
decisions that are expected to be consistent across the engineering organization.
Be deliberate and thoughtful about exactly what practices should remain uniform
across the organization.
The larger your team becomes, the more it will make sense for pieces of the team to
develop their own specialized way of getting work done. That said, for most
small/medium startups of, say, fewer than seventy-five to one hundred developers,
there is a ton of value and efficiency to be unlocked by adhering to a healthy and
consistent set of best practices.
Software Engineering
- Choice of programming languages
- Opinions/requirements around CI/CD
- Standards for naming (casing in code, casing in contracts)
- Standards for data processing, protection, backup, security
- Opinions on how to use source control (Git Flow, GitHub Flow)
- Opinions on testing (kinds, tools, how much to do)
- Standard patterns for frontend and backend authentication and authorization
- Wire protocol standards (REST, gRPC, GraphQL, etc.)
- Universal requirements (Do we support mobile, responsive, translation?)
- Compliance and certification frameworks and related training (e.g., PCI DSS, SOC 2, GDPR)
- Other coding logistics: accessing private repos, linting, static code analysis,
commit message format/style.
Engineering Process
People Management
- Expectations for how performance reviews are conducted, how individuals are
evaluated/promoted
- Expectations for contribution to onboarding/hiring processes.
Distributing a Day 1 Guide is your opportunity to provide some structure for new
employees, giving them a concrete list of things to do on their first day with your
organization that will introduce them to the company culture, their teammates, your
process, and your software stack. The Day 1 Guide should, of course, reference The
Engineering Guidebook as required Day 1 reading. In addition, your Day 1 guide
should cover the following:
- Source control
- Ticket management
- Any dev/stage/prod logging
- Error tracking
- Any design tools (Figma, Sketch)
- Documentation/wiki (Confluence, Notion, etc.)
- Internal communications (Slack, email)
- Information about company hardware (including whether new hires get to choose a
laptop/phone), and expectations for using that hardware
- Instructions on how to set up a local development environment
- An introduction to the team and company organization chart: who their manager is,
relevant cross-functional leaders, direct reports, and relevant VPs or executives
- Expectations around transparency and reaching out across the organization chart
for help or escalation of concerns
- An introduction to the technical architecture
- Relevant books, blogs, and other written resources you encourage all team members
to read
As discussed in the hiring chapter, hiring is very hard. Even the most thoughtful
hiring processes will not achieve a 100 percent success rate. Said another way,
mis-hires are inevitable.
The best way to handle the potential for unsuccessful hires is first to have the
humility to acknowledge that your hiring process isn't perfect, and then to be
thoughtful about how to measure the success of the new employees and take swift
action to correct any mistakes. The process should be transparent upfront to new
employees, clearly explaining expectations. Managers should work with new employees
to make sure their role is a mutual fit, that the new employee is starting to feel
at home in the role, and that they are delivering at a level commensurate with what
they were hired for. At sixty or ninety days, it should be clear to both the new
employee and the manager whether those expectations are being met.
* Establish clear and transparent expectations between the manager and the new
employee.
* Provide guidance for the new employee on what they will do and how they'll be
measured in their first ninety days.
* Provide obvious criteria for meeting or not meeting the expectations of their
role.
The scorecard doesn't have to be lengthy or highly nuanced. The key thing is that,
whatever form it takes, after ninety days the employee and manager can look at the
scorecard and agree on how the employee has performed and have a shared feeling of
confidence on whether this is going to be a good long-term fit.
A quick word on the ninety-day length: ninety days is a commonly used timeframe for
onboarding new employees, but it is not a hard rule. A thirty-day interval is
generally too short in engineering, where there is a significant learning curve to
mastering your technology, tools, and product. On the other hand, waiting a full
performance cycle (e.g., six or twelve months) leaves a potentially poor fit in the
role for too long, preventing them from getting the remediation they need to
achieve productivity, and costing the company lost time and productivity. The right
answer is likely in between, and the exact amount of time is up to you and your
managers.
If, after ninety days, the manager and the employee agree things are not meeting
expectations, or there isn't agreement on whether expectations are being met,
something has to change. This doesn't mean you have to fire the new employee, but
it does mean you have to do something. Consider the following options in this
scenario:
* Is the problem the manager? Would this person be more successful on another team
or with a different manager?
  * If you suspect this is the case, consider a lateral move before moving to
    termination.
* Is there a cultural misalignment?
  * Realigning an employee to your culture after a misalignment is identified is
    challenging and rarely successful. If you're concerned you may be in this scenario
    at ninety days, almost certainly the right option is to part ways, and more likely
    than not the employee will be just as relieved as the manager.
In general, if it's not clear after ninety days that a hire is going to work out,
it likely won't magically become better after 120 or 150 days, and it's best to let
them go. You should terminate this employee the same as any other, with a full
severance package and as much kindness as possible.
I encourage you to take full ownership of the mis-hire. If you hired them, take
responsibility; it means your hiring process isn't perfect. Don't penalize the
employee for it. An industry-standard severance package at a startup is four weeks'
salary, benefits if you can extend them, and assistance finding another job in any
way you're comfortable offering.
Onboarding begins the second somebody agrees to work at your company and signs
their offer letter. You should be thinking about how to make your new employee
successful even before their first day. Not every new employee will be eager to
spend their own time learning about the company or their role in advance of their
start date, but depending on the task or what's offered, many will volunteer to do
so.
I encourage you to send candidates your Day 1 Guide as well as your guidebook the
day they sign their offer letter. If you have a company reading list, now is a good
time to order those books and have them either shipped to the candidate or offer
them in eBook/audiobook format. Most candidates are not at all interested in
reading/writing code before Day 1, but learning about your culture or reading
high-caliber books on business/culture/engineering is rarely perceived as a burden. You
shouldn't require this activity, but by making it available you'll likely get
fairly healthy volunteer participation.
On their actual start date, the candidate should meet with their new manager first
thing in the morning and check in. If they haven't read through the materials you
sent them in advance of their arrival, set the expectation that they are to do so
on Day 1. The manager should then schedule follow-up time to review the ninety-day scorecard
after the candidate has had a chance to review the introductory materials and set
up their environment/logins. This is also a good time to reinforce the idea of
continuous improvement and encourage the candidate to take ownership of any hiccups
in their onboarding and contribute to improving the documentation and process for
whoever follows them next onto the team.
## Performance Management
One of the keys to improving employee morale and promoting a positive workplace
culture is ensuring that everyone has a clear understanding of how they are
perceived in the workplace and has reliable guidance on how to level up within the
organization. The goal of any performance management system is to, as objectively
and fairly as possible, provide that transparency and structure to employees. A bad
performance management system will result in unwanted surprises or awkward and
demotivating situations, while a strong performance management system motivates
your team and encourages everybody to level up together.
Poor performance management often results in negative outcomes. Here are two
examples:
Person X, having been at Level 4 for too long, feels exasperated and demoralized
not knowing how to make it to Level 5 and get the associated raise.
Your performance management system should give everyone clarity on exactly where
they stand, what they need to improve upon (and how), when they'll be evaluated,
and how those evaluations are considered for promotions and compensation
adjustments.
Performance management and compensation design should not be done entirely by you,
the technical leader. There are plenty of ways to make mistakes here that could
expose your company to legal liability. These are easily avoided by ensuring that
your HR lead is heavily involved in the process. In fact, ideally, your HR lead
would do most of the blueprinting and lean on you only for help defining technical
competencies. Regardless of who takes the lead, HR is your partner here.
##### Overview
Once you've got the descriptions for each level written out, all that's left is to
publish a formula to summarize rankings of individual skills into a single job
level. With that, you've got yourself a transparent, objective, measurable system
any employee can use to understand their on-the-job performance and exactly where
they can improve to level up. I provide a sample formula for this summation process
in Performance Reviews, page 101.
Keep in mind that different roles should be evaluated for different contributions
to the team and should have different (though perhaps overlapping) competency
matrices. It is especially important to create a separate matrix for management as
distinct from individual contributor engineers to encourage managers to grow their
skills beyond coding.
The details of the competency matrix impact every member of the team, so it stands
to reason that the team should be included in specifying those details. Referring
to the Team-Based Decisioning Models section of Mini Management Frameworks, page
33, this is definitely a job for either the straw man or the codevelopment model.
I recommend the straw man model: Outline the key skills and impact areas you'd like
to see for any given role and take a stab at filling out most of the competency
matrix. Then introduce the idea to your tech team and let them know you'd like
their input on how to flesh out that first draft. Set aside fixed time as a team
and make sure everyone feels safe and encouraged to workshop the matrix together, perhaps using
breakout groups to workshop individual categories.
Whatever structure you choose, make it explicit, provide at least a few hours of
safeguarded time for working on it, and set a deadline by which to receive final
feedback to incorporate and turn into a candidate final draft for the team.
Aim for at most five general categories, and no more than three areas within those
categories. Any more than roughly fifteen skill/impact areas will make the matrix
too unwieldy to use as an effective performance evaluation tool (and would
certainly prove too cumbersome to collect timely team feedback).
Each level's rating criteria and performance expectations should be clearly
described; where appropriate, adjacent levels can share a description.
Regarding fairness, if any two employees in the same role and level are compensated
equally, then it stands to reason that the fairness of that compensation will
depend on how fair the leveling is. If the team as a whole contributed to and
believes in the fairness of the competency matrix, then by and large they will also
believe that the compensation tied to that matrix is fair.
Translating levels into fair compensation is slightly more nuanced than most might
assume. The easiest thing to do is create a transparent spreadsheet that says
everyone at Level X gets paid $Y per year, but a few issues arise from such a
strict system: cost-of-living adjustments (also known as local rates) and
non-performance-based compensation bonuses.
GitLab published a great blog post explaining why they pay local rates (see
ctohb.com/local). Their compensation calculator is also public (see
ctohb.com/gitlabcompcalc). That said, there's no one correct way to handle local rates,
and you should consider whether or not paying them makes sense for your business.
If it does, calculate those rates in a way that is both transparent and
data-driven.
Having a performance level translate to a specific pay range, rather than an exact
compensation amount, solves many compensation problems. Every role should be
calibrated to the market rate, but how are market rates determined? Generally
speaking, the tools and data that are available to determine a market rate will be
somewhat imprecise, and at best give a range within 10-20 percent. The reason it's
not more precise is simple: a software engineering role at your company is unlikely
to be 100 percent identical in requirements to the same role at a different
company. After all, your codebase and tooling aren't 100 percent the same.
Having a pay band also leaves room to increase compensation outside of the
performance management system. Non-performance changes include tenure-based
increases and inflation-based adjustments. You can also use pay bands as a rough
stand-in and leave space for cost-of-living adjustments before your organization
formalizes a more sophisticated local rate system.
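To make the idea concrete, here is a minimal Python sketch of a pay band built around a single market data point. It assumes a band of total width 20 percent centered on the market rate, which is one choice consistent with the 10-20 percent imprecision of market data noted above. The `PayBand` class, the centering rule, and all dollar figures are hypothetical. The `headroom` method shows how much room remains for tenure- or inflation-based adjustments before an employee reaches the top of the band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PayBand:
    minimum: float
    maximum: float

    @classmethod
    def around_market(cls, market_rate: float, spread: float = 0.20) -> "PayBand":
        """Center a band whose total width is `spread` (0.20 = 20%) on a market rate."""
        half_width = market_rate * spread / 2
        return cls(market_rate - half_width, market_rate + half_width)

    def contains(self, salary: float) -> bool:
        """True if the salary falls inside the band (inclusive)."""
        return self.minimum <= salary <= self.maximum

    def headroom(self, salary: float) -> float:
        """Room left for non-performance increases before the band maximum."""
        return max(self.maximum - salary, 0.0)


# Hypothetical: market data suggests roughly $150,000 for this level.
band = PayBand.around_market(150_000)
salary = 140_000
print(band)                          # band spanning roughly $135,000 to $165,000
print(band.headroom(salary))         # room left for tenure/inflation raises
print(band.contains(salary * 1.03))  # a 3% inflation adjustment stays in the band
```

Centering on the market rate is just one design choice. Some companies instead anchor the band minimum or midpoint to a specific market percentile, which is a decision worth making with your HR lead.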
So, you've designed your competency matrix and decided to translate levels into pay
bands. Now, you just need to calculate your pay bands. This is an area you'll
definitely want your HR lead partnering with you on closely to ensure you're
meeting any regulatory requirements that may exist. Defining pay bands should be as
data-driven as possible, and thanks to a handful of existing platforms (both paid
and those that just require data-sharing), gathering that data is relatively straightforward.
Platforms like Pave, Option Impact by Advanced-HR, Levels.fyi, and Glassdoor can
provide rich data sets that can be filtered to match the size/shape/stage of your
company to determine a relevant pay band for a particular role.
Many startup founders will tell you their organization is very flat and that titles
don't mean anything. That may actually be true from time to time in isolation, but
it's the exception, not the norm. At the vast majority of companies, startups
included, there are consistent trends in how titles are used. Assigning titles
creates an expectation for level and scope of responsibility. Titles are also
easily given and hard to take away, so it's worth being thoughtful and considerate
about exactly what title you put on a job description or a promotion.
For non-executive roles, before you decide on titles I first encourage you to
decide what your levels are using only numbers, e.g., Level 1, Level 2, etc., via a
competency matrix (See Competency Matrix and Leveling, page 95). Once you know what
to expect from each of those levels, you can map levels to titles. Don't be afraid
to add a numeric suffix to titles as well; it's easier and clearer to use titles
like Junior Engineer 1 and Junior Engineer 2 than it is to invent a new adjective
that means slightly more experienced than junior but not yet mid-level.
Senior individual contributors often have the informal title of tech lead. Tech
lead implies that some of the individual contributor's time is spent on
management-style responsibilities, but their primary responsibility is still doing
engineering. Rarely is the notion of a tech lead something that is noted in a title
on a résumé or organization chart; it's simply an added responsibility for more
senior employees and is part of the expectation at that level of seniority. If a
tech lead's primary output is management, not code, then they should be on a
management track with a manager's expectations, title, training, coaching, etc.
Management titling has more nuanced implications than individual contributor
titling. The most common titles are software development manager (SDM) or software
engineering manager (SEM), with appropriate seniority decorations, e.g., mid-level
software
engineering manager or senior software engineering manager. An SDM or SEM is
usually responsible for a single team of engineers, who in turn work on a single
feature or product.
Beyond director is the role of vice president of engineering (VPE). There isn't a
universal implementation of VPE. It varies from being the organizational lead of
all engineers at the company (in place of a CTO) to being the strategic technical
lead across multiple product areas. Sometimes the VPE reports to the CTO, and other
times the CEO. What VPEs have in common, though, is the expectation of being
technically very senior, experienced, and skilled at people management, as well as
being a great communicator and strategic thinker.
The challenge is to collect qualitative feedback in such a way that it can be used
to align with your leveling. In this chapter, I present a methodology I've used to
collect feedback that ultimately results in a relatively easy-to-understand scoring
system that can produce individual performance levels.
The "who reviews whom" question has no easy answer. Many companies simply have the
manager do the review, and that's it. While the manager's feedback is valuable, it
can also leave too much room for bias and neglect the equally important perspective
of peers and direct reports. The easiest (though slightly more costly) way to run a
fair and comprehensive process is to have every employee receive multiple reviews
that include these other perspectives, often called a 360 review.
Here I recommend using the straw man technique (see the Team-Based Decisioning
Models section of Mini Management Frameworks, page 33): Each manager should create
a list of direct reports and peers who have enough exposure to that team member to
write a valuable review, then have a conversation with the employee and get
feedback before finalizing the list. Managers should also keep track of how many
reviews each employee is being asked to complete to keep the requests manageable.
Your questionnaire should mirror your competency matrix, and reviewers should be
encouraged to make explicit references to the matrix.
* Where does this person demonstrate room for improvement in this area? What level
do you think this person is performing at in this area?
Note that from an unconscious bias perspective, it's better to ask the reviewer to
enumerate the examples before asking for a level. The alternative may encourage
reviewers to choose a level, then cherry-pick examples to justify the level they've
already chosen.
* How eager and excited are you to work with this person? (Scale: Not excited at
all to Very excited. This question is from Netflix's Keeper Test
[ctohb.com/keeper])
* This person is currently at Level X. Do you feel they are ready for promotion to
Level X+1?
* Are there any other strengths this person brings that you want to highlight?
* Are there any other areas for improvement that you want to highlight?
##### Review Format
You can conduct reviews with or without the aid of a formal review tool (also known
as performance or culture management tools, like Culture Amp and 15Five). Of
course, a purpose-built tool will save time and scale this process quickly for
larger teams. It's crucial to keep all individual feedback anonymous, with the
exception of noting which scores came from management (we'll use those scores
separately as a sanity check against peer reviews later in the process).
Once the reviews have been submitted, a set of scores should be reflected for each
person in each category of the competency matrix, ideally broken out between scores
from peers, direct reports, and managers. Here is an example matrix of scores:
The challenge now is how to aggregate those scores into a final job level
calculation. Some key considerations in this calculation:
* Protect the integrity of the process (e.g., that an employee didn't collude with
their peers to artificially inflate or deflate anyone's scores)
* Confirm that the manager's perspective of the employee's impact aligns with
peer/subordinate feedback
* Decide whether all categories in the matrix are weighted equally or unequally.
Here's the method I recommend to determine a level: Assign the highest level at which
the cumulative score is 66 percent or higher. The cumulative score for a given level is
the percent of all scores that are at that level or higher. The lowest level will
always have a cumulative score of 100 percent, Level 2 will be 100 percent minus
the percentage of votes from Level 1. Level 3 is 100 percent minus the percentage
from Level 2 and Level 1, and so on.
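Here is a minimal sketch of that calculation. It assumes each score is recorded as an integer level (1, 2, 3, ...) and that all categories are weighted equally. It does not handle the manager-versus-peer comparison or category weighting from the list above, so those checks would sit on top of it. The scores are hypothetical, and the 66 percent threshold comes from the method described above.

```python
from collections import Counter


def cumulative_scores(scores: list[int]) -> dict[int, float]:
    """Percent of all scores at or above each level, starting from Level 1."""
    counts = Counter(scores)
    total = len(scores)
    at_or_above = total  # every score is at Level 1 or higher
    result = {}
    for level in range(1, max(scores) + 1):
        result[level] = 100 * at_or_above / total
        at_or_above -= counts[level]  # drop this level's scores before moving up
    return result


def assign_level(scores: list[int], threshold: float = 66.0) -> int:
    """Highest level whose cumulative score meets the threshold."""
    return max(level for level, pct in cumulative_scores(scores).items() if pct >= threshold)


# Hypothetical scores from all reviewers across all competency categories
scores = [2, 3, 3, 2, 4, 3, 3, 2, 3, 4, 3, 1]
print(cumulative_scores(scores))  # Level 3's cumulative score is 8/12, about 66.7%
print(assign_level(scores))  # 3
```

In this example, eight of the twelve scores are at Level 3 or higher, which just clears the threshold. Only two are at Level 4, so the employee lands at Level 3.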
1. Discuss any strengths or weaknesses that were identified that are unexpected or
otherwise not regularly covered in 1:1s.
1. Synthesize a small list of focus areas to work on before the next review period.
Many leaders advocate for a single focus area, but I've seen several individuals
grow in more than one way during the given period, so two or three focus areas is a
reasonable upper limit, as applicable.
1. Establish a schedule for the manager and employee to regularly check in on those
focus areas and ensure there's advancement before the next review.
* PIPs should allow a reasonable time period to demonstrate improvement: say, thirty
days for individual contributors and sixty days for senior staff or managers.
PIPs should always include a complete written version, not only to ensure clarity
between employee and manager but also to provide documentation for HR/legal to have
on record for any subsequent inquiries.
There are a few situations in which you should bypass a PIP process and proceed
straight to termination:
### Firing
Your company should have a clear procedure for how to actually terminate an
employee. My best advice: follow it. This is an area that, if handled incorrectly,
can become a substantial liability to the company. Key considerations include:
**Timing:** Once you've decided to let somebody go, do so as soon as possible. The
common wisdom is that, after letting somebody go, managers typically worry less
about whether or not that person should have been let go and more about whether
they waited too long to do so. Mechanically, it doesn't make much of a difference
what day of the week you decide is best to let somebody go, but if you are able to
save it for the first day of the month instead of the last, you are offering the
employee the benefit of an extra month on the company healthcare plan.
**Witnesses:** Ensure that the manager and HR are present to witness the actual
termination meeting. The meeting should be very short and to the point, and HR
should answer most follow-up questions concerning termination logistics.
**Offboarding:** Develop a plan in advance of termination for how and when to turn
off the employee's access to company systems and recover any company hardware.
## Team Makeup
The key difference in impact between junior and senior talent is the consistency
with which they can reliably solve different kinds of problems. As engineers become
more experienced, their judgment and decision-making improve on larger and larger
surface areas. Similarly, you should expect that more senior talent will develop
solutions that have fewer defects, last longer, and hold up better as requirements
change along the way. That's not to say everyone has to be senior; in
fact, it's rare that a majority of projects involve architecting brand-new
greenfield solutions.
The right blend for any given team considers the types of work to be done and
staffs the team thoughtfully as a result.
Your team should be more heavily weighted with senior engineering talent if your
codebase is:
* Very new, requiring lots of architecture and foundational contract creation
* Very old, poorly maintained, or poorly designed and considered difficult to work
in; in short, a brownfield codebase
* Meaningfully changing in requirements, especially if new requirements do not look
very much like old requirements
* Using new tools, techniques, or patterns that require validation for your problem
* Requiring new patterns or ways of doing work, especially in ecosystems that don't
provide tight guardrails that encourage healthy patterns
On Day 1 of most startups, the team will consist of a small handful of engineers,
typically two or three. Having a team of three leaves limited opportunity for
specialization. There are twenty categories of technical work to do and only three
people, so by the pigeonhole principle at least one person (and more likely all
three) will be doing many types of technical work. Said another way, early on at
your company, everyone is expected to wear many technical hats. As your company
grows and you add more people to the team, you and your employees will find more
opportunities for specialization.
So, you've raised a round of funding and you're looking to expand your team for the
first time. How do you decide whether you need frontend engineers, backend
engineers, DevOps engineers, etc.? Here are some general guidelines:
**Listen to your team:** The people currently doing engineering are very likely to
be vocal about where the biggest sources of inefficiency are, and where the most
help is needed. Your job as a manager is to take in that perspective and
extrapolate going forward. Is the team pointing out a problem that will disappear
in two months, in which case hiring somebody wouldn't be appropriate? Or is the
issue systemic in nature and likely to continue for the long haul?
**Look for factors that are hurting productivity:** If your team is mostly frontend
engineers and you're struggling with backend reliability, then that should be a
sign that you need backend or DevOps engineering help. This same principle applies
to testing, developer experience, etc.
**Specialize with scale:** Until your team is north of a dozen people, chances are
high that you're better off with a team of primarily generalists.
Here are some rough numbers for team composition based on startup experience:
* Team size 1 to 5
    * Your team is all generalists, specialized at most between frontend, backend,
      and mobile.
* Team size 5 to 15
    * Your team is specialized by product or general skill area such as backend,
      frontend architecture, frontend design, DevOps, and testing.
    * You'll likely want to start thinking about dedicated resources in testing and
      DevOps when you've grown to (or past) fifteen engineers.
* Team size 15 to 30
    * You should have real specialization by this point and be hiring only people
      with expertise in a subfield of software engineering.
    * At this point, any inefficiencies in how work gets done are likely to start
      getting very expensive across the team, so make sure you're investing either
      headcount or time in ensuring that developers are able to get work done, their
      tools work, and operational logistics are streamlined.
If you are shipping end-user software, your engineering team has to strike a
balance between doing new work and handling support tickets that come in from
active customers. Left unchecked, the need to handle support tickets can become a
major distraction to the team, hurting efficiency, draining morale, and burning out
your best people. There are many right answers to solving this problem; the
important thing is that you recognize its effect on your team and architect a
solution to help them be productive and drive great customer outcomes.
Microsoft has published a great article on this topic titled "Building productive
teams" (ctohb.com/teams), describing what they call the "two crews" model. The two
crews model outlines a feature crew and a customer crew.
The feature crew focuses on the future, building new features. The customer crew
focuses on the present, working on active customer issues, diagnosing bugs, and
prioritizing site health.
Other names for the customer crew include maintenance team or Tier 2 support team
(where Tier 1 is your non-technical customer support staff).
Splitting maintenance work off into its own team has many benefits:
* It allows for a dedicated team to monitor the customer queue at all times,
triaging and resolving anything important and urgent.
* It allows your feature team to remain 100 percent focused on the future,
undistracted by customer support work.
The first question I get asked about the two crews approach is: how long does
somebody stay in the customer crew? There are four approaches to determining
customer crew tenure:
* The customer crew is a permanent team: you publish job descriptions for engineers
focusing on support and debugging. Note that for many engineers, a job focused only
on debugging may sound undesirable. To me, a job description that emphasizes data
entry or accounting sounds very unpleasant, and yet there are many people who enjoy
and even pursue those jobs.
    * Don't assume that just because you wouldn't do that job, there aren't others
      who might be excited by the prospect. In particular, working on a customer crew
      exposes an engineer to a huge amount of code, often offers opportunities to talk
      to customers, and involves less product-driven deadline pressure, all of which
      might appeal to the right candidate.
* Engineers rotate between the crews on a regular basis. The Microsoft blog post
referenced above recommends swapping some team members between the two crews every
week.
* Define the customer crew as a temporary team. This can mean either that the
customer crew itself doesn't exist full-time (perhaps for only one week per month),
or that team members are constantly rotating between the customer and feature
crews.
In general, you'll find technical organization charts organized in one of two ways:
functional organization and product organization. A functional group is organized
around the type of work team members do, such as frontend engineering, testing, or
a particular internal service. Product groupings, sometimes called business units,
are organized around a particular business-related/product focus, such as the
enterprise core application team or the consumer mobile app team. How you choose to
organize your team can have a significant impact on team collaboration,
productivity, and morale. Product-organized teams, often called pods, are the right
answer most of the time.
When you're designing an organization chart you should consider what you're
optimizing for. For a startup, the primary goal of an organization chart is to
ensure that different people who need to collaborate closely with one another are
enabled and encouraged to do so by the organizational structure. The best way to
achieve that is to have all the people who directly contribute to the feasibility
and success of a product be organized together, to hold them accountable to a
common set of goals, and to develop a shared sense of ownership over that product.
When your team is small, and you have just one product, the question of how to
organize is moot: it's one cross-functional team all working on the same product. By
the time your team has grown to 12 or more people, you'll need to start being more
deliberate about defining what a pod is, and finding a method that is easy to
understand and grounded in your product reality, before breaking your team out into
pods.
There are enough successful 100 percent remote software engineering teams that it
is incontrovertible that remote organizations can work. That's not to
say all remote teams are successful, or that it's necessarily easy to build an
entirely remote culture. I've spent nearly a decade managing remote teams. Below
are some recommendations that should apply to most remote management scenarios.
#### Documentation
Being remote means that there's nobody sitting next to you to answer questions, but
that doesn't mean the questions go away. Those questions still get asked, only now
social context is lost and so perhaps the question goes unanswered for a while. Or,
rather than holding questions until the other person takes a break, now they get
asked immediately, prompting various kinds of notifications and distracting from
focus work. Having a robust set of internal documentation with an effective search
feature can speed up time-to-answer and reduce the number of one-on-one context
switches that become barriers to getting work done.
A great way to turn remote work into an asset rather than a liability is to lean
into asynchronous working practices. A strong asynchronous culture reduces the
burden of time zone mismatches and reduces the amount of time spent in remote
meetings/dealing with remote context switches. (See Benefits of Overcommunication,
page 20, for more on asynchronous communication and the value of asynchronous
work.)
As useful as they are, video calls are no substitute for sitting down and having a
meal as a team. The bonds that are formed by in-person meetings tend to endure over
a long period of time, so an investment in even infrequent in-person meetings can
improve the quality of social relationships between team members for months. As a
general rule, it's healthy for a team to meet in person once per quarter to
maintain these relationships and minimize remote social frustration.
My overall recommendation is for teams that are working on the same project to have
a minimum of four working hours of overlap. That leaves a sufficient window for any
regularly scheduled meetings and an opportunity for ad-hoc conversation and
questions.
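To check a pair of schedules against that four-hour guideline, you could use a small helper like the one below. It is a minimal sketch that assumes each person's day is given as local start and end hours plus a fixed UTC offset, and it ignores daylight saving time changes for simplicity. The cities and hours are hypothetical examples.

```python
def to_utc_window(start: float, end: float, utc_offset: float) -> tuple[float, float]:
    """Convert local working hours to a UTC window whose start lies in [0, 24)."""
    utc_start = (start - utc_offset) % 24
    return utc_start, utc_start + (end - start)


def overlap_hours(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Hours two daily UTC windows overlap, including windows that wrap past midnight."""
    total = 0.0
    for shift in (-24, 0, 24):  # compare b on the previous, same, and next day
        start = max(a[0], b[0] + shift)
        end = min(a[1], b[1] + shift)
        total += max(0.0, end - start)
    return total


# Hypothetical team members working 9:00 to 17:00 local time (fixed offsets, no DST)
new_york = to_utc_window(9, 17, -5)
london = to_utc_window(9, 17, 0)
san_francisco = to_utc_window(9, 17, -8)
tokyo = to_utc_window(9, 17, 9)

for name, other in [("London", london), ("San Francisco", san_francisco), ("Tokyo", tokyo)]:
    hours = overlap_hours(new_york, other)
    verdict = "ok" if hours >= 4 else "below four hours"
    print(f"New York / {name}: {hours:g}h ({verdict})")
```

If two people fall short, you can often close the gap by shifting one person's hours rather than changing the team's structure.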
#### Create Social Opportunities
In general, people's default mode with video calls is to multitask during the call
and then get off the call as fast as possible. Once a work conversation is over,
everyone hangs up. That isn't how people interact in person; for example, the
meeting ends and then you talk about sports in the hallway when walking back to
your desks. That bit of social time before/after meetings is a valuable way for
people to build relationships and trust, and it won't happen unless the leadership
and culture actively support it. A cheap and easy way to do that is to regularly
ask a lighthearted, round-the-room, icebreaker-style question, such as "What's one
bit of good news, personally and professionally, you can share?"
You can support remote social building with remote happy hours, or virtual team
dinners and social activities. Post-COVID-19, there are many online social
facilitators who run remote events ranging from digital casino nights to virtual
escape rooms.
#### Camera On
## Leadership Responsibilities
Design teams would ideally like their work to be implemented faithfully,
pixel-perfect, by the software engineering team. Absent any sort of structure or set of
constraints, actually achieving pixel perfection is expensive or even impossible;
however, a bit of shared understanding and a design system can make the task
dramatically cheaper.
A design system is a set of standards for managing design at scale using reusable
components and patterns. Large companies tend to create their own design systems.
Atlassian, for example, makes theirs available publicly (ctohb.com/design). As a
small startup, innovating on design systems is likely not key to your success, so
it follows that you should use an off-the-shelf design system. Nowadays you can
choose from a plethora of off-the-shelf systems with rich feature sets and a
variety of aesthetic styles that, more likely than not, can be customized to match
your brand.
Not only do design systems provide a time-saving set of guardrails and components
used by designers, they often come with out-of-the-box support for various frontend
languages and frameworks. Material Design, for example, has a published design
system in Figma (ctohb.com/figma), as well as a set of React components
(mui.com) and Angular components (material.angular.io).
By adopting a system like Material (or AntD, Chakra UI, Blueprint, Bootstrap,
Semantic UI, etc.) you not only get a suite of prebuilt technical components but
you also get a system that integrates neatly with designer tools. By integrating
the same system with the tools of both engineers and designers, you'll ensure that
what your designers create will map cleanly to the components available to
engineers. This cuts down or even eliminates the need for engineering to do custom
styling or frontend UI, and makes it easy to match a design down to the pixel.
Beyond the efficiency gained by having consistency between design and engineering,
most design system component implementations also take into account or
automatically solve for other design priorities, such as accessibility (both for
screen readers and color contrast management for the colorblind), adherence to UI
standards, and even out-of-the-box dark mode support.
### EPD
EPD is an industry acronym for Engineering, Product, and Design. The implication is
that all three departments are essential to the product development lifecycle and
need a healthy way of working together to produce great results. From the
business's perspective, all three departments together are responsible for
producing a product the customers love, so it's helpful to have a single person
with a single set of business goals setting the direction for all three units.
Product and project management are industry terms with distinct meanings. A product
manager is accountable for the design and creation of the product as well as the
key performance indicators (KPIs) the product should meet for the business. Melissa
Perri's *Escaping the Build Trap* is a phenomenal resource that dives deeply into
the role and impact a great product manager can have on your organization. A
project manager is accountable for guiding the internal organization of the team,
managing internal communication, and adhering to the roadmap and deadlines.
Early on in your startup, as CTO you may be filling both of these roles. Very early
in your hiring roadmap you should plan to have a great product manager who can take
some of these responsibilities from you. Some product managers excel at project
management, while others don't find joy in it and therefore don't put time and
attention into it. Arguably, early on at your startup, it's okay either way; with a
small team, the consequences of lax project management are minimal. However, as you
get larger, formalizing project management becomes more important, and if your
product manager isn't filling the role, you should aim to augment them with a
project manager.
There's an industry expression that your company is ultimately run by your middle
managers. The implication is that, despite whatever strategies and processes
executives put in place, it's the middle managers who ultimately have the highest
impact on the quantity and quality of output. Middle managers hire individual
contributors, set their day-to-day objectives, and hold them accountable for
quality standards. The best middle managers are mini executives, focused on
culture, building collaborative teams, and working to enable them to do their best
work. It follows then that, as an executive, you should put a lot of effort into
hiring, managing, and training those middle managers.
The complicated topic of training managers deserves an entire book on its own, and
it's a skill that takes time to master. That said, the top two lessons I can offer
you, which will provide leverage for all the other skills here, are to set an example
yourself and to build a culture of continuous management learning. The minute you hire
or promote somebody into management, you should make it clear that your expectation
is that they will commit to putting time and effort into refining their craft of
management, and that you'll go out of your way to make that easy for them by including
management training in their personal goals and development plan, providing them
resources to level up on management skills, helping them work through management
problems, and teaching them everything you know.
Part of your role is likely shouldering accountability for the present and future
cost of the software engineering department (and sometimes design and product
departments). As a responsible steward of that budget, you should know how much was
spent on various items in the past, and how much your company as a whole has
allocated for the future. Most importantly, you'll need a plan to justify spending
that allocation wisely.
In general, you'll find the two largest line items in a technical department are
people (payroll) and infrastructure/SaaS. Keep in mind that the actual cash cost of
an employee is higher than just their salary, as it will include benefits and
payroll taxes. A good rule of thumb is that the cash cost of an employee is 20
percent above their salary; this percentage is referred to as the burden rate.
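Here is a quick worked example of that rule of thumb. The 20 percent default comes from the text above, and the salaries are hypothetical. The real burden varies with your benefits package and payroll tax obligations, so the rate is left as a parameter.

```python
def fully_loaded_cost(salary: float, burden_rate: float = 0.20) -> float:
    """Approximate annual cash cost: salary plus benefits and payroll taxes."""
    return salary * (1 + burden_rate)


# Hypothetical: three engineers at $150,000 each
salaries = [150_000, 150_000, 150_000]
print(f"Salaries:  ${sum(salaries):,.0f}")  # $450,000
print(f"Cash cost: ${sum(fully_loaded_cost(s) for s in salaries):,.0f}")  # $540,000
```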
Most finance teams use sophisticated accounting and budgeting software to manage
the company's books. If not, they'll have a company-wide spreadsheet that's far
larger and more complex than you need as CTO. In my experience, finance departments
are not usually willing to let people outside of finance make changes to the core
budgeting system, so unless they give you something to start with, it's on you to
make a financial model for your department.
Given that your department's costs are fairly predictable, and centralized in a few
line items, the model you make doesn't have to be very sophisticated. My
recommendation is that you maintain a spreadsheet that includes the following:
* Payroll tab
* SaaS/Cost of Goods Sold (COGS) tab
* Infrastructure tab
* Other tab (including travel and hardware)
* Summary tab
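A spreadsheet is the right tool for this job, but the same structure can be sketched in code to show how the tabs roll up into the summary. Every line item and figure below is hypothetical. Loading the payroll tab with a 20 percent burden rate is an assumption carried over from the rule of thumb above.

```python
# Hypothetical annual figures for each tab of a department budget model
BURDEN_RATE = 0.20  # rule-of-thumb overhead for benefits and payroll taxes

budget = {
    "Payroll": {"Engineer A": 150_000, "Engineer B": 140_000, "Engineering manager": 170_000},
    "SaaS/COGS": {"Error monitoring": 6_000, "CI service": 12_000},
    "Infrastructure": {"Cloud hosting": 90_000},
    "Other": {"Travel": 15_000, "Hardware": 9_000},
}


def summarize(budget: dict[str, dict[str, float]], burden_rate: float) -> dict[str, float]:
    """Roll each tab up to one total, loading payroll salaries with the burden rate."""
    summary = {}
    for tab, items in budget.items():
        total = sum(items.values())
        if tab == "Payroll":
            total *= 1 + burden_rate
        summary[tab] = total
    summary["Total"] = sum(summary.values())  # computed before the "Total" key exists
    return summary


for tab, total in summarize(budget, BURDEN_RATE).items():
    print(f"{tab:<15} ${total:>10,.0f}")
```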
If your company is like most startups, you'll be the most expensive department, and
your CFO should be well aware of that. Some things to consider that are helpful for
you and will make your CFO your best friend:
* Help your CFO know how money is and will be spent. Avoid surprises wherever
possible. Provide guidance for things like hardware cost, travel/conference cost,
and cloud cost upfront.
* Maintain a budget for your department and keep it up to date with changes in your
forecast.
* Update your budget regularly with actuals from the finance department and ensure
the delta between forecast and actual is understood and managed.
* Establish a plan for hiring and include estimated salaries.
* SaaS bills are often a burden to track. Consider either using a credit card
statement analysis tool (i.e., a SaaS Management Platform, aka SMP) or hiring an
assistant to regularly categorize and reconcile these expenses with your budget.
* Finance departments often care a lot about cost attribution, for example, to
differentiate costs that are part of Cost of Goods Sold (COGS). Noting a very
coarse-grained "why" for each line item in your budget can win you friends in finance.
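For the forecast-versus-actual check in the list above, a minimal sketch might compare each line item and flag those whose delta exceeds a tolerance. The 10 percent tolerance and all figures are hypothetical. Spend with no forecast at all is always flagged.

```python
def flag_variances(forecast: dict[str, float], actual: dict[str, float],
                   tolerance: float = 0.10) -> list[tuple[str, float, float]]:
    """Return (line item, delta, delta as a fraction of forecast) for items outside tolerance."""
    flagged = []
    for item in sorted(forecast.keys() | actual.keys()):
        planned = forecast.get(item, 0.0)
        spent = actual.get(item, 0.0)
        delta = spent - planned
        # Unforecast spend has no meaningful ratio; treat it as infinitely over plan
        ratio = delta / planned if planned else (float("inf") if delta else 0.0)
        if abs(ratio) > tolerance:
            flagged.append((item, delta, ratio))
    return flagged


# Hypothetical monthly forecast and actuals
forecast = {"Cloud hosting": 7_500, "CI service": 1_000, "Travel": 2_000}
actual = {"Cloud hosting": 9_100, "CI service": 980, "Travel": 1_200, "Design tool": 450}

for item, delta, ratio in flag_variances(forecast, actual):
    pct = "unforecast" if ratio == float("inf") else f"{ratio:+.0%}"
    print(f"{item}: {delta:+,.0f} ({pct})")
```

Underspend is flagged as well as overspend, because both mean the forecast needs updating.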
I have yet to encounter a technical team or leader who has managed to consistently
and usefully quantify actual engineering output. This includes measuring velocity
as the sum of estimates of completed tasks. (See the Estimates section of Workflow,
page 160, for a discussion on the unreliability of technical estimates.)
I have, however, seen many companies effectively measure engineering health and
contributing factors to engineering velocity: namely, cycle time and work time
allocation.
In general, teams with short cycle times are more efficient and able to iterate, innovate, and
deliver value to customers faster. There are many tools that facilitate measuring
cycle time, including LinearB (linearb.io) and Code Climate (codeclimate.com).
LinearB has published a set of benchmarks, using data from thousands of engineering
teams, on metrics associated with cycle time.
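The tools above compute this for you, but the underlying measurement is simple. This minimal sketch assumes cycle time is measured from a pull request's first commit to its production deployment, and that you can export those two timestamps per pull request. Definitions vary between tools. The records below are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical export: (first commit, deployed to production) per pull request
pull_requests = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 5, 15, 0)),
    (datetime(2024, 3, 4, 13, 0), datetime(2024, 3, 8, 11, 0)),
    (datetime(2024, 3, 6, 10, 0), datetime(2024, 3, 6, 17, 30)),
]


def median_cycle_time(prs: list[tuple[datetime, datetime]]) -> timedelta:
    """Median elapsed time from first commit to deployment; the median resists outliers."""
    durations = [deployed - first_commit for first_commit, deployed in prs]
    return median(durations)


print(median_cycle_time(pull_requests))  # 1 day, 6:00:00
```

The median is used rather than the mean because a single long-running pull request would otherwise dominate the number. Tracking the trend over time is more useful than any single reading.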
The idea of measuring work time allocation is that, while measuring how much work
is done is difficult, it's comparatively easy to measure what types of work team
members are spending their time on. This resulting information is directionally
useful. For example, if a team is spending the majority of its time addressing
bugs, then it's a good hypothesis that improving software quality and bringing down
that time percentage will result in more time allocated to developing new features,
and thus an overall improvement in health and velocity.
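Here is a minimal sketch of the measurement. It assumes each completed ticket is tagged with a work type and an approximate time spent, for example from your issue tracker. The categories and hours are hypothetical.

```python
from collections import defaultdict

# Hypothetical completed tickets for one sprint: (work type, hours spent)
tickets = [
    ("feature", 16), ("bug", 6), ("bug", 10), ("feature", 8),
    ("tech debt", 4), ("support", 3), ("bug", 5), ("support", 8),
]


def allocation(tickets: list[tuple[str, float]]) -> dict[str, float]:
    """Share of total time spent on each type of work, as a percentage, largest first."""
    hours: dict[str, float] = defaultdict(float)
    for work_type, spent in tickets:
        hours[work_type] += spent
    total = sum(hours.values())
    ranked = sorted(hours.items(), key=lambda kv: kv[1], reverse=True)
    return {work_type: 100 * h / total for work_type, h in ranked}


for work_type, pct in allocation(tickets).items():
    print(f"{work_type:<10} {pct:5.1f}%")
```

In this made-up sprint, bugs take about a third of the team's time. That would support the hypothesis above that investing in quality could free up time for feature work.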
Generally speaking, as CTO, your role in fundraising and due diligence is fairly
minimal. At most startups, the CEO and perhaps the CFO do the lion's share of that
work. Your involvement likely comes toward the end of the process as investors do
due diligence on the company. Many (though sadly not all) of the requests made in
due diligence are for information that a well-organized engineering team already
maintains as part of doing business. Keep an eye on the following, and be ready
to produce for due diligence:
* Organization chart
* Department budget
* Full description of all products engineering has created and maintains
* Engineering roadmaps (usually they're looking for short/medium-term roadmaps)
* A list of major areas of tech debt, what I've labeled a tech debt balance sheet
(see Tech Debt, page 145)
* High-level system architecture diagrams
* Full description of how software is distributed and updated, either as SaaS or as
versioned desktop/mobile software packages
* A high-level description of systems, how they're hosted, and your security
practices
* Information about software licensing, including code scans of company code
confirming no license violations or unlicensed proprietary software are present
It's not uncommon for an investor to hire a third-party firm to conduct a technical
diligence audit. These audits may involve interviews or meetings with you and
perhaps a few other senior members of the team. Be prepared to discuss your
engineering process and the general productivity of the team, and to do a code
walkthrough of parts of your system.
In general, the responsibility for sourcing, negotiating with, and managing
third-party technical vendors will fall to you as the CTO. For most, this is an
unenjoyable but necessary part of the job, and consequently, it doesn't get much
thought. Performing this responsibility well, however, can provide significant cost
savings for a business. Here, I'll walk you step by step through a typical SaaS
negotiation/signing process and provide some tips for how to add efficiency and
save cost.
Typically, either you or a member of your engineering team will highlight a need
for a type of tool, and if it's one of your engineers or managers they'll ask
either for you to approve the expenditure or for you to sign up for the tool. Many
tools are inconsequential in cost and have trivial self-service signup flows. I
recommend you work with your budget and finance team and set a threshold below
which your managers are authorized to simply sign up for the tool independently.
For tools that are above the cost threshold or don't have self-service signup,
you'll need to enter into a discovery and negotiation process with an enterprise
vendor, which typically has four steps.
You're now facing an enterprise sales process. Before reaching out to the vendor's
sales team, be sure that this particular company is the one best suited to solve
your problem. In most instances, I recommend starting a spreadsheet, doing some
diligence on all the vendors in the space, and filtering down to a top two or
three. Even if one of them is by far your preferred choice, it doesn't hurt to have
deeper context on the space; that context also gives you knowledge and leverage
when negotiating, including a clear BATNA (best alternative to a negotiated
agreement).
##### Sales Qualification
When you first reach out to an enterprise vendor, you'll almost certainly enter what
is called a sales qualification process. At this step, especially as a technical
executive, your needs are not yet aligned with the vendor. You're probably looking
for pricing, a contract, and the shortest path to getting started. The vendor is
looking to make sure you're their target customer and are likely to sign up and not
churn for at least a few years.
Most SaaS companies will have frontline sales representatives take the
initial phone call, and their primary objective is to learn about your company and
see if you fit their profile. They might not have much technical knowledge, and
often they won't be able to discuss pricing with you. As a result, you're not
likely to get much value from this first meeting. My advice is to either delegate
these intro meetings or try to get the sales qualification questions from the rep
via email and answer them thoroughly enough to skip the intro meeting altogether.
##### Negotiation
Once the vendor has validated that you fit the mold of their target customer,
they'll schedule a second meeting with more senior resources on their side.
Typically, this would be a sales manager or perhaps a technical sales
representative. This is the stage at which you'll start to get answers to more of
the technical questions you have about the solution, and you'll get some
transparency into the pricing models the vendor is authorized to use.
* Keep in mind that, at the end of the day, a salesperson's job is to sell you the
product, so you're on the same team when it comes to landing on terms that are
mutually agreeable. The more transparent you can be about what matters to you, the
better they'll be able to craft a sales agreement that meets those needs.
* Don't undervalue factors other than total cost, such as total contract length
(longer contracts should come with steep discounts), payment frequency, payment
terms (e.g., net 30, net 90), or what happens to contracts in the event of a change
in control of either company.
* In general, look for contracts that grow as you grow. Ideally, the initial cost
is low and grows over time as the tool is used more heavily and provides more
value. Encouraging a "we grow together" mindset can help reduce those initial costs.
* Keep your finance and compliance teams in the loop. If your CFO is great at
negotiating, let them handle this part of the process.
* If your total SaaS budgets are large or you're negotiating a lot of these deals,
consider using a third-party negotiator, such as Vendr. These negotiators often
charge based on how much they save you, and they have pricing data from many more
contracts than you do.
* Be aware that end-of-month/quarter quotas are real and discounts around that time
are very common.
##### Signing
Once you've agreed on terms and exchanged paperwork, the next step is to find out
who the authorized signers are for your company (assuming you're one of them), get
the documentation signed, and keep it organized.
##### Post-Sales
After the deal is signed you will likely be handed off to a different
representative at your vendor, someone with a title along the lines of post-sales
support or customer success manager (CSM). These individuals are incentivized by,
measured by, and focused on customer retention or product upsells. They're
knowledgeable about the product, and at least somewhat technical. Going forward,
they'll serve as your advocate for new features and getting defects resolved.
Whether you're the CTO, a CEO playing the role of CTO, or a founder hoping to hire
a CTO, it's helpful to have a clear understanding of what exactly it is a CTO does
in a startup, and how that role differs from executive and leadership roles in more
established companies. As with most things, the answer depends on context and will
change over time. Calvin French-Owen has a great article (ctohb.com/founder2cto)
breaking the CTO role down into four archetypes: People Leader, Architect, R&D, and
Marketing/Consumer-facing. I like this breakdown and will refine it a bit here into
three types:
* Tech-Focused
* People-Focused
* Externally Focused
The tech-focused CTO may run an office of the CTO, leading an internal technical
skunkworks whose primary output is forward-thinking strategy, architecture, and
sometimes proof-of-concept implementations that will help the business down the
road. This CTO will have fewer reports, with the bulk of the engineering
organization reporting to a separate vice president of engineering. In this case,
it's not uncommon for the vice president of engineering (VPE) to report to
somebody other than the CTO; most commonly, that's the CEO or a chief product
officer (CPO).
The internal tech-focused CTO may also be the chief technical process architect,
setting up tools, systems, and processes for how technical work gets done.
As an internal tech-focused CTO, you may also function as a product manager if your
company's product is highly technical in nature (e.g., developer tools,
API-as-a-service, etc.).
A typical startup will not have both a VPE and a CTO, so it often falls on the
CTO to fill the VPE role. This is also often the hardest role for a cofounder CTO
to fill, as the responsibilities of the day-one technical cofounder don't overlap
much with the responsibilities of the internally people-focused technical
leader.
The people-focused CTO is responsible for setting internal technical culture, the
hiring process, and overseeing internal processes. This CTO spends much of their
time actually managing either individual contributors or other technical managers.
This is the most critical of the three focus areas to get right. If a company isn't
hiring well, or its technical staff is poorly managed, unmotivated, unfocused, or
not aligned, it can impact productivity, and even lead to bad decision-making that
will hurt the organization in the long term.
### The Externally Focused CTO *AKA The Head of Technical Sales/Marketing*
This is likely the least common focus for a startup CTO, but no less critical at
the right time or place. Most often, you see this CTO at companies that build
products for a technical audience (developer tools, for example). These CTOs spend
lots of time writing blog articles or speaking at conferences. Perhaps they are
brought into sales meetings to act as an executive technical representative to
close large deals. Note that building a brand around the technical team not only
has a positive effect on product sales, but also can be a great recruiting tool and
can reduce your time and cost of hires.
This is perhaps the easiest role for the founder CTO to step into, as they have the
historical context for the company and can most genuinely and passionately tell the
company story and evangelize its product and value.
Ideally, a startup CTO will prove adept in all three areas, though most people will
specialize in just one or two. If your business needs a focus that isn't your
expertise, it may be worth asking yourself if you can execute that task more
effectively by delegating it to a coworker. Especially at an early-stage startup,
most technical cofounders and cofounder CTOs will be internal tech-focused. Usually,
there's not much other work to be done at this stage!
It's a very common pattern for that person to find themselves stretching into
internal people or external focuses as the company grows, and that's not always a
desirable transition for that person. Admitting that your specialty or your
motivation is in tech, and that people management isn't a great fit, is not a
personal failure; quite the opposite. Identifying your superpower and architecting a
role in the company to leverage it is how you add the most value, and the company
should hire somebody else whose superpower is, for example, people management to
fill that role.
If you find yourself stuck in a role you're unhappy with, it's vital that you
acknowledge the mismatch both to yourself and to your CEO. This doesn't mean giving
up your position as a founder or leader, or as a respected and high-impact person
within the company.
* External Focus: If you don't already have somebody in-house who is doing this
better than you, then hire a developer evangelist and empower them to fulfill the
role. Otherwise, promote from within and ensure it's clear what the role entails.
* Internal Tech: If you have a senior engineer or architect on the team whom
you can empower/elevate to a position of technical leadership, that's worth
considering. Otherwise, a technical architect should be near the top of your hiring
priorities.
In many cases, the end outcome may mean hiring a CTO whose superpower is better
aligned with what the company needs at the time. In other cases, it might mean
hiring a very capable people-focused VP of Engineering to complement a highly
technical CTO.
As a technical leader, it falls on you to ensure the team as a whole is running
well. What "running well" means will vary by team and circumstance, but there are general patterns and trends
that correlate with high performance. In this section, my goal is to provide
guidance on situations common to all tech leaders.
I encourage all leaders to adopt the general leadership style known as *servant
leadership*. As a servant leader your main focus is on serving the needs of your
team. This means focusing on empowering others, building
In general, you want your team to be spending time on things that move the needle
for the business, and it's your responsibility as technical leader to create an
environment where your engineers can focus their efforts in this way with
consistency and minimal distractions. This frame may seem obvious, but when used
properly, it's a powerful tool for decision-making.
For example, say your team is debating between using an off-the-shelf framework for
the backend vs. writing something from scratch. There's a nuanced list of pros and
cons one could make that would discuss things like added flexibility from building
from scratch vs. shorter time to first deployment with the framework. In this
hypothetical reality, what your business really needs is to iterate on the frontend
and optimize the customer journey, so every moment spent on the backend is one
moment less spent moving the frontend forward. If the out-of-the-box solution is
good enough to power that frontend iteration, use it. Even if you have to throw away
the framework and rewrite the backend in eighteen months with a custom solution,
iterating quickly on the frontend now and finding product-market fit earns your
business the right to do that rewrite down the line.
*2. Use Reliable Tools Where You Can and Innovate Where it Counts*
Also referred to as "don't reinvent the wheel" and "standing on the shoulders of
giants," the idea here is to use off-the-shelf components (libraries, cloud
services, applications, packages) whenever possible. Using off-the-shelf components
involves an inherent tradeoff between ease of getting started and customizing a
solution to match your exact problem. I contend that in most circumstances where an
implementation already exists, the tradeoff leans heavily towards using the
off-the-shelf service, library, or application.
In the rare circumstance where you will ultimately need to rewrite or heavily
customize the dependency, the experience you had with the off-the-shelf tool will
be very valuable in influencing and speeding up the design of the custom build.
Your goal as the architect of your team is to ensure that your team spends as much
of its time working on code that produces value for the business as possible. There
are a number of time-consuming tasks developers do on a regular basis that are
necessary for writing code but don't themselves add business value (e.g.,
replicating production environments, getting code to run locally, figuring out how
to run test suites, provisioning feature branch environments in the cloud, tracking
down bugs that are not covered by tests, etc.).
These types of tasks impose a constant tax on overall productivity. You can avoid
paying the tax by making an investment in automating these tasks whenever they are
identified. Encourage your team to document whenever they spend more than thirty
minutes on nonproductive technical work, and make space in your process for
automating those tasks so nobody loses that time again.
Under the heading "Frequency Reduces Difficulty," Martin Fowler expounds on the
phrase "If it hurts, do it more often" (ctohb.com/fowler). Any process or task that
is painful, error-prone, or otherwise costly for your team, Fowler contends, is a
symptom of that task being underdeveloped. Without pressure from you, painful
technical tasks tend to be the last ones volunteered for. As a result, they're
neglected, and the pain gets worse over time. Alternatively, if your team culture
emphasizes prioritizing these painful tasks, then more effort will go into
automating, documenting, and improving those tasks, making them ultimately less
painful or even entirely automated. As Fowler points out, doing tasks more
frequently also provides more feedback on them and builds skill with practice, all
of which further reduce the difficulty and pain of the task.
For many teams, releasing code to production is an example of a task that is often
painful. Releasing code can require hours of time to actually deploy, so it's
done infrequently. If, however, your team subscribes to the "If it hurts, do it more
often" philosophy, then you as the technical leader will push for regular releases
at faster intervals. If you start with a weekly release, then the first week the
team will feel the pain, and the second week will be just as painful as the first,
but perhaps the engineers will notice an opportunity to automate a piece of the
deployment. By the third, fourth, or fifth release, you'll likely have an entirely
new set of scripts and infrastructure to get code out the door, enabling you to
accelerate the release frequency to twice a week. After a while, you'll be able to
get code out the door with the push of a button. Releasing more than once per day
is referred to as Continuous Deployment (see Continuous Deployment, page 216).
I recommend that you formalize an RFC process and provide some guidance as to what
kinds of decisions should be put through the RFC process.
A formal process can be as simple as a checklist and a template document that
specify where to put your copy of the document, how feedback and comments are
collected, and how the proposal is voted on, finalized, and standardized.
My suggestion is to lean into the tools you have and, for example, create a
markdown document in source control that acts as the RFC template. A new RFC would
then be a pull request on that repository introducing a new markdown file with the
proposal. It is then somewhat natural to collect votes as approvals on that pull
request, and finalization of the RFC is when the pull request is ultimately merged.
Alternatively, if you've set up an internal wiki, you can create an RFC there, and
use the wiki's comment system to collect feedback.
You should also set clear expectations for what kinds of decisions get put to RFC.
I recommend limiting it to high-level processes, technical opinions, and culture,
and not using it for tool choice or system architecture. Some good examples of
topics to RFC:
I encourage you to include a section in your RFC template on how the results of an
RFC are institutionalized. For example, if the RFC proposes standardizing a
technical opinion for your team, once the RFC is ratified, that opinion should be
incorporated into your engineering guidebook and onboarding documentation, so it
becomes canon for current and future employees as well.
If the only benefit of attending a conference were that the employee learned a bit
more about a relevant technical topic that they're passionate about, the cost would
be worth it. There are, however, many more benefits.
Have you ever had the experience of trying to work through a problem and, while
explaining it to a coworker, figured out the solution? Rubber Duck Debugging
(ctohb.com/rdd, ctohb.com/tpp) is a simple practice that attempts to replicate this
phenomenon without involving or context-switching a coworker.
Rubber Duck Debugging is the process of working through a problem by first speaking
it out loud, perhaps to a real, physical rubber duck sitting on
your desk. The idea is that by speaking the problem out loud, often one will hear
the flaw, or see a solution to the problem, that they hadn't considered when the
problem was only in their head. The rubber duck approach potentially saves a
coworker an interruption and gets you an answer faster.
As the technical leader of your team, I encourage you to regularly make these mini
video explainers, especially at times when you're making changes to the
architecture of the platform. Organize all the videos in a library in your internal
wiki. Make sure they're complete and standalone but short and to the point; if you
don't get around to critical content until the end of an overlong video, your team
members might never see it. Also, make the approach as consistent as possible so
viewers know what to expect.
I guarantee you'll get relatively poor watch rates upfront as you make and
distribute the videos, but over time that library will provide tremendous value as
a source of reference knowledge and will get views that save you and your team
meaningful amounts of time.
## Tech Debt
San Francisco's Golden Gate Bridge is made out of steel, which is not actually
golden in color. The bridge is painted, and maintaining the iconic color of the
bridge is so important to San Franciscans that they paint it continuously
(ctohb.com/painting). Once repainting is finished, the process immediately
restarts. This form of continuous investment or perpetual maintenance is what's
required to keep the most important and sophisticated systems performing to
expectations, from the Golden Gate Bridge to your team's software project. For your
project, though, the maintenance doesn't require paint buckets; it comes in the form
of paying down technical debt.
Every feature a software development team delivers brings with it some level of
need for future work, or debt. That debt can take the form of bugs that need
fixing, fast-follows to the feature to deliver incremental customer value, or
sloppiness in the code that should be fixed to improve maintainability,
performance, or security. A certain amount of debt naturally accrues even if your
team is out on vacation: security vulnerabilities in dependent software are found,
packages go out of date, new versions of tools are released, third-party APIs are
deprecated or changed, etc. Debt is unavoidable and you need to account for it.
Another way to think of tech debt is like financial debt, such as a mortgage on a
house. When you take out a mortgage to buy a house, you're making a deliberate
decision to take on debt, knowing the consequences (interest), to enable you to do
something you want now (get a house). Then you pay down that debt on a consistent
basis over an extended period of time (monthly payments).
The same happens with technology debt. Your startup may accumulate it deliberately
as part of a conscious tradeoff, and part of that tradeoff is establishing a
realistic plan for paying it down. Apply the same logic to technical debt that you
would to financial debt: either pay it off upfront because you have extra resources
(and no better place to put those resources), pay it off continuously over time, or
pay it all off down the road, perhaps at a higher total price that includes
interest.
However you choose to pay down your tech debt, the key to doing so successfully is
to recognize that debt is an inevitable part of the software engineering process,
and proactively paying down debt is a necessary investment in overall engineering
health.
Don't be afraid of debt. It can serve a purpose. For example, when building Version
1 of a product that's not yet been validated in the market, a technical team may
decide on an architecture that will not scale past a hundred users. If that
decision allows the team to rapidly validate the product and determine whether or
not a hundred users will ever use the product, that path may be worthwhile,
especially given that it may take several versions of these prototypes to
find one that users love.
If your startup neglects or ignores tech debt for long enough, it can become a
major impediment to future progress. Teams can unintentionally find themselves
spending 80 or even 100 percent of their time sorting through system problems or
inefficiencies as a result of tech debt, a state known as tech debt bankruptcy.
Common signs that you're in tech debt bankruptcy include:
* You are regularly dealing with production outages to the point of material
business impact.
* You receive constant pushback or exaggerated timelines on new features due to the
need to deal with debt.
* The team complains that a codebase is too complex to get work done.
* New features cannot be shipped without accidentally breaking old features or
introducing an unacceptably high level of defects.
If you find yourself in tech debt bankruptcy, it's time to raise the alarm, reset
expectations with stakeholders, devise a plan to consolidate the debt, and begin
paying it down immediately.
If you've been honest with your peers in leadership (see Delivering Bad News, page
46), you should have the necessary credibility to explain the tech debt problem and
develop a common understanding of the ROI for an investment in resolving tech debt.
Unlike with a mortgage or car loan, there's no website you can visit that will give
you a statement of your exact amount of tech debt and remaining payments. Some
forms of debt can be measured quantitatively, but most of the analysis is
qualitative. For healthy and responsible debt management at scale, I recommend a
debt inventory survey.
The survey should be taken at regular intervals. Somewhere from one to four times
per year, do a sober analysis across the varying kinds of debt, producing an honest
assessment of where the team is operating. Don't take the survey independently;
rather, do so in collaboration with other engineers on the team who are working in
the code every day and interacting with the debt on a regular basis.
A survey can be as simple as this: for each of the following types of debt, rate
how much we have on a scale of 1 to 10, then provide a few sentences justifying the
score.
Use the results of the survey to inform how your team spends its energy paying down
debt, and compare results between surveys over time to ensure debt stays at a
reasonable level and your team is regularly solving its biggest debt pain points.
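To make the comparison between surveys concrete, here is a minimal Python sketch, assuming each survey is stored as a mapping from debt category to its 1-to-10 score. The category names (`test`, `infrastructure`, `dependency`) and the helper names `validate` and `compare` are hypothetical; substitute your own list of debt types.

```python
from typing import Dict, List, Tuple

# A survey maps a debt category (hypothetical names; use your own list)
# to a score from 1 (little debt) to 10 (severe debt).
Survey = Dict[str, int]


def validate(survey: Survey) -> None:
    """Reject scores outside the 1-10 scale."""
    for category, score in survey.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{category}: score {score} is outside 1-10")


def compare(previous: Survey, current: Survey) -> List[Tuple[str, int, int]]:
    """Return (category, current score, change since the previous survey).

    Rows are sorted worst-first so the biggest pain points lead the list.
    A category missing from the previous survey reports a change of 0.
    """
    validate(previous)
    validate(current)
    rows = [
        (category, score, score - previous.get(category, score))
        for category, score in current.items()
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    q1 = {"test": 7, "infrastructure": 5, "dependency": 4}
    q2 = {"test": 8, "infrastructure": 3, "dependency": 4}
    for category, score, change in compare(q1, q2):
        print(f"{category:<15} {score:>2} ({change:+d})")
```

Sorting worst-first keeps the highest-scoring categories at the top of the discussion, and the change column shows at a glance whether paydown efforts are working from one survey to the next.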
In order to decide when to pay down tech debt, you should first consider how much
of your team's time is worth spending on it. Key considerations include:
* How much debt exists, as indicated by your most recent debt inventory survey
* How much it is hindering your company's ability to run day to day (e.g., via
outages, customer churn, or defect rates)
* How much it is hurting your team's ability to deliver on new projects
* How difficult it will be to pay down the debt
If you are not in tech debt bankruptcy and your goal is to maintain a healthy level
of debt, I recommend allocating somewhere between 10 and 20 percent of your team's time
to investments in non-feature engineering (e.g., paying down debt, exploring new
patterns/proofs of concept, improving developer experience, etc.). The more severe
your current debt impact, and the higher the effort to pay down the debt, the
higher the percentage of your team's time you should allocate.
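As a rough worked example of turning that percentage into planning capacity, the sketch below converts a share of team time into engineer-days per sprint. The function name `debt_budget_days` and the six-engineer, ten-working-day sprint are hypothetical assumptions for illustration, not a prescribed formula.

```python
def debt_budget_days(engineers: int, sprint_days: int, percent: float) -> float:
    """Engineer-days per sprint reserved for non-feature engineering.

    engineers:   people on the team
    sprint_days: working days in one sprint (or any planning period)
    percent:     share of time set aside, e.g. 10 to 20 in a healthy state
    """
    if engineers < 0 or sprint_days < 0:
        raise ValueError("engineers and sprint_days must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return engineers * sprint_days * percent / 100


if __name__ == "__main__":
    # Hypothetical team: six engineers on a ten-working-day sprint.
    low = debt_budget_days(6, 10, 10)
    high = debt_budget_days(6, 10, 20)
    print(f"Reserve {low:g} to {high:g} engineer-days per sprint")
```

With those inputs, the 10 to 20 percent range works out to 6 to 12 engineer-days per sprint; a team that gives every engineer one day per ten-day sprint for debt lands at the 10 percent end.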
The most common way to handle tech debt is to pay it off on a just-in-time basis,
meaning the debt is paid off as part of a business-driven project. This will often
look like a team adding tech debt tickets that relate to the stories that have been
selected for a sprint in a planning meeting. This is a low-overhead,
low-planning-effort approach, and it can work out well. But be mindful of some
potential pitfalls:
* Just-in-time payments, by virtue of being less visible to the broader team, can
lead to systemic underinvestment in tech debt. As you make just-in-time payments,
be honest and transparent with the team about what total percentage of team time
you expect to spend on debt.
* Adding tech debt as part of a sprint can imply that investing in tech debt is a
secondary objective to the sprint goals, and thus likely to get cut from scope if a
team runs low on time.
* Tackling tech debt in a sprint may be perceived as slowing down the sprint or
causing delays, rather than as an investment in velocity and overall system health.
Periodic paydown is akin to how one might pay down a car loan or a mortgage. The
team makes space to pay off debt on a fixed interval (e.g., a day per sprint, a
couple of days per month, or a couple of weeks per quarter). Google famously
allowed their engineers 20 percent time to work on whatever they wanted, including
paying down debt or innovating on new projects and tools. The idea here is the
same: as a manager, you explicitly make time and encourage the team to make
investments into the tools and processes used to do engineering.
For example, the Shape Up method (see Tech Process, page 157) describes a two-week
cooldown period after a six-week cycle, or 25 percent of an eight-week period, for
making technical investments. Keep in mind that 25 percent isn't a magic number;
the right percentage will depend on your team's debt inventory.
Depending on how expensive debt is for your team, you may want to dedicate more
resources to overall system quality than a periodic strategy allows. This looks
like having a dedicated team (what I call a customer crew in a two-crew scenario;
see Project Maintenance: The Two Crews Philosophy, page 113) pay down tech debt as
part of their everyday work and objectives.
It's important to ensure that any team whose primary objective is internal
efficiency, such as a tech debt team or a customer crew, has clear and measurable
goals for their work. For example, if your debt inventory ranks test debt as your
highest debt category, then measure defect rates and code coverage and hold the
customer crew accountable for improving those metrics. If your infrastructure debt
is the largest, then focus on uptime and Mean Time to Recovery metrics.
When it comes to debt, good communication means clearly sharing your strategy for
keeping debt at a manageable level, being upfront and honest about when debt may
get in the way of business goals, and explaining how you'll pay it down so it's no
longer a blocker.
## Technology Roadmap
### Timeframes
Your short-term roadmap is what your team is working on now. This includes any
in-flight features, actively worked-on defects, tech debt, or urgent work items. For
more detail on managing your short-term roadmap, see Workflow in the Tech Process
section, page 157.
If you're the only technical manager on the team, then you are responsible for both
the medium- and long-term roadmaps. If you have directors or senior managers
reporting to you, you'll likely be collaborating on the medium-term roadmap. The
medium-term roadmap is a very useful artifact not only for your own planning and
organization but also as a tool to communicate with other departments/stakeholders
on what the engineering team is doing.
You can and should expect to update the actual duration of any given activity as
engineering progresses. Updating the number of weeks on a given task is a great
point in time to evaluate whether continued investment in a project makes sense,
and also to update external stakeholders on current completion estimates. Finally,
the roadmap is helpful as a retrospective tool for tracking how long major
initiatives took and for assessing, at a very high level, where a team is investing
time.
As the leader of your team, it falls on you to focus on the long-term health and
productivity of the team. You should spend time designing these goals and producing
a well-thought-out, clear document (or slide deck, video, wiki article, etc.) that
explains the goals to the team. Once you've set initial goals, revisit them
infrequently, as changing strategic goals causes churn in an organization. Just as
problematically, frequent changes in direction can be confusing and demotivating
for the team. I encourage you to provide an update on progress towards long-term
initiatives on a quarterly basis, both to the entire engineering team as well as to
other executive leaders.
*Language debt*
*Platform/architecture adoption*
*Hiring plans*
Every leader in sales, marketing, product, or support I've worked with has been
appreciative of transparency in the technical process and technical roadmap. By
contrast, I've spoken to leaders at some companies who describe their technical
teams as a black hole. It goes without saying that you don't want to be called a
black hole. Not being a black hole is simple; it just means having a regular process
somewhere in your organization that provides transparency to other leaders.
Ideally, you're also helping other departments feel heard by having a forum, or a
mechanism, to take input and incorporate that into the roadmap process. You can and
should close the loop with other departments by communicating back
to stakeholders where their request is in the development process and managing
expectations for when it will be completed.
## Tech Process
Conway's Law states that "organizations which design systems ... are constrained to
produce designs which are copies of the communication structures of these
organizations." Said another way, how you structure your teams, and importantly the
process of work within and between those teams, will have a significant impact on
the product you make. Teams working in information silos are unlikely to produce
products that beautifully integrate with another team's designs, so it's up to you,
as the overseer and ultimate architect of these communication structures, to ensure
those structures meet the needs of the product you're developing.
### Workflow
Technical work is a highly nuanced matter with thousands of minute decisions that
will affect how things ultimately interoperate and behave. To have any hope of
maintaining productivity within your organization you need a set of standards and
guiding principles to ensure the everyday technical decisions are broadly
consistent and thus manageable for the team. That means you need to actually set
those standards, train the team on them and have a day-to-day process to enforce
and modify them as necessary.
The pattern a team follows to decide what to build and how work gets done is
referred to as a workflow. The five most popular workflow patterns are:
* Agile
* Scrum
* Kanban
* Waterfall
* Shape Up
There are entire books written on these patterns, and my favorite is *Scrum: The
Art of Doing Twice the Work in Half the Time* by Jeff Sutherland.
There are some fundamental strengths and weaknesses of these approaches that I'll
discuss in this chapter; however, in the real world, the differences between the
processes are dwarfed by the impact of how well the manager implements the chosen
process. Your job as tech leader is to pick a process and ensure it's implemented
well and iterated on.
Whatever process you choose, it has to accommodate a few realities of engineering work:
* Nobody can perfectly predict how long any given engineering task will take.
* Engineering is rarely a straight line; building feature X may require putting
time into problem Y before X can be built.
* There is no such thing as a perfect specification; there are always gaps and
things to be discovered along the way in building technology.
#### Waterfall
The oldest workflow process, dating back to the 1950s, is waterfall (see
ctohb.com/waterfall). The waterfall model breaks down project activities into
sequential steps, where each step is dependent on and starts after the prior step
is completed. In software engineering that looks something like first having a
product vision, then doing product concepting, then product design, then software
development, and finally testing, deployment, and maintenance. The most common
criticism of waterfall is that this structure is rigid, inflexible, and doesn't
promote iterative development.
#### Agile/Scrum
Agile and SCRUM processes are more nuanced and prescriptive methodologies than
waterfall. There are many great resources that cover these nuances in detail,
including Sutherland's *Scrum*, *Agile Estimating and Planning* by Mike Cohn, and
*The Art of Agile Development* by James Shore & Shane Warden.
The key thing to realize about these processes is that they are guidelines, not
scripture. To get the best out of your engineering team, start with a process and
see how well it works for your particular group of people with your particular type
of technical challenges. Some teams have work that lends itself much more to
estimation and story pointing, while others have much more ambiguous brownfield
projects where estimation is near impossible. Pay attention to whether any
particular ceremony from the process is really adding value to the engineering
team, or if it's just a lengthy meeting everyone dreads.
Do not hesitate to skip ceremonies that are not obvious wins for the team. For
example, I find SCRUM's prescription for planning poker to be inefficient for most
teams.
#### Shape Up
Picture estimates as darts thrown at a bullseye. Something can be accurate without
being precise (a broad grouping of darts around or near the bullseye but not
hitting it), precise without being accurate (a tight grouping of darts that misses
the bullseye), and, of course, both accurate and precise (a tight grouping of darts
that does hit the bullseye).
You should expect and hold your team accountable for accurate but not necessarily
precise estimates for completing software development tasks. If today is the first
of the month, reasonable guidance from your team is, "We'll ship the feature this
month." If the team says, "We'll ship the feature on the 23rd," they're more likely to
miss that deadline.
There's no need to try to estimate hours or days per ticket if you plan your
work/resource allocation out by week, month, or quarter. Pay attention over time to
whether or not your estimates are actually giving you the planning capability you
hope for. If they're not, don't punish the team by continuing the process, or
worse, using it as a contributing factor in performance reviews. Instead, adjust
the estimation process so it helps rather than hurts. Change your expectations, and
instead of reacting to missed estimates, react to the challenges the team is
facing as they struggle to meet the estimates.
A final note on estimates: don't conflate missing estimates with poor total
output/velocity. Some teams will be highly effective, have high output, and still
miss estimates. Velocity is the more important metric, and a high output but
imperfectly estimating team should not be punished. Conversely, a team that
regularly misses estimates *and* has trouble delivering new value is
underperforming and needs to change.
SCRUM burndown charts show team progress against estimates and can be a great tool
for measuring sprint productivity. However, estimates are imperfect and, for
various reasons, a burndown chart may show a flat line or even burn up. This can be
because a team legitimately is not making progress, or it could be an artifact of
estimation issues or bad data collection.
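To make the "burn up" case concrete, here is a minimal Python sketch, assuming a hypothetical per-day log of completed and newly added story points (`DayLog` and `burndown` are illustrative names, not any tool's API). In the example, the team finishes 9 points over three days, yet the line rises from 20 to 24 because scope was added or re-estimated upward mid-sprint.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DayLog:
    """What happened to the sprint backlog on one day (hypothetical tracking format)."""
    completed_points: int  # story points finished that day
    added_points: int = 0  # scope added mid-sprint or estimates revised upward


def burndown(committed_points: int, days: list[DayLog]) -> list[int]:
    """Return remaining points at the end of each day: the y-values of a burndown chart."""
    remaining = committed_points
    series = []
    for day in days:
        # Completed work lowers the line; added or re-estimated work raises it.
        remaining += day.added_points - day.completed_points
        series.append(remaining)
    return series


sprint = [DayLog(3), DayLog(4, added_points=8), DayLog(2, added_points=5)]
print(burndown(20, sprint))  # [17, 21, 24]
```

Tracking `added_points` separately, as this sketch does, is one way to tell a team that is stalled apart from a chart distorted by scope changes.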
A burndown chart that burns up consistently, despite teams shipping and doing good
work, is demoralizing and not achieving the intended benefit. If there are easy
adjustments that will help you better capture data and fix the chart, make that
change. But if you find a particular way of measuring output still isn't working,
just get rid of it. It's okay to admit that your method of estimating these
particular types of stories with this team isn't precise enough and move on to
other methods of monitoring and improving performance.
I contend that which workflow you choose will not be a significant factor in the
ultimate success and velocity of your engineering team. The key factor is that you
are paying attention to your team's workflow and continuously iterating on the
workflow itself to ensure your patterns are adding value and are a good match, both
for your team and the types of problems your team faces.
That said, here is a rough model for thinking about which type of workflow is
likely to be a better starting place: well-understood work (i.e., tasks that are
concrete, greenfield, and easy to explain) is easier to manage and will generate
more benefit with a more nuanced or prescriptive planning process. Said another
way, if your work is ambiguous and hard to estimate, it's likely better managed
with Kanban than SCRUM.
Well-understood work tends to be:
* Greenfield, that is, new code that doesn't depend on legacy or
difficult-to-work-with external modules
* Not dependent on new patterns/tools/technologies, relying instead on the existing
(boring) tech stack
* Easily broken down from stories into smaller tasks
* Familiar to the team from previous work
A common practice for periodically paying down debt is the notion of a cooldown
sprint. Sometimes called a tech debt sprint, or innovation sprint, the idea is the
same: give the team time to clean up their digital workspace, do some code
housekeeping, and ensure that they and the code are in a good place for
high-velocity work going forward. As discussed in Strategies for Paying Down Debt, page
150, it's reasonable to dedicate anywhere from 5 to 20 percent of your total
development time to cooldown work.
If you're doing two-week sprints, that might mean that one in four or five sprints
is dedicated to cooldown.
When discussing the process for writing tech specs with engineers, I'm often asked,
"How do you have time to write specs?" Usually I counter with, "How do you have time
*not* to write tech specs?" Implied in my response is that taking time to think
through what you're building before you build it is a net time saver.
I recommend that you designate a lead for any project that needs planning: a single
person to be accountable for producing the technical specification and getting that
specification through your approval workflow.
That doesn't mean they're the only contributor. On the contrary, if other team
members are available during the planning window and have helpful knowledge, they
can and should contribute.
Planning can be synchronous (i.e., everyone is in a room for the whole time period)
or asynchronous. I recommend asynchronous planning, because much of the work in
planning will involve research (e.g., reading product documentation, reading code,
prototyping/proof of concepting, evaluating tools and APIs, etc.) that can be done
independently.
Your tech planning process should save you time in your initial implementation. It
should also save time in the future by minimizing tech debt and leaving behind
documentation that can accelerate future improvements.
Spending the wrong amount of time on planning defeats these goals: either it's too
short to save you time or produce good documentation, or it's so lengthy that it
doesn't pay for itself in savings.
There is no universal formula for the correct amount of time, but I'll provide a
rule of thumb: allocate one day to technical planning for every week of work you
estimate the project to take. In general, this will lead to between half-day and
three-day planning windows. If your project requires less than two days of work, it
likely needs very minimal planning effort and has low risk. Conversely, if you're
looking at a project that is expected to take more than three solid weeks of
development effort, you may not be able to efficiently plan something that large
all at once and should consider breaking it down.
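Because this rule of thumb is simple arithmetic, it can be written down directly. The following is a minimal Python sketch: the thresholds (under two days, over three weeks, one planning day per week) come from the text, while the five-day week, the rounding to the nearest half day, and the names `planning_budget` and `PlanningAdvice` are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

WORK_DAYS_PER_WEEK = 5  # assumption: a five-day work week


@dataclass
class PlanningAdvice:
    planning_days: float | None  # None means "split the project before planning"
    note: str


def planning_budget(estimated_work_days: float) -> PlanningAdvice:
    """Rule of thumb: one day of technical planning per week of estimated work."""
    if estimated_work_days < 2:
        return PlanningAdvice(0.0, "minimal planning; low risk")
    if estimated_work_days > 3 * WORK_DAYS_PER_WEEK:
        return PlanningAdvice(None, "too large to plan at once; break it down first")
    weeks = estimated_work_days / WORK_DAYS_PER_WEEK
    # Round to the nearest half day, never dropping below half a day.
    return PlanningAdvice(max(round(weeks * 2) / 2, 0.5), "one planning day per week of work")


for days in (1, 2.5, 10, 15, 20):
    print(days, planning_budget(days))
```

With these assumptions, a two-and-a-half-day project gets half a day of planning and a three-week project gets three days, matching the half-day to three-day window described above.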
If your team refuses to invest time into planning, you're likely pushing too hard
for results over process. The way to ensure the engineering process produces good
results for the business is not to crack the whip harder, but to establish a
healthy process that enables good results. You wouldn't speed up a structural
engineering team designing a bridge by having them work longer hours. You would
make sure they have the best bridge-designing tools available to them with the best
possible information about the span being bridged. Software engineering is no
different. But instead of using CAD software or real-world measurements of
soil/rock, we have product specifications, design process, and software tools.
Conversely, an overly lengthy planning process where team members insist on getting
every minute detail upfront can be an indicator of a serious cultural problem,
where team members are paralyzed by fear of making a mistake. Effective planning
won't eliminate risk, but thinking through important, high-level decisions in
advance can minimize it. A team that obsesses over details may be afraid of making
mistakes or unwilling to iterate on their work, both symptoms of overly
results-driven management. Individuals should not be punished for reasonable mistakes or
planning oversights. It's fine if a tech spec isn't perfect upfront; expect your
team to find mistakes or gaps during implementation, and update the spec when those
issues are found.
Often when writing a technical specification you'll have multiple options for how
to achieve a goal without any overtly compelling arguments on paper to go one way
or another. Or you'll discover unknowns about the effectiveness of a particular
option, which makes the decision ambiguous. If possible, especially if it can be
done efficiently, I encourage you to give your engineering teams space to prototype
one or more of these solutions to gather data for better planning decisions.
Half a day devoted to building a toy with a new tool to validate that the tool will
achieve the desired results upfront is half a day well spent.
### Tech Spec Content
Having a template that your team uses when starting to write a technical
specification is a great way to speed things up and ensure important topics are not
neglected. I recommend your template primarily be a sequence of headings with topic
areas to be covered, perhaps with a bit of instruction or reminder for tech spec
authors. I've included a sample tech spec at
[ctohb.com/templates](https://ctohb.com/templates).
A quick aside before jumping into content: technical documents can be a bit dry and
serious. If it aligns with your culture, I encourage you to inject lightheartedness
where it's appropriate and not distracting. A good example is a clever meme at the
top of the document that references the subject of the specification. In my
experience, it takes only a manager/leader making a meme once in a spec to
encourage others on the team (read: open the floodgates) to add their own.
I recommend your template cover the following:
* A reminder that the document is in fact a template, and authors should make a
copy before they start writing (this mistake is easy to make!)
* Guidance for how a specification should be thought about/a reference to company
specification guidelines and approval processes
* A background section explaining the business rationale for the project
* Any particularly standout areas of technical risk this project has (e.g., it
touches sensitive PII, or involves previously unused tools/architecture)
* A glossary/definition of any non-obvious terms
* Any explicit business goals the spec aims to achieve/correlation with previously
stated goals (i.e., quarterly KPIs or OKRs)
* A solution architecture overview (the bulk of the document)
* Tech debt: specifically, discuss why or why not to address any required/adjacent
debt
* Data modeling, including required updates to a database or data pipelines
* Internal and external reporting or analytics and measurement requirements
* Testing
* Deployment
* Feature toggles/flags
* Implications for overall system reliability or disaster recovery
* Security and privacy
* Deliverable milestones
To achieve the goal of ensuring consistency and alignment across projects and
between team members, you must ensure team members are reading and contributing to
each other's planning process. My recommendation to achieve this is to have a
lightweight approval process for a specification before it can be considered
complete.
To close out the process, the author should schedule a meeting whose attendees are
only those who have read the document and contributed in advance. The purpose of
the meeting is to review open questions and conflicts and come to a resolution. The
purpose of the meeting is not for the author to simply read the specification out
loud to a bored, uninterested audience. If there are open questions that require
further diligence to learn about and resolve, then do that offline and review the
results with only the interested parties afterward.
Once all open questions are resolved, document who contributed to the specification
(so a future reader knows to whom they should direct further questions), and
consider the document approved.
The technical leader or manager does not have to be the approver for all technical
documents. I encourage you to build a culture where the team as a whole feels safe
to contribute and doesn't rely on you to provide technical guardrails or support.
Early on, when the department is relatively small, you should be heavily involved
in most or all specifications, but that approach won't scale. As soon as you've
hired other senior individual contributors, architects, or managers, empower them
to be lead reviewers and defer to them, allowing them to do the job you hired them
for. If you find a senior member is not guiding the team well in these reviews,
don't walk all over them in a public forum. Discuss and course-correct with them in
private.
Your tech team is now spending time creating thoughtful documents that cover how
you're engineering your product, and the team should be making fewer mistakes as a
result. The last way that technical specifications help you is by providing useful
resources for future engineers who need to augment or modify the work that's been
done. I recommend that you create a well-organized and searchable directory (e.g.,
an internal wiki such as Confluence or Notion, or document storage like Google
Drive), and that your team be diligent about ensuring all specification documents
are added to the directory. It may also be helpful to link or refer to the
specification in code comments to explain why something is implemented the way it
is.
Developer experience (DX) may not always be measured on a dashboard, but when it's
designed poorly, the team knows it, and they may complain loudly about it. Bad
developer experience can derail an engineer for an entire afternoon. For example,
an attempt to boot up a microservice to test it throws a cryptic traceback, the
maintainer of the service is on vacation, and a mid-level engineer spins their
wheels for hours just trying to get to a reliable build-execute-test loop.
Multiply this inefficiency by all the engineers on your team and all the various
types of repositories, services, and projects that exist at your company and it can
quickly spiral into losing person-months of productivity in direct time. Add in
additional context-switching time spent bringing in others to help solve the
problems, and poor DX quickly goes to the top of the list of areas that, when left
unaddressed, can tank an otherwise high-performing engineering team.
Good DX rests on two foundations:
1. Tools that make it easy to have highly reliable and reproducible environments
and dependency chains
2. Documentation and consistency in practices for how things are done
Thankfully, nowadays, many readily available tools and ecosystems can help with #1.
Most programming languages have an ecosystem with standardized tools for dependency
management and reproducible environments. It's up to you to identify and use them
(e.g., npm, Pipenv). Many of these systems produce a file called a lock
file.
The lock file is not for concurrency management to avoid deadlocks; it's designed
to lock in place a specific instance of the dependency graph. You should be
committing these lock files and making sure other developers and any build systems
use them. The lock file helps guarantee that everyone on the team has installed an
identical set of dependencies.
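Package managers enforce this for you; for example, `npm ci` installs exactly what `package-lock.json` records and fails if that file is out of sync with `package.json`. As a minimal Python sketch of the underlying idea only (not how any package manager works internally), the check below compares hypothetical pinned versions against what is actually installed, using the standard library's `importlib.metadata`. The `LOCKED` contents and the function names are illustrative assumptions.

```python
from __future__ import annotations

from importlib import metadata


def installed_versions(names: list[str]) -> dict[str, str]:
    """Look up the installed version of each named package, skipping missing ones."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed; find_drift will report it as None
    return versions


def find_drift(locked: dict[str, str], installed: dict[str, str]) -> dict[str, tuple[str, str | None]]:
    """Return {package: (locked_version, installed_version_or_None)} for every mismatch."""
    drift = {}
    for name, pinned in locked.items():
        actual = installed.get(name)
        if actual != pinned:
            drift[name] = (pinned, actual)
    return drift


# Hypothetical pins, standing in for what a real lock file records.
LOCKED = {"requests": "2.31.0", "idna": "3.6"}
drift = find_drift(LOCKED, installed_versions(list(LOCKED)))
if drift:
    print("Environment does not match the lock:", drift)
```

Any non-empty result means a developer or build machine is no longer running the dependency graph the team agreed on, which is exactly the situation committing and using the lock file prevents.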
If your chosen programming language does not provide those tools, then it's up to
you to build that reproducibility, perhaps by using Docker containers, Makefiles, or
the like.
Often the difference between good DX and bad DX is twenty or thirty minutes of
upfront effort from somebody familiar with the codebase.
It doesn't take long to ensure that basic build commands work in a fresh install,
and that those commands are documented in a local README.
One opportunity for you as CTO to make this easier is to ensure that the build
commands used across repositories and codebases at your company are consistent.
Maybe it's always `docker compose up` or always `yarn run`. Whatever it is, any
developer should be able to `git clone` any repository and have the first command
that comes to mind to build and run the software just work.
As systems start to get larger it can become an increasingly sizeable chore to get
everything running locally together to test functionality. At this point it may be
worth investing in DX more formally on the roadmap, or even with dedicated
headcount, to ensure that tools are working and developers don't lose large chunks
of time fighting the system instead of writing productive code.
Here are a few easy wins to upgrade DX across your software engineering team:
* Ensure that lint configuration is checked into source control where possible
(e.g., by setting up a shared VS Code settings.json file; see
ctohb.com/vscode).
* Invest time in making sure that local test data can be set up in local databases
from scratch. Often a quick data generator or seed data script can short-circuit a
lot of developer headaches. Better yet if the seed data can be easily augmented to
add additional corner cases/use cases as the system evolves, so that the base set
of test data can be as comprehensive/representative as possible.
* Develop a plan for how to either mock or actually spin up dependent services
locally to test multiple-service interactions when necessary. Ideally, with good
contracts and domain-driven design, the need for this will be rare, though it
should still be easy when necessary.
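A seed script does not need to be elaborate. Here is a minimal Python sketch using the standard library's `sqlite3` as a stand-in for whatever database your team actually runs; the `users` schema, the rows, and the `seed` function are hypothetical. It rebuilds the data from scratch, so it is safe to run repeatedly, and keeps corner cases in their own list so the team can append to it as new ones are discovered.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
BASE_USERS = [
    ("alice@example.com", "Alice", "active"),
    ("bob@example.com", "Bob", "active"),
]
# Corner cases get appended here as they're discovered, so every developer
# reproduces them locally from a fresh database.
CORNER_CASE_USERS = [
    ("o'brien@example.com", "Seán O'Brien", "active"),  # apostrophe, non-ASCII name
    ("suspended@example.com", "", "suspended"),         # empty display name
]


def seed(conn: sqlite3.Connection) -> None:
    """Rebuild the users table from scratch; safe to run repeatedly."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS users;
        CREATE TABLE users (
            email TEXT PRIMARY KEY,
            display_name TEXT NOT NULL,
            status TEXT NOT NULL CHECK (status IN ('active', 'suspended'))
        );
        """
    )
    conn.executemany(
        "INSERT INTO users (email, display_name, status) VALUES (?, ?, ?)",
        BASE_USERS + CORNER_CASE_USERS,
    )
    conn.commit()
```

A developer with a fresh clone could then run something like `seed(sqlite3.connect("dev.db"))` and immediately have representative data, including the awkward cases that previously caused bugs.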
In 2022, Stripe, the fintech decacorn (i.e., a company valued at more than $10
billion), decided that Flow, the JavaScript type checker its codebase relied on, had
become too expensive to use. It was using too much memory, locking up laptops, and
integrating poorly with developer IDEs, so Stripe migrated its codebase from Flow to
TypeScript.
The lesson here is that if the pain of poor developer experience is severe enough,
then almost no cost is too high and no project is out of reach to make improvements.
Your team is almost certainly smaller than Stripe's, and you're likely not dealing
with millions of lines of code, but the same calculus applies: if your team is
encountering friction in DX that is slowing it down, you must invest the necessary
developer time and effort to improve it to gain that efficiency back.
Another problem teams often face is changing tooling too often. In certain tech
ecosystems (particularly the JavaScript world), it seems something new and shiny
comes out every month that could provide a productivity boost for your team. I
encourage you to be disciplined about adopting new tools: make sure you've spent
the time to really understand the pain that exists, diligence the new tool, see if
it meets *all* your requirements (not just the shiny headline), and make decisions
accordingly. For more on my recommended process here, see Implementing Internal
Technology Radar, page 204.
# Tech Architecture
One of your key responsibilities as a tech leader is to make good decisions on your
architecture and tools. Good architecture aligns the strengths of the tools and
patterns you choose with the needs of your organization now and in the foreseeable
future. That requires understanding the strengths, weaknesses, and tradeoffs
inherent in each choice. My goal in this section of the book is to make you aware
in general of the landscape of options in various domains, and help you recognize
the general tradeoffs that different strategies entail.
One thing to keep in mind when discussing tools and tool choice with your team:
engineers can be emotional about tool choice. Tools get labeled good or bad,
and people have personal likes, dislikes, and biases. As the leader and decision-
maker, I strongly caution you against adopting this style of language when
discussing tools. Not only can it potentially alienate team members if you're
disparaging their personal favorite tool; it's also unproductive and can distract
from the goal of identifying a good solution for your problem.
Some individual tools are genuinely poorly designed and overshadowed by superior
alternatives. However, more often than not a more nuanced evaluation will reveal
that a given tool isn't inherently bad, but rather appropriate or inappropriate for
a particular company or project. Don't let one bad past experience of trying to use
a tool that was inappropriate for solving one problem prevent you or your team from
using it another time when it may prove a better fit.
## Architecture
There are many excellent resources that explore various architectural patterns
deeply; one of my favorites is Martin Fowler's *Patterns of Enterprise Application
Architecture*. In this chapter, I'll provide a summary of some key phrases you'll
hear so you have context when exploring these topics in depth elsewhere.
**Bounded context:** The boundary within which the domain model applies and where
the ubiquitous language is used.
#### High-Level Patterns
When somebody uses the phrase "technical architecture," they are usually referring
to how code is executed and how information moves around in a system. Most
descriptions of architecture involve the phrases "services," "monoliths," or
"message transports." This is in contrast to coding patterns, in which phrases such
as "object-oriented," "functional programming," or "dependency injection" appear
frequently. Coding patterns may sometimes be called "code architecture" and are discussed in
Coding Patterns, page 188.
The key to building a successful monolith is to carefully design the data flows
within the application, using domain-driven design. You can measure this pretty
easily; you want to ensure that when a developer goes to change the functionality
of the application, it is obvious where in the monolith they should be working.
They should only need to change code in an obvious and well-defined or confined
area to achieve their goal. Every additional area of the codebase that needs change
to meet a functional requirement adds additional complexity or opportunity for
error, and in general slows down development.
The phrase service-oriented architecture (SOA) originated in the 1990s and is used
to refer to some fairly specific technology choices. Nowadays, the phrase is used
to more broadly describe a system where information moves between parts of the
system over a network. The main tradeoff with an SOA is that, in comparison to a
monolith, it can be very complex to think about and requires a team to do a lot of
setup and thoughtful design to truly ensure that the benefits outweigh added
complexity.
If you do find yourself contending with an unruly monolith, this doesn't mean your
engineers are bad at their jobs. The nature of software engineering is that
requirements change and systems evolve. Maintaining a monolith may mean, at times,
investing considerable resources into updating the system design to evolve as well,
and it is when a team fails to make this investment that monolith complexity
becomes a barrier to productivity.
Some situations do call for moving functionality into its own service:
* Your application has elements that need to be scaled independently. For example,
one feature consumes lots of CPU resources and you don't want it to interfere with
other features, or it's more cost-effective to scale that one piece independently
than to pay to scale up all features.
* You're working on functionality that needs to expose its own independent API and
has its own exclusive data domain apart from the main system. Especially if this
API is meant to serve external customers, having this functionality live as its own
service is an obvious choice.
* For some reason, you need to use another programming language as part of your
application. For example, there may be a robust, high-quality framework for solving
a certain kind of problem in Python while the rest of your application is in Java.
Bridging these two languages in memory is possible, but clunky. The easier option is
to bridge them via an API, leaving them naturally hosted as separate services.
* Deploying your monolith is overly expensive, slow, or risky. In this case, you can
enable additional productivity and reduce time to deploy by deploying new code as
an independent service. Just ensure that the new service operates independently of
the monolith and you're not creating new deployment dependencies.
##### Source Control for Service-Oriented Architectures: monorepo and manyrepo
If you choose to manage multiple services as a monorepo you'll likely want to look
for a workspace management solution (e.g., yarn workspaces for the JavaScript
ecosystem) to manage building the projects separately. Here are some basic
differences between the monorepo and manyrepo approaches:
**Monorepo**
Because all the code lives in one repository, it's easy to ensure every service or
package dependency is up to date with the latest version.
Many CI systems do not support multiple packages in a single repo natively, so you
have to build a harness manually to support this.
**Manyrepo, by contrast**
Requires using a central package manager with version control. This isn't
necessarily a bad thing, but it can lead to significant overhead when working on a
project and its dependencies simultaneously.
If you notice your team falling into these patterns or complaining about
coordinating releases between services, this should be a red flag for you to look
closer and consider paying down some tech debt to get back to independently
deployable services. That tech debt is usually located in your contracts, the
design of your APIs, and how data is handled in your system.
In a professional environment, the principal audience for any given line of code is
not the computer but the developer who has to read that code at some point in the
future for further development. This is the golden rule of programming: engineers
should be writing code with the same level of readability that they expect of
anyone else's code.
Per the golden rule of programming, your choice of language should enable your team
to write code that is highly readable and maintainable. In general, a good engineer
can do that in any language; however, some languages make it easier than others to
do so consistently. Some other considerations for what language or ecosystem to
choose:
* How large is the talent pool that is familiar with that language, and more
specifically is familiar with that environment and also interested in startups like
yours?
* Are there existing implementations that you can use as a starting point?
* Do you have particular performance or scaling requirements? Some languages are
much faster than others for specific types of tasks. Haskell is famously
inefficient at string manipulation, and C is famously fast at most things, though
there are other languages that, for certain problems, approach or exceed the speed
of C while providing an easier and more friendly coding environment.
* Is there a particular framework that might be a good starting point in a
particular language? React Native, for example, is a powerful cross-platform mobile
framework that requires JavaScript or TypeScript.
In the enterprise setting, I recommend languages with static type systems, such as
Golang, TypeScript, Rust, etc. With a static type system the compiler does more of
the heavy lifting for ensuring code correctness. You should strive for a local
development environment where the tools are finding errors before your code is
executed. Fixing a compile time issue is in general much faster and cheaper than
fixing a runtime issue.
In any widely used language, there will be either a published standard for how code
should be formatted (e.g., PEP8 in Python) or a configurable tool that can enforce
a particular code style and formatting (e.g., ESLint or Prettier in JavaScript, or
ReSharper in C#). Most of these tools are very good at ensuring that code,
regardless of who wrote it, is stylistically identical. In the spirit of ensuring
your codebase is readable, there is no excuse for not using one of these tools and
ensuring 100 percent of your codebase is formatted according to the same rules.
Which rules you use is entirely up to you and your team, but make sure they're
applied consistently and produce a readable result.
Modern static code analysis is capable of identifying and alerting on a wide range
of common code issues, ranging from security gaps to outright bugs to stylistic
inconsistencies. These tools are fairly inexpensive and integrate neatly with a
wide range of commonly used continuous integration systems and developer IDEs. From
experience using these tools on a range of projects and programming languages, the
signal-to-noise ratio is very good, and the output is a net gain in productivity
and software quality. Relatively early on in your software project's life, you
should integrate static code analysis. I encourage you to look at tools that are
specific to your programming language of choice (e.g., ESLint for JavaScript) as
well as generic analysis platforms such as SonarCloud, Codebeat, Scrutinizer-CI,
Code Climate, or Codacy.
The largest risk in brownfield development is not-invented-here syndrome: the
tendency for individuals to avoid taking responsibility for or
paying sufficient attention to things they did not create themselves. In brownfield
software development, this can lead to systematic underinvestment in understanding
existing work, causing frustration and inefficiency in augmenting or modifying
existing systems. I strongly encourage managers to make explicit space for a team
to read and understand an existing system before asking them to modify it in any
way. The time spent in comprehension upfront will be paid back by fewer surprises
and faster velocity down the road.
The subject of what style of code to write is a religious discussion for many
coders. My intention in this chapter is to provide a brief description of what the
most common coding-pattern terms mean, and to refer you to more extensive resources
on each practice.
If you're faced with what feels like an emotional conversation on this topic, keep
in mind that successful companies exist using each of these patterns. Everything is
a tradeoff. A bad programmer can make a mess with any tool, and, conversely, a great
programmer will find a way to write a readable solution even with suboptimal tools.
#### Purity
Code that is pure has no external dependencies or side effects: given the same
inputs, a pure piece of code will always produce the same outputs, and it does not
modify anything outside itself. The advantage of pure code is that it is easily
testable and requires no external setup or mocking. Pure code is also easier to read
and understand, as you don't need to read any additional code to understand what it
does. A simple example of pure code is a function that sums two numbers; given the
same two input numbers, the sum function always produces the same output.
Some code is inherently impure: for example, code that interacts with the outside
world, such as a filesystem, network, or database. For most other scenarios, it's
possible to model business logic in a pure way. Where possible, I encourage you and
your team to write pure code.
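As a minimal sketch in Python (the late-fee rule, function names, and parameters are all hypothetical), here is a pure business rule next to a thin impure wrapper that reads the system clock and delegates the actual logic to the pure function:

```python
from datetime import date


def late_fee_cents(days_overdue: int, daily_rate_cents: int, cap_cents: int) -> int:
    """Pure: the result depends only on the arguments.

    No clock, no database, no globals, so it can be tested with plain asserts."""
    if days_overdue <= 0:
        return 0
    return min(days_overdue * daily_rate_cents, cap_cents)


def late_fee_cents_as_of_today(due: date, daily_rate_cents: int, cap_cents: int) -> int:
    """Impure: reads the system clock, so the answer changes from day to day.

    It stays as thin as possible and delegates the business rule to the pure function."""
    days_overdue = (date.today() - due).days
    return late_fee_cents(days_overdue, daily_rate_cents, cap_cents)
```

Only the wrapper needs a controlled clock in tests; the rule itself can be checked directly, e.g., `late_fee_cents(3, 100, 1000)` returns `300`.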
To stick with the parts-of-speech model for describing coding patterns, functional
programming treats verbs (functions) as a first-class part of the system. Most
functional programming starts with very small pieces of functionality and composes
them together to create more sophisticated systems. When it's done well, the benefit
of functional code is that it tends to be more pure, and thus easier to read, reason
about, and test in isolation. There are even academic examples of functional code
that can be formally verified, meaning one can produce a mathematical proof that the
code runs correctly.
Functional programming, done poorly, can create very verbose and hard-to-read code.
For example, when composing together multiple functions, it's important to consider
how many functions are being composed, and how obvious the behavior of each
function is in the composition chain.
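The following Python sketch illustrates composing tiny functions into a larger one. `pipe` is a hypothetical helper written here (not a standard-library function) that applies functions left to right; keeping each step small and obviously named is what keeps the chain readable.

```python
from functools import reduce
from typing import Callable


def pipe(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Compose single-argument functions left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def strip_whitespace(s: str) -> str:
    return s.strip()


def to_lowercase(s: str) -> str:
    return s.lower()


def collapse_spaces(s: str) -> str:
    # Replace any run of whitespace with a single space.
    return " ".join(s.split())


# Each step is tiny and pure; the composed function is pure as well.
normalize_display_name = pipe(strip_whitespace, to_lowercase, collapse_spaces)

print(normalize_display_name("  Ada   LOVELACE "))  # prints: ada lovelace
```

A chain of three well-named steps reads almost like prose; a chain of fifteen anonymous lambdas does not, which is the verbosity trap described above.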
The term Domain-Driven Design comes from Eric Evans's book *Domain-Driven Design*,
published in 2003. The core idea is to create a model (be it for objects in
object-oriented design or a schema for your database) that models the nouns in your
business domain. This may seem simple and intuitive; however, with complex business
domains, it's easy for code either to model the domain inconsistently or to model it
in a way that hinders the team's comprehension. Especially with larger and more
complex problems, I always insist that the team sit down and agree on a consistent
way to model the problem, using consistent terminology to refer to the same concepts
across the entire system.
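As an illustration only, here is a Python sketch of a shared domain vocabulary for a hypothetical appointment-scheduling business. The point is that `Patient`, `Provider`, and `Appointment` (all invented names) mean one thing everywhere in the system, and rules such as cancellation live on the noun they belong to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class AppointmentStatus(Enum):
    BOOKED = "booked"
    CANCELLED = "cancelled"
    COMPLETED = "completed"


@dataclass(frozen=True)
class Patient:
    patient_id: str
    name: str


@dataclass(frozen=True)
class Provider:
    provider_id: str
    name: str


@dataclass
class Appointment:
    appointment_id: str
    patient: Patient
    provider: Provider
    starts_at: datetime
    duration: timedelta
    status: AppointmentStatus = AppointmentStatus.BOOKED

    def ends_at(self) -> datetime:
        return self.starts_at + self.duration

    def cancel(self) -> None:
        # The domain rule lives with the domain noun, not scattered across handlers.
        if self.status is AppointmentStatus.COMPLETED:
            raise ValueError("a completed appointment cannot be cancelled")
        self.status = AppointmentStatus.CANCELLED
```

Whatever names your team chooses, the same words should appear in code, database tables, API fields, and conversation.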
Countless decisions go into every element of building an API. What separates good
APIs from bad APIs is the consistency, predictability, and correctness of those
decisions. As a technical leader, the job falls on you to make sure that, across
your organization, you have a structure in place to help developers build APIs that
are consistent with one another, predictable in that they use common patterns that
are appropriate for the problem at hand, and correct.
Achieving these goals requires some form of governance system. This can range from
a set of clearly documented guidelines and standards to a group of people who are
responsible for reviewing and approving all APIs on a regular basis. The larger
your team, the more time and effort you'll need to invest in process and governance
to maintain a high standard.
Out in the wild, you're likely to encounter two main types of APIs: HTTP-based and
non-HTTP-based. As with any tool, HTTP has its tradeoffs and isn't ideal for every
job, so if your business requirements dictate ultra-low latency, ultra-high
throughput/low overhead, or real-time streaming, you'll likely be looking for
something beyond HTTP. Below I discuss a handful of HTTP API types and then briefly
cover some non-HTTP APIs you're likely to run into.
In the early 2000s, the most common API pattern was the XML-based Simple Object
Access Protocol (SOAP). SOAP and other XML-based API styles are well and truly out
of fashion with startups in the 2020s, but they are still prevalent in legacy
systems, especially from larger companies in technologically slow-moving
industries. You should not be building new SOAP or XML-based APIs.
*REST*
REST is likely the most common form of API you'll encounter. REST has a broad and
robust tool ecosystem and nearly every engineer is familiar with it.
*GraphQL*
GraphQL is similar to REST in that it's JSON over HTTP; however, it does not rely
on HTTP verbs. Nearly every GraphQL request is a POST, and GraphQL uses a structured
schema of queries and mutations.
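To make the POST point concrete, here is a Python sketch that builds (but does not send) a typical GraphQL-over-HTTP request with the standard library: a JSON body containing `query` and `variables`, POSTed to a single endpoint. The endpoint URL and the `user` query are hypothetical.

```python
import json
import urllib.request

GRAPHQL_ENDPOINT = "https://api.example.com/graphql"  # hypothetical endpoint

QUERY = """
query GetUser($id: ID!) {
  user(id: $id) {
    id
    name
  }
}
"""


def build_graphql_request(query: str, variables: dict) -> urllib.request.Request:
    # Reads and writes alike travel in a JSON body, so the verb is POST
    # and the URL is the same for every operation.
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(QUERY, {"id": "42"})
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Because the operation lives in the request body rather than in the URL, HTTP caches that key on GET URLs have nothing to work with, which is one source of the caching friction GraphQL can run into.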
There's a lot to be said for the benefits of building a graph to model your
company's data, and for the good habits that being forced to design a schema brings
about. That said, no system comes without tradeoffs. Because GraphQL forgoes
standard HTTP verbs, it does not play nicely with some elements of the web stack:
standard HTTP caching (which is built around GET requests) and some developer
tooling are still catching up to GraphQL requests. If those drawbacks are not a
significant concern for your business, I strongly encourage you to check out
apollographql.com and consider using GraphQL, especially for internal APIs.
*Queueing systems*
A queueing system maintains an inbox (or set of inboxes) to receive messages and an
interface for a consumer to read messages with certain guarantees.
A typical queueing system can guarantee message order, either first in, first out
(FIFO) or last in, first out (LIFO), as well as at-least-once or at-most-once
delivery. Most cloud platforms have hosted implementations of queues, such as AWS
Simple Queue Service (SQS) or Google Cloud Tasks.
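As a toy in-memory sketch (not how SQS or any real service is implemented), the Python class below shows FIFO ordering plus at-least-once delivery: a received message stays "in flight" until the consumer acknowledges it, and anything unacknowledged is redelivered. `requeue_unacked` is a hypothetical stand-in for a visibility timeout expiring.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    message_id: int
    body: str


class AtLeastOnceQueue:
    """Toy FIFO queue with at-least-once delivery semantics."""

    def __init__(self) -> None:
        self._ready: deque = deque()
        self._in_flight: dict = {}
        self._ids = itertools.count(1)

    def publish(self, body: str) -> int:
        message = Message(next(self._ids), body)
        self._ready.append(message)  # FIFO: new messages go to the back
        return message.message_id

    def receive(self) -> Optional[Message]:
        if not self._ready:
            return None
        message = self._ready.popleft()  # oldest message first
        self._in_flight[message.message_id] = message
        return message

    def ack(self, message_id: int) -> None:
        # Acknowledged messages are never delivered again.
        self._in_flight.pop(message_id, None)

    def requeue_unacked(self) -> None:
        # Return unacknowledged messages to the front, oldest first.
        for message in sorted(self._in_flight.values(),
                              key=lambda m: m.message_id, reverse=True):
            self._ready.appendleft(message)
        self._in_flight.clear()


queue = AtLeastOnceQueue()
queue.publish("charge order 1")
queue.publish("charge order 2")
first = queue.receive()   # consumer crashes before acking...
queue.requeue_unacked()   # ...so the same message is delivered again
again = queue.receive()
print(first == again)     # prints: True
```

The duplicate delivery at the end is exactly why consumers of an at-least-once queue must tolerate seeing the same message more than once.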
Queueing systems often have a notion of *explicit* invocation, which is to say that
when a publisher creates a message, it explicitly specifies how the request should
be handled or executed. By contrast, most publisher-subscriber (pub/sub) systems
support *implicit* invocation. This means publishers do not necessarily know
beforehand which system will handle the message, only that the pub/sub system will
deliver it.
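The implicit style can be sketched in a few lines of Python. In this hypothetical in-process `PubSub` class (no delivery guarantees, acknowledgements, or persistence), the publisher names only a topic; which handlers run is decided entirely by who has subscribed.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict], Any]


class PubSub:
    def __init__(self) -> None:
        self._subscribers: defaultdict = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # The publisher does not know (or care) who receives the message.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = PubSub()
sent_emails = []
bus.subscribe("user.signed_up", lambda m: sent_emails.append(m["email"]))
bus.subscribe("user.signed_up", lambda m: print("analytics:", m["email"]))

delivered = bus.publish("user.signed_up", {"email": "ada@example.com"})
print(delivered)  # prints: 2
```

Adding a third consumer means adding a `subscribe` call; the publishing code does not change.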
The pub/sub pattern and the guarantees it provides are extremely powerful. However,
the tradeoff is that implementations require some care and attention to detail to
realize the advertised guarantees. Implementing a subscriber, for example, requires
paying close attention to message acknowledgement semantics and carefully managing
topic subscriptions to ensure the right messages go to the right place.
*Job Systems*
Jobs, or cron jobs, are a type of backend API that is rarely triggered by a
publisher or client, but instead by some form of timer. Common examples include
nightly data cleanup tasks or sending weekly email summaries/notifications. Some
best practices for jobs:
* Use a job system maintained by somebody else; don't build one yourself.
* When choosing a job system (or building one yourself, if you must), ensure that
  it:
  - has logging for every job execution;
  - allows for configuring the retrying of jobs that fail;
  - provides notification when jobs fail. It's very common for engineers to set up
    a scheduled job, watch it work on day one, and then on day fifteen it fails and
    nobody notices until day thirty;
  - provides an interface to view jobs and job status;
  - allows for job configuration to be stored as code or configuration in source
    control;
  - allows for jobs to be run inside your environment/private networks/security
    groups as necessary to access other internal system APIs/resources;
  - integrates with your secret management system; and
  - allows for easily setting up jobs locally as well as in development and
    production environments, and for easy testing in each of those environments.
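To make the checklist concrete, here is a minimal Python sketch of three of those behaviors (logging every execution, configurable retries, and a failure notification) wrapped around a single job. It is meant to show what to look for in an off-the-shelf job system, not to encourage building your own; `run_job` and the `notify` callback are hypothetical names.

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("jobs")


def run_job(
    name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> bool:
    """Run a job with retries, log every attempt, and notify if it ultimately fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            logger.info("job=%s attempt=%d status=success", name, attempt)
            return True
        except Exception as exc:  # a real system would distinguish retryable errors
            logger.warning("job=%s attempt=%d status=failed error=%s", name, attempt, exc)
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff between attempts
    notify(f"job {name} failed after {max_attempts} attempts")
    return False
```

In practice, `notify` would post to your alerting or chat tool, so a job that breaks on day fifteen gets noticed on day fifteen.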
### Documentation
Having thorough, clear, and current documentation for your API is just as critical
as how you build and maintain it. Some key characteristics of great API
documentation:
It's always a good idea to build your API using a system that includes API
documentation generation. Doing otherwise means it'll be practically impossible to
meet all of these goals on a consistent basis. If you're building a REST API, I
strongly encourage you to design your API using OpenAPI (a YAML or JSON document
that describes your API). In most languages there are SDKs to consume an OpenAPI
spec and automatically generate controllers/routes to match the spec and/or
generate a test harness to ensure the implemented API matches the spec. In
addition, there are online tools, such as stoplight.io and readme.com, that can
consume OpenAPI documents and generate aesthetically pleasing and easy-to-navigate
documentation.
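For a sense of what such a document contains, here is a minimal OpenAPI 3.0 description of a single hypothetical endpoint (`GET /widgets/{widgetId}`), built as a Python dictionary and serialized to JSON so it could be fed to documentation generators or validators. The API title, path, and fields are invented for illustration.

```python
import json

openapi_document = {
    "openapi": "3.0.3",
    "info": {"title": "Widgets API", "version": "1.0.0"},
    "paths": {
        "/widgets/{widgetId}": {
            "get": {
                "summary": "Fetch a single widget",
                "operationId": "getWidget",
                "parameters": [
                    {
                        "name": "widgetId",
                        "in": "path",
                        "required": True,  # path parameters must be marked required
                        "schema": {"type": "string"},
                    }
                ],
                "responses": {
                    "200": {
                        "description": "The requested widget",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "properties": {
                                        "id": {"type": "string"},
                                        "name": {"type": "string"},
                                    },
                                    "required": ["id", "name"],
                                }
                            }
                        },
                    },
                    "404": {"description": "No widget with that ID exists"},
                },
            }
        }
    },
}

print(json.dumps(openapi_document, indent=2))
```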
If you're using GraphQL, the GraphQL Playground or Apollo Studio explorer can
provide a reasonable stand-in for extensive type documentation. I do recommend you
still build a separate API documentation page, either using a tool like readme.com
or creating something by hand, to act as a primer or getting started guide. The
built-in GraphQL documentation lacks a description of how authentication works, and
it also does a poor job of providing space to explain the relationships between
data in your API.
Another benefit of using either OpenAPI or GraphQL is that the resulting API
specification is portable not only to documentation generators and test frameworks
but also to developer IDEs such as Insomnia or Postman. These IDEs enable
developers to quickly interact with an API to validate functionality without
writing code. Formal specifications can also be used with code generation tools to
ensure typing consistency in code.
### Idempotency
An API request is said to be idempotent when making the same request multiple times
has the same effect as making it a single time. Idempotency is an important concept
in building robust systems and avoiding data corruption. As with all things,
idempotency gives you useful guarantees about a system but it comes with a cost:
implementing idempotency adds complexity to backend systems.
In REST APIs, it is widely assumed that GET, PUT, and DELETE requests are idempotent
and POST requests are not (PATCH is not guaranteed to be idempotent either). GET
requests, for example, should always return the same result for the same input
(unless the underlying data changes, of course). In general, PUT requests modify
existing objects and should naturally be idempotent. Multiple calls to a POST
request, however, in most systems signal the intention to create multiple objects.
For HTTP POST requests in REST and for GraphQL mutation APIs, idempotency is not
provided by the standard/specification. If you want a client to be able to retry
these kinds of requests and have idempotent behavior, you should implement the
idempotency key pattern. An idempotency key is an arbitrary string, provided by the
client (either as an HTTP header or perhaps in GraphQL as an input variable), that
backends use to de-duplicate incoming requests. This requires the backend to store
the idempotency key, and also store the response to a request with that key, to be
provided to clients later on.
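Here is a minimal in-memory Python sketch of the idempotency key pattern for a hypothetical "create payment" endpoint. It stores each key alongside the original request and response, replays the stored response on a retry, and (a common but optional design choice) rejects reuse of a key with a different payload. A production version would persist keys in the same transaction as the write, expire them after some window, and handle two concurrent requests carrying the same key.

```python
import uuid
from typing import Dict, Tuple


class PaymentsBackend:
    """Hypothetical backend demonstrating idempotency keys (in memory only)."""

    def __init__(self) -> None:
        self.payments: Dict[str, int] = {}
        # idempotency key -> (original request, original response)
        self._seen: Dict[str, Tuple[dict, dict]] = {}

    def create_payment(self, idempotency_key: str, amount_cents: int) -> dict:
        request = {"amount_cents": amount_cents}
        if idempotency_key in self._seen:
            original_request, original_response = self._seen[idempotency_key]
            if original_request != request:
                raise ValueError("idempotency key reused with a different request")
            return original_response  # a retry: no second payment is created
        payment_id = str(uuid.uuid4())
        self.payments[payment_id] = amount_cents
        response = {"payment_id": payment_id, "amount_cents": amount_cents}
        self._seen[idempotency_key] = (request, response)
        return response


backend = PaymentsBackend()
key = "client-generated-key-123"  # e.g., a UUID the client creates per logical action
first = backend.create_payment(key, 5_000)
retry = backend.create_payment(key, 5_000)  # e.g., the client timed out and retried
print(first == retry, len(backend.payments))  # prints: True 1
```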
Most startups have at least three different kinds of data they use as part of their
business:
* Transactional data
* Business intelligence data
* Behavioral data
Each of these types of data will come in different volumes, have different
read/write patterns, and require different tools to visualize and glean insights.
A quick note on the phrase *big data*. As a startup, the chances are very good that
you do *not* have big data in the sense that it needs to be architected with
infinity-scale (or web scale) in mind. Typical off-the-shelf databases with
reasonable quantities of hardware and half-decent data model design are more than
capable of handling tens of millions of rows and hundreds of gigabytes of data with
acceptable performance. Most big data solutions, such as data pipelines or data
warehouse appliances, involve significant added setup complexity, latency, and
cost, and they're likely overkill for your startup. For the sake of simplicity, big
data solutions should only be considered if you can make a compelling argument that
a regular (e.g., PostgreSQL) database cannot do the job. Said another way, don't
prematurely optimize your database architecture.
### Transactional Data
Transactional data is the data that powers your application itself, typically your
primary NoSQL or SQL database. Transactional data requires very low latency and
high availability, and is modest in total size compared to the other forms of data.
My recommendation is to choose an off-the-shelf SQL or NoSQL solution, preferably
something hosted for you such as MongoDB Atlas or Google Cloud SQL. Some nice-to-
haves in your production database:
Business intelligence (BI) is data that is used to gain insight into behaviors of
your users, usually sourced from your transactional data. Early on, you can often
get away with running business intelligence queries directly on your transactional
database. As the size of data and query complexity increase, this becomes
problematic, because it adds load to a system that requires high availability and
low latency. The natural solution then is either to query a read-only replica of
your transactional database or to copy/transform the data to another data storage
system via a data pipeline.
Building data pipelines and data warehousing is an entire book unto itself, and the
state of the art is always evolving. I have just a few high-level bits of advice:
Today, a startup doesn't need to build and host sophisticated data pipeline
architectures. ELT (Extract, Load, Transform) and ETL (Extract, Transform, Load)
tools can now run entirely inside an enterprise data lake/warehouse, and tools such
as dbt provide reproducibility, testability, and pipeline-as-code capabilities,
making data pipelines much more manageable to run.
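The ELT idea (land raw data first, then transform it with SQL inside the database) can be sketched with Python's built-in `sqlite3` module standing in for a warehouse. The table and column names are hypothetical; in a dbt project, the transform step would be a version-controlled `SELECT` model rather than an inline string.

```python
import sqlite3
from typing import Iterable, Tuple

RawOrder = Tuple[str, str, int, str]  # (order_id, customer_id, amount_cents, status)


def run_elt(raw_orders: Iterable[RawOrder]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")

    # Extract + Load: land the source rows as-is, without reshaping them.
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer_id TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

    # Transform: derive a reporting table inside the database using SQL.
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer_id,
               COUNT(*)          AS paid_orders,
               SUM(amount_cents) AS revenue_cents
        FROM raw_orders
        WHERE status = 'paid'
        GROUP BY customer_id
        """
    )
    conn.commit()
    return conn


conn = run_elt([
    ("o1", "c1", 1000, "paid"),
    ("o2", "c1", 500, "paid"),
    ("o3", "c2", 700, "refunded"),
    ("o4", "c2", 300, "paid"),
])
print(conn.execute("SELECT * FROM customer_revenue ORDER BY customer_id").fetchall())
# prints: [('c1', 2, 1500), ('c2', 1, 300)]
```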
Make sure your engineering and product teams are collaborating closely with
whichever member of your team owns data and business intelligence. Bringing in
the data perspective early in the product process will save a lot of headaches down
the road, with a *measure twice, create the data schema once* mindset.
Behavioral data (also called behavioral analytics events) describes how users have
used your application. Behavioral data is often fairly high volume, with a somewhat
limited schema, and is best used in combination with powerful visualization
software.
One important distinction between behavioral data generated by your application and
transactional data is precision. Most behavioral data is lossy: users have ad
blockers, requests get dropped, or firewalls get in the way. There are many reasons
why events might not make it from a client device to your customer data platform
(CDP). That doesn't mean behavioral data is not useful, but being aware of its
lossiness should inform your expectations for the data and limit the use cases you
rely on it for. If you need exact numbers, expect to derive those from your BI
platform and your transactional data.
Let's close out this section with a few overall recommendations for designing your
architecture.
As you build out your application, you're often confronted with the choice of where
logic should live: on the client (e.g., web browser, mobile device, physical
hardware) or on some form of backend server. For certain types of logic, such as
anything related to authentication, value calculations, or anti-cheating/tampering
mechanisms, keeping it on the backend is a firm requirement. For most other logic,
it's still a good idea, for the following reasons:
* Backends are often easier to test than clients, so you can more confidently
confirm the correctness of business logic on the backend.
* More logic on the backend means thinner clients, and also means you can produce
clients for multiple platforms that leverage a single source for logic, reducing
code duplication.
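Value calculations are the clearest case for backend logic. In this hypothetical Python sketch, the backend computes an order total from its own price catalog and ignores any price the client sends, so a tampered request cannot change what the customer is charged. The SKUs, prices, and payload shape are invented for illustration.

```python
from typing import Dict, List

# Server-side source of truth for prices (hypothetical SKUs, in cents).
PRICE_CATALOG_CENTS: Dict[str, int] = {"sku-basic": 1_000, "sku-pro": 2_500}


def order_total_cents(items: List[dict]) -> int:
    """Compute the total from trusted server data; client-sent prices are ignored."""
    total = 0
    for item in items:
        sku = item["sku"]
        quantity = int(item["quantity"])
        if sku not in PRICE_CATALOG_CENTS:
            raise ValueError(f"unknown SKU: {sku}")
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        total += PRICE_CATALOG_CENTS[sku] * quantity
    return total


# A tampered client payload claiming the Pro plan costs one cent.
payload = [{"sku": "sku-pro", "quantity": 2, "unit_price_cents": 1}]
print(order_total_cents(payload))  # prints: 5000
```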
The APIs from your backend to other backends, or from your backend to frontends,
should be thought of as general-purpose APIs that could be consumed by third parties. This
forces you to maintain several good design habits, including ensuring that
interfaces are comprehensible on their own (domain-driven design), and using
sensible authentication mechanisms and appropriate high-level ownership
abstractions in data design. And, on the off chance you do one day wish to
externalize a service, the road to doing so will be much shorter.
## Tools
The tooling ecosystem and patterns for software engineering are constantly evolving.
You will inevitably be tempted, either on your own or by members of your team, to
change something about how you're doing engineering, such as adopting a new
library, framework, language, or pattern. Adopting every such change as it comes up
quickly leads to a patchwork quilt of poorly thought-out architecture. Conversely,
ignoring all change leaves you with a stale codebase that, over time, will be less
efficient and harder for newly hired talent to work on. The right approach is to
formalize the process of changing your tech stack and provide some guardrails to
motivate your team to be curious and thoughtful about tooling changes.
In most cases, I find that when a blip trial fails, it fails relatively early on
and the engineer leading the project doesn't include the blip in the final
delivered implementation.
It's a fact of life that modern startups spend a lot of money on Software as a
Service (SaaS). Your company is likely no exception, so don't be surprised when you
find you are spending the equivalent of an entire headcount or more on
infrastructure and tools before your Series B fundraise.
### Budget
There are a handful of published benchmarks for SaaS and tool spend at various
company stages, expressed as a percentage of either company revenue or total spend.
There's no single precise benchmark, but typical SaaS Cost of Goods Sold (COGS)
seems to fall somewhere between 10 and 30 percent of revenue.
Know your spend and keep an eye on cost growth. It's very easy to accidentally
leave a couple of machines running in AWS and add $10,000 to your annual cloud
hosting bill. Most cloud platforms have built-in budgeting features, so there's no
excuse not to use them. If you're using infrastructure as code, it's easy to set up
a module that, for every new cloud system deployed, automatically applies a cloud
budget that monitors and alerts on cost for that particular system.
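Budget alerts generally boil down to comparing actual (or forecasted) spend against a budget at a few thresholds. As a rough Python sketch of that logic, with hypothetical threshold percentages and a naive straight-line month-end forecast (not any cloud provider's actual algorithm):

```python
from typing import List, Sequence


def forecast_month_end_cents(spend_to_date_cents: int, day_of_month: int, days_in_month: int) -> int:
    """Naive straight-line forecast: assume the rest of the month looks like the days so far."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must fall within the month")
    return spend_to_date_cents * days_in_month // day_of_month


def crossed_thresholds(spend_cents: int, budget_cents: int,
                       thresholds_percent: Sequence[int] = (50, 90, 100)) -> List[int]:
    """Return the alert thresholds (as percentages of budget) that spend has reached."""
    if budget_cents <= 0:
        raise ValueError("budget must be positive")
    # Integer math avoids floating-point surprises right at a threshold.
    return [pct for pct in thresholds_percent if spend_cents * 100 >= budget_cents * pct]


budget = 200_000  # $2,000/month for one system, in cents
spent = 90_000    # $900 spent by day 10 of a 30-day month
print(crossed_thresholds(spent, budget))  # prints: []
projected = forecast_month_end_cents(spent, 10, 30)
print(projected, crossed_thresholds(projected, budget))  # prints: 270000 [50, 90, 100]
```

Alerting on the forecast as well as on actual spend is what catches a forgotten machine early in the month rather than after the bill arrives.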
It's typical for SaaS costs to grow over time, be it because your infrastructure is
growing or because you discover a new SaaS vendor that can save your team time. I
recommend not using cost as a reason to avoid adopting a typical SaaS tool (one
costing in the hundreds of dollars per month). Instead, factor regular growth into
your SaaS cost forecasts.
### Tracking
You should be tracking how much your organization spends on engineering tools,
including IDEs, SaaS, and infrastructure (cloud platforms). You can do this
manually in a spreadsheet or using a SaaS Management Platform (SMP). Available from
various vendors such as BetterCloud, Zluri, and Vendr, these solutions link with
your credit card or bank and automatically categorize cash spend.
## DevOps
To me, the key piece of this definition is that DevOps aims to shorten the software
development lifecycle; in other words, DevOps is an enabler of broader team
productivity. If you're not already specifically thinking about DevOps, you're
likely underprioritizing or underinvesting in it to some degree. It's not just my
opinion; it's becoming widely accepted throughout the tech industry that
high-quality DevOps is a key driver of overall engineering velocity.
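**Deployment Frequency:** How often your team deploys code to production
**Lead Time:** How long it takes for code to go from commit to running in
production
**Change Failure Rate:** The percentage of deployments that cause a failure in
production
**Mean Time to Recovery (MTTR):** How long it takes to restore service after an
incident/defect occurs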
Together these metrics quantify how confidently your team can deploy software.
Scoring high on all four metrics requires an investment in automation, DevOps,
testing, and culture. As Thoughtworks is quick to point out, drawing value from
these metrics doesn't necessarily require highly detailed instrumentation, metrics,
or dashboards. DORA publishes a quick check survey (ctohb.com/dora) that your team
can take to track its progress at a coarse-grained level. There are also plenty of
tools with fairly low barriers to entry, such as LinearB or Code Climate, that yield
data more than sufficient to inform your progress.
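The arithmetic behind the four metrics is simple once you have a log of deployments. The Python sketch below is one illustrative way to compute them; the `Deployment` record shape is hypothetical, and it approximates recovery time as the gap between a failed deployment and the restoration of service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    caused_failure: bool = False             # did it degrade production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.caused_failure]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```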
The following subsections on DevOps present concepts, disciplines, and focus areas
that contribute in some way to improving these metrics.
### Reproducibility
### Containerization
The most common way of explaining the role of containers in the DevOps context is
to consider where the name originated: from shipping containers. Prior to the
standardization of shipping containers, if you wanted to transport goods across the
ocean, you would package your cargo in a variety of forms, ranging from placing it
on pallets to storing it in boxes or barrels or simply wrapping it in cloth. Loading
and unloading goods that arrived packaged in all these different ways was
inefficient and error-prone, mainly because there was no single kind of crane or
wheelbarrow that could effectively move all of the cargo.
The most common way you'll interact with containers is through a software system
called Docker. Docker provides a declarative language that lets you describe, in a
file called a Dockerfile, how you want the system set up: what programs need to be
installed, what files go where, and what dependencies need to exist. You then build
that file into a container image, which provides a representation of the entire
file system specified by your Dockerfile. That image can then be moved to and run
on any other machine with a Docker-compatible container runtime, with the guarantee
that it will start in an isolated environment with the exact same files and data,
every time.
Build the container image once (say, in CI) so that it can run in your various
environments (production, development, etc.). By using a single image, you guarantee
that exactly the same code with exactly the same setup will transition intact from
development to production.
Once you're building container images and moving them around, you'll immediately
want to be organized about managing the built images themselves. I recommend
tagging each image with a unique value derived from source control (e.g., the git
hash of the commit from which the image was built), perhaps combined with a
timestamp, and hosting the image in an image registry. Docker Hub has a private
registry product, and all the major cloud platforms also offer hosted image
registries.
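As a small Python sketch of that tagging convention, assuming a hypothetical registry path and a tag made of a UTC timestamp (so tags sort chronologically) followed by the short git hash:

```python
import re
import subprocess
from datetime import datetime, timezone
from typing import Optional

# Docker tags: up to 128 chars of [A-Za-z0-9_.-], not starting with '.' or '-'.
TAG_PATTERN = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def current_git_sha() -> str:
    """Full hash of the checked-out commit (requires git and a repository)."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()


def image_reference(repository: str, git_sha: str, built_at: Optional[datetime] = None) -> str:
    built_at = built_at or datetime.now(timezone.utc)
    tag = f"{built_at:%Y%m%dT%H%M%SZ}-{git_sha[:12]}"
    if not TAG_PATTERN.match(tag):
        raise ValueError(f"invalid image tag: {tag}")
    return f"{repository}:{tag}"


print(image_reference(
    "registry.example.com/acme/api",  # hypothetical registry path
    "3f2c9a1b7d4e5f60718293a4b5c6d7e8f9a0b1c2",
    datetime(2024, 5, 1, 12, 30, 0, tzinfo=timezone.utc),
))
# prints: registry.example.com/acme/api:20240501T123000Z-3f2c9a1b7d4e
```

In CI, you would typically pass `current_git_sha()` and let the timestamp default to the build time.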
Many hosted registries will also provide vulnerability scanning and other security
features attached to their image registry.
Smaller Docker images upload faster from CI, download faster to application
servers, and start up faster. The difference between uploading a 50MB image and a
5GB image, from an operational perspective, can be the difference between five
seconds to start up a new application server and five minutes. That's five more
minutes added to your time to deploy, Mean Time to Recovery/rollback, etc. It may
not seem like much, but especially in a hotfix scenario, or when you're managing
hundreds of application servers, these delays add up and have real business impact.
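Because an image is built up from layers (each filesystem-changing instruction in a
Dockerfile adds a layer, and files deleted in a later layer still take up space in
the earlier one), you can minimize the total image size of your container by
keeping the individual layers small, and you can minimize a layer by ensuring that
each command cleans up any unnecessary data before moving to the next command.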
Another technique for keeping image size down is to use multi-stage builds.
Multi-stage builds are a bit too complex to describe here, but you can check out
Docker's own article on them at ctohb.com/docker.
Now that you've got reproducible images of reasonably small size managed in a
hosted registry, you have to run and manage them in production. Management
includes:
Some common hosted container platforms include Heroku, Google App Engine, AWS
Elastic Beanstalk, and Google Cloud Run. Vercel is another popular hosted backend
solution, though it does not run containers as described here.
##### Self-Managed/Kubernetes
ClickOps refers to the process of configuring your cloud infrastructure by clicking
through a web user interface, as opposed to using the provided APIs. As your
infrastructure grows, the
quantity of nuance and detail in your system will quickly exceed your ability to
reproduce it with ClickOps. ClickOps is fine for prototypes or proof of concepts,
but when the time comes to actually build a production environment and mirrored
development environments, using ClickOps will quickly lead to considerable
frustration and cost, as well as limited capability. The alternative to ClickOps is
known as Infrastructure as Code (IaC).
There are several tools and frameworks that allow you to define IaC. The leading
one is HashiCorp's Terraform. Terraform uses HashiCorp Configuration Language (HCL), a
declarative configuration syntax, to allow engineers to define what resources they
want and how they are to be configured. Terraform code can and should then be
managed like any other code, using source control and peer review practices. Once
approved, Terraform can generate infrastructure change plans and apply those plans
for you with your cloud provider(s) of choice. I cannot emphasize enough how easy
to use, powerful, and maintainable Terraform is, and how much ROI you're likely to
gain by migrating from ClickOps to IaC.
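Terraform's core loop is to compare the desired state declared in code against the current state and produce a plan of creates, updates, and deletes. The Python sketch below illustrates only that idea, not Terraform's actual implementation; the resource names and attributes are hypothetical.

```python
from typing import Dict, List

Resources = Dict[str, dict]  # resource address -> configured attributes


def plan(desired: Resources, current: Resources) -> Dict[str, List[str]]:
    """Diff desired state (from code) against current state (from the provider)."""
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(
            name for name in desired.keys() & current.keys()
            if desired[name] != current[name]
        ),
        "delete": sorted(current.keys() - desired.keys()),
    }


desired = {
    "dns_record.www": {"type": "CNAME", "value": "app.example.com"},
    "dns_record.api": {"type": "A", "value": "203.0.113.10"},
    "bucket.assets": {"versioning": True},
}
current = {
    "dns_record.www": {"type": "CNAME", "value": "old-app.example.com"},
    "bucket.assets": {"versioning": True},
    "vm.forgotten": {"size": "large"},
}
print(plan(desired, current))
# prints: {'create': ['dns_record.api'], 'update': ['dns_record.www'], 'delete': ['vm.forgotten']}
```

Because the desired state lives in source control, every plan like this one can be peer reviewed before it is applied, which ClickOps cannot offer.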
* Ensure the team understands the CI system and is comfortable adding to it,
updating it for new requirements, and troubleshooting when things inevitably cause
the build to fail.
* Ensure builds are consistent and deterministic. Unreliable or flaky builds are a
major productivity drain and time sink.
* Try to keep build times down. For most teams, a good target is for CI to take less
than fifteen minutes.
* Learn the capabilities of your CI tool that aid in keeping builds fast,
including build caches, build artifacts, and running jobs in parallel.
* Builds can get complex. Try to keep your CI code *DRY* (don't repeat yourself) by
reusing code between build pipelines where possible.
* Be consistent in how your builds access secrets. Either depend on a cloud secret
manager or on build environment secrets where necessary, and try not to mix them.
There should be one obvious and consistent way to handle config and secrets.
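One cheap way to hunt down flaky builds, sketched in Python under the assumption that you can export CI test results as (commit, test name, passed) rows: any test that both passed and failed on the same commit is nondeterministic by definition, since the code did not change between runs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

TestResult = Tuple[str, str, bool]  # (commit_sha, test_name, passed)


def find_flaky_tests(results: Iterable[TestResult]) -> List[str]:
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for commit_sha, test_name, passed in results:
        outcomes[(commit_sha, test_name)].add(passed)
    # Both True and False seen for the same code means the test is flaky.
    return sorted({test for (_, test), seen in outcomes.items() if len(seen) == 2})


history = [
    ("a1b2c3", "test_checkout", True),
    ("a1b2c3", "test_checkout", False),  # same commit, different outcome
    ("a1b2c3", "test_login", True),
    ("d4e5f6", "test_login", False),     # different commit: may be a real regression
]
print(find_flaky_tests(history))  # prints: ['test_checkout']
```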
In the early stages of a project, it's comparatively simple to deploy new code. At
that point, there's not much code or architectural complexity. Before long,
however, the need to manage dependencies, dependent services, CDNs, firewalls,
build artifacts, build configuration, secrets, and more leads to a complex deploy
process. As these requirements accumulate, it's easy to neglect automation, and
simply rely on, for example, a dedicated and highly trusted individual as release
manager. There are countless teams out there following this pattern, and I
guarantee you most of those release managers have release dates circled on their
calendar in advance and dread the stress, long hours, and frustration that those
days inevitably entail.
Fortunately, there's a cure for the error-prone stress concentration that is the
monthly (or longer!) release day: release every day, or heck, every hour!
It follows logically from one of the Ten Pillars of Tech Culture (see page 138),
Frequency Reduces Difficulty, that more frequent releases will force your release
manager, and team, to automate the hard parts of deployments. With enough
iteration, releases can become entirely automated, and with sufficient testing
giving you confidence in new changes, you can get to the point of triggering new
releases for every code change set, referred to as Continuous Deployment.
An automated release process tends to also imply an improved ability to roll back
changes or recover from an issue in production. This is also measured as Mean Time
to Recovery (MTTR).
In summary, automating releases means code goes out faster (reducing lead time),
means you can deploy more often (increasing deployment frequency), and improves
MTTR. That's three of the four key DORA metrics (see page 208) with one initiative!
In the companies I've worked with, either directly or in an advisory capacity, I've
seen at least a dozen teams invest effort into either partially or fully moving to
continuous deployment. It's not always a straightforward journey, it doesn't happen
overnight, and often there are well-reasoned objections. Yet in every circumstance,
when the team looks back on the time invested (be it three weeks, three months, or
two years later), the difference is nothing short of transformational to the culture
and overall output and velocity of the team by every metric.
Feature branch environments also solve the contention problem encountered by teams
that have only a single staging environment. I advocate giving every branch its own
staging environment.
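A common building block for per-branch environments is turning a git branch name into a DNS-safe name for the environment's URL and resources. Here is a Python sketch; the `staging.example.com` suffix is hypothetical, and it relies on the DNS rule that a single label is at most 63 characters of letters, digits, and hyphens, not starting or ending with a hyphen.

```python
import hashlib
import re

MAX_DNS_LABEL_LENGTH = 63
ENVIRONMENT_DOMAIN = "staging.example.com"  # hypothetical parent domain


def branch_environment_label(branch: str) -> str:
    """Map a branch name to a lowercase, DNS-safe label that is stable per branch."""
    label = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    if not label:
        raise ValueError(f"cannot derive a label from branch {branch!r}")
    if len(label) > MAX_DNS_LABEL_LENGTH:
        # Truncate, then append a short hash so long, similar branch names stay unique.
        digest = hashlib.sha256(branch.encode("utf-8")).hexdigest()[:8]
        label = label[: MAX_DNS_LABEL_LENGTH - 9].rstrip("-") + "-" + digest
    return label


def branch_environment_host(branch: str) -> str:
    return f"{branch_environment_label(branch)}.{ENVIRONMENT_DOMAIN}"


print(branch_environment_host("feature/ABC-123_New Login"))
# prints: feature-abc-123-new-login.staging.example.com
```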
Having looked at this list of concerns, you may be daunted by the burden of setting
up and maintaining feature branches. Indeed, a proper feature branch setup is not
cheap, but the value it offers is substantial in improving your ability to test,
and in reducing the logistical overhead required to verify different kinds of
software changes.
Knowing how the domain name system (DNS) works, and knowing how to manage DNS and
its security implications for your company, is a critical job that often lands on
the startup CTO. If you're not already familiar with the basics of how DNS works
and the different record types, I recommend you spend a few minutes browsing
Wikipedia now to get a grounding on the subject.
You should also know how DNS is used for email in your organization. In particular,
become familiar with Sender Policy Framework (SPF) records, DomainKeys Identified
Mail (DKIM), and Domain-based Message Authentication, Reporting, and Conformance
(DMARC).
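DMARC policies are published as a TXT record at `_dmarc.<your domain>`, containing semicolon-separated tag=value pairs. The Python sketch below parses such a record to show its structure; the domain and reporting address are hypothetical, and a real validator would check far more of the specification.

```python
from typing import Dict


def parse_dmarc_record(record: str) -> Dict[str, str]:
    """Split a DMARC TXT record into its tags (minimal checks only)."""
    tags: Dict[str, str] = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[name.strip()] = value.strip()
    if next(iter(tags), None) != "v" or tags["v"] != "DMARC1":
        raise ValueError("record must start with v=DMARC1")
    if "p" not in tags:
        raise ValueError("record must include a policy (p=none, quarantine, or reject)")
    return tags


# Published as a TXT record at _dmarc.example.com (hypothetical domain).
record = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com; pct=100"
print(parse_dmarc_record(record)["p"])  # prints: quarantine
```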
You should also set up your DNS records using Infrastructure as Code (IaC). I've
seen countless companies where DNS is managed exclusively by an executive whose
two-factor authentication is the only one allowed to update a zone record, and when
that person is on vacation, there's no fallback mechanism for managing DNS.
A better solution is to set up DNS with Terraform (which has integrations with all
the major DNS providers) and then manage DNS records with source control,
empowering individual developers to add new records in a responsible way that isn't
gated on any one individual.
At a high level, a feature toggle is a switch that allows you to change system
behavior without changing the actual code. I strongly advocate using feature
toggles, in particular because they allow your team to have separate processes and
timelines for shipping code and shipping features. Pete Hodgson has a wonderful,
in-depth explanation of feature toggles at ctohb.com/hodgson.
The Four Key Metrics advocate for shipping code as frequently as possible. Doing so
leads to many benefits for the health of your engineering process. A
natural concern, however, is that your business may not be ready for a particular
feature to go live the second the code is done. There are many reasons why your
development and release schedules might be out of sync in this way, such as the
need to coordinate timing with a marketing activation, the creation of customer
support documentation, awaiting regulatory approvals, pausing for internal
communication, etc. Feature toggles enable your engineering team to focus on
shipping as quickly and reliably as possible and delegate the problem of
coordinating when features are enabled to an out-of-band process likely owned by
other teams.
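At its simplest, a feature toggle is a lookup consulted at runtime. The Python sketch below keeps flags in a hypothetical in-code dictionary and supports a percentage rollout by hashing the flag name and user ID, so each user gets a stable answer. Real deployments usually read flags from a toggle service or configuration store so they can change without a deploy.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical flag configuration; normally loaded from a flag service or config store.
FLAGS: Dict[str, dict] = {
    "new-checkout": {"enabled": True, "rollout_percent": 25},
    "dark-mode": {"enabled": False, "rollout_percent": 100},
}


def is_enabled(flag: str, user_id: str, flags: Optional[Dict[str, dict]] = None) -> bool:
    config = (FLAGS if flags is None else flags).get(flag)
    if config is None or not config["enabled"]:
        return False  # unknown or switched-off flags fail closed
    # Hash flag+user into a stable bucket from 0 to 99.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < config["rollout_percent"]


if is_enabled("new-checkout", "user-42"):
    print("render new checkout")
else:
    print("render existing checkout")
```

Shipping the code with the flag off, then raising `rollout_percent` when the business is ready, is what separates shipping code from shipping features.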
Application performance monitoring (APM) and real user monitoring (RUM) are two
types of tools that help teams understand how an application is performing in
production and identify or prevent user-facing outages.
An APM tool usually sits inside or alongside your application in the production
environment and provides analytics and insights into resource usage and request
latency and throughput from the perspective of your backend. RUM is an external
tool that pretends to be a user and provides analytics on latency from the
frontend's perspective, or the perspective of a real user.
If you had to choose one (and these tools are often very expensive, so you may be
forced to go with one or the other), choose the one that covers the blind spot more
likely to be problematic for your application. If you have tons of users and you
get inundated with real-time complaints for every minor bug or edge case, then an
APM monitoring backend load may prove more valuable than a RUM tool producing alerts
that merely duplicate what your users are already reporting. For most startups in
seed or growth stages, though, you'll
likely have inconsistent application usage, especially covering your edge cases, in
which case RUM may be more valuable than an APM on a mostly idle backend.
Some common tools in this space are New Relic, Datadog, and Akamai mPulse.
## Testing
Imagine that, instead of testing software code, your job was to be the inspector
for a municipal bridge. The bridge is built and now it's your job to inspect it and
decide whether or not to allow it to open to the public. Do you inspect every
single bolt and rivet on the bridge? Doing so would probably give you high
confidence the bridge was safe, but it would also take a long time and derail the
mayor's plans to open the bridge next week.
Another reasonable strategy would be to decide exactly which elements of the bridge
are essential to meet the designated safety factor and test/inspect those. Maybe
that's only every other rivet, plus all the cables and structural concrete.
Similarly, in software engineering testing, the goal isn't coverage for coverage's
sake, but coverage that provides confidence that the software does what it is
intended to do.
Effective software testing is not always about 100 percent code coverage. The bar
for a good software test suite is that it gives your team confidence that, when the
build is green and all tests pass, the software is ready to be released to end
users. That may mean 100 percent code coverage, or it may mean 30 percent code
coverage. The exact number is up to you to determine and monitor, and that amount
of effort may change over time if you find the tests are not providing the same
confidence they once were (and, conversely, if you find you're overinvesting, i.e.,
you're spending a lot of resources on a test suite, yet bugs are still making it
out too often).
Depending on the size of your team, you'll either have no dedicated test team, one
test team working on one or many types of tests, or many test teams working on
various kinds of software testing. No matter who is doing the testing, it's important
to recognize that software testing is a complicated process, and nuance matters. To
test software effectively, the tester, whether they wrote the code or it was thrown
over the fence to them, has to deeply understand what the code/software should be
doing. Your role is to set up your teams so that they can empathize with each
other. To do this, ensure that the teams share goals/KPIs, that your process has
robust and continuous communication between the developer and tester, and monitor
that the teams have a healthy, productive relationship.
Before jumping into the nuts and bolts of the different software testing paradigms,
it's worth thinking about what the purpose of software testing is, and thus what
makes a good test, and conversely what makes a bad test.
Defining bad tests is simple. Bad tests have a higher cost than benefit to your
team. Some common characteristics of bad tests:
* Maintaining the test and fixing it after legitimate code changes takes more time
than the bugs it catches save you.
* The test has a high false positive rate or is inconsistent, which slows down CI
and causes developers frustration and context-switching cost to re-run spurious
failures.
* The test is poorly thought out and literally validates that the code does the
wrong thing.
* The suite tests that the code does the right thing and has low false positives,
but it's so convoluted and hard to understand that only the person who wrote the
test can add a new one, and every other engineer who looks at it gets a migraine.
* The test does not instill confidence that the code under evaluation is ready to
be shipped to end users.
With that picture in mind, it's relatively easy to see by contrast what attributes
good tests should have. Good tests are:
* Capable of running reliably and consistently producing the same result
* Easy to augment
One of the best ways to evaluate your testing approach is also the most obvious:
ask how your team feels about the tests with a simple sentiment analysis. The
results tend to be very binary: either tests are a source of security that teams
rely on and naturally augment because they are obviously a net value-add, or
teams passively, or even actively, hate their tests because they're a drain on
productivity with not enough obvious value.
Your team should be adding testing in ways that boost confidence that the system
works correctly while minimizing the additional burden of maintaining those tests
long term in the face of organic system growth. A general pattern to minimize this
pain is to test the public interface. Public interfaces should be well thought out
and comparatively stable over time. They are also the happy path that consumers of
the software will actually experience.
Tests on public interfaces should therefore change little over time and also
provide confidence that the parts of the code that matter are working correctly.
The public interface for your software may vary from project to project. For many
projects, it'll be an actual HTTP-based API; for some, it'll be a set of
functions/classes in a library or internal service. For other projects it may be a
user interface.
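As an illustration, the sketch below tests a hypothetical `PriceCalculator` only through its public `total()` method, using Python's built-in `unittest`. The pricing rules, including the 10 percent bulk discount, are invented for the example. Because the tests never call the underscore-prefixed helpers, those helpers can be renamed, merged, or rewritten without any test changes.

```python
"""Testing through the public interface (hypothetical pricing module)."""
import unittest


class PriceCalculator:
    """Public interface: total(items) returns cents. Everything else is internal."""

    def total(self, items: list[tuple[int, int]]) -> int:
        # items are (unit_price_cents, quantity) pairs
        subtotal = sum(self._line_total(price, qty) for price, qty in items)
        return self._apply_bulk_discount(subtotal)

    # Internal helpers: free to change without breaking the tests below.
    def _line_total(self, price: int, qty: int) -> int:
        return price * qty

    def _apply_bulk_discount(self, subtotal: int) -> int:
        # Assumed rule: 10% off orders of $100.00 (10_000 cents) or more.
        return subtotal * 9 // 10 if subtotal >= 10_000 else subtotal


class PriceCalculatorTest(unittest.TestCase):
    # Only total() is exercised, so the tests survive internal refactors.
    def test_small_order_has_no_discount(self):
        self.assertEqual(PriceCalculator().total([(500, 2)]), 1_000)

    def test_large_order_gets_bulk_discount(self):
        self.assertEqual(PriceCalculator().total([(5_000, 2)]), 9_000)
```

Run it with `python -m unittest` from the directory containing the file.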
Software testing can be broken down into the following categories/paradigms:
* Unit testing
* Integration testing
* End-to-end testing
* Manual testing
* Semi-automated testing
**Change granularity detection:** The size or type of code change that a test is
likely to detect by failing.
**Run cost:** How expensive it is to run the tests, in time or in dollars.
**Setup effort:** How much effort is necessary to set up an effective test suite.
The following chart summarizes how the five paradigms map to these metrics. Note
that there is no one perfect test paradigm; each has tradeoffs, and I encourage you
to think carefully about which tradeoffs make sense for your company and codebase.
The right answer is usually a blend of different test paradigms, using each type of
test where it adds the most value in your application.
Unit testing is often the first thing that comes to mind when somebody talks about
software testing. It's what is most widely taught in school and included in textbooks,
and it usually gets the most effort in the real world. Given all this, you'd think
unit tests were the best type of test, though I'd argue that's not always the case.
Let's begin by clearly defining unit tests so as to differentiate them from other
testing paradigms.
Unit tests run entirely in memory on a machine, in a shared memory space with the
code under evaluation, without requiring any network connectivity or dependency on
external services. Most unit tests are very fast to run, test very small amounts of
code at a time, and are relatively easy to get started with. Unit tests are usually
also tightly coupled to your code contracts and often require new code in the form
of mocks to enable code under test to execute without normally available external
dependencies.
The key tradeoff made by unit tests and their primary downside is that they are
tightly coupled to the code under test. It's altogether too easy to have unit tests
be deeply intertwined with actual function calls and internal data objects. This
deep dependency means any refactor of the code, even a simple and benign change, will
require considerable unit test updates as well. Mocks and test data fixtures, which
allow unit tests to run, also often require a considerable amount of code to create
and maintain, adding unexpected cost to unit testing.
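The sketch below shows both sides of that tradeoff using Python's standard `unittest.mock`. The `WelcomeService` class and its `email_client.send(to=..., subject=...)` dependency are hypothetical. The mock lets the test run entirely in memory, but the final assertion pins the exact call signature, so a harmless rename of `send` or its arguments would break the test even though user-visible behavior is unchanged.

```python
"""A unit test that swaps an external dependency for a mock."""
from unittest.mock import Mock


class WelcomeService:
    def __init__(self, email_client):
        # email_client is an external dependency (e.g., a wrapper around an email API)
        self.email_client = email_client

    def welcome(self, address: str) -> bool:
        return self.email_client.send(to=address, subject="Welcome!")


def test_welcome_sends_email():
    # The mock stands in for the network-backed client: no network needed.
    client = Mock()
    client.send.return_value = True

    assert WelcomeService(client).welcome("ada@example.com") is True

    # Coupled to internals: this breaks if `send` or its keywords are renamed.
    client.send.assert_called_once_with(to="ada@example.com", subject="Welcome!")
```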
An end-to-end (e2e) test exercises code in the same manner as the end user. For
backend code, end-to-end tests and integration tests may function in the same way,
running code via an external contract. For user-facing frontend code, an end-to-end
test usually involves mechanizing the client interface, often a web browser or
mobile phone. I would not encourage any tech leader to write their own client
mechanizing code; this is a particularly gnarly problem that downloadable tools like
Selenium, Cypress, and Puppeteer can take off your hands. For mobile, there are
tools like HeadSpin and Detox.
The key tradeoff of end-to-end tests, in the real world, is reliability. At least
at the time of writing, reliable web end-to-end tests are still somewhat elusive;
the nature of how the browser renders means race conditions occur very easily by
accident. Building a reliable web end-to-end test suite requires considerable care,
attention to detail, and maintenance.
The payoff, though, is not insubstantial, as it's possible to create very high test
coverage and very high confidence in functionality of user-facing flows with e2e
test suites. There are several companies, including testim.io and rainforestqa.com,
that are exploring using AI and ML to solve this problem. These solutions deploy
fuzzy visual testing instead of relying on the presence or absence of CSS
selectors, for example, to try and improve test reliability. Hopefully by the time
you read this, the state of the art will have advanced a bit further, and the value
proposition of end-to-end tests will be even stronger than at the time of this
writing.
Visual regression testing is a relatively new paradigm that aims to detect defects
in user-facing applications by computing deltas between rendered visuals. There are
frameworks that do that at various levels of granularity, ranging from capturing
screenshots of entire pages to rendering individual components, and producing
deltas to detect defects. The obvious drawback is that every intentional change to
any visual tested component will require a change to a test.
Fortunately, these test frameworks often make it simple and painless to regenerate
the set of correct visuals for comparison, which opens up another pitfall: with
easy tooling to overwrite the testing target, it becomes very easy to produce false
negatives, accidentally accepting a visual delta that is, in fact, erroneous.
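At its core, a visual delta is a pixel comparison against an approved baseline. The toy sketch below represents images as 2D lists of RGB tuples instead of using a real image library, and the 1 percent tolerance is an arbitrary assumption; production tools are considerably more sophisticated, for example in how they handle anti-aliasing.

```python
"""Toy visual-regression check on images stored as 2D lists of pixels."""

Pixel = tuple[int, int, int]  # (R, G, B)


def diff_ratio(baseline: list[list[Pixel]], candidate: list[list[Pixel]]) -> float:
    """Fraction of pixels that differ between two same-sized images."""
    if len(baseline) != len(candidate) or any(
        len(a) != len(b) for a, b in zip(baseline, candidate)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in baseline)
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if a != b
    )
    return changed / total if total else 0.0


def visual_check(baseline, candidate, tolerance: float = 0.01) -> bool:
    # A small tolerance absorbs rendering noise; anything larger is flagged
    # for a human to accept (by replacing the baseline) or reject.
    return diff_ratio(baseline, candidate) <= tolerance
```

Note that accepting a change amounts to overwriting the baseline with the candidate, which is exactly why it is so easy to bless an erroneous delta.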
Manual testing, as the name implies, is run by humans and not machine code. For
humans testing code, we can further subdivide this category into specialized and
unspecialized testers.
A great in-house manual testing team can provide a huge improvement in overall
software and product quality.
Specialized manual testing teams should be creating detailed test plans for product
functionality and storing those plans in an easy-to-retrieve, repeatable fashion,
ideally with a tool like TestRail that allows creating full test suites of manual
test plans which can be rerun by the team manually on demand. Another benefit of a
tool like this is integration with other developer and product tools, for example,
linking a TestRail run to a Jira epic to show how many manual and regression tests
were run for the release of a given feature. Not only is this valuable as a launch
checklist item, but it also aids in retrospectives for any released defects,
allowing you to revisit what manual tests were run before releasing any given
feature, and adding additional manual tests to catch any defects that made it
through.
The major pitfall of these tests is reliability. Because they're created by
non-technical staff, they may not be quite as precise as fully automated tests, making
them more prone to false positives and false negatives. That said, this space is
rapidly evolving with new companies and tools launching every year, such as
Rainforest QA and Testim, with strategies to improve reliability and lower overall
cost.
## Source Control
The industry standard used by the vast majority of companies for managing source
control is Git. There would need to be a very compelling reason for your
organization to use anything else; most of your current and future team members
will arrive already knowing the basics of Git. By not using it, you'd likely be
inflicting an unnecessary learning curve on them by forcing them to learn your
chosen alternative.
There are three main cloud Git hosting platforms: GitLab, GitHub, and Bitbucket.
These three have the lion's share of the market and, again, you should think
carefully before deviating from these standards.
Git has an interestingly shaped learning curve. Most people reach a plateau where
they operate with a rudimentary understanding that gets them by in most happy-path
scenarios. However, sometimes things go wrong, and a developer loses a
commit or makes a mess with a merge. When this happens, their failure to climb the
rest of the Git learning curve will cause frustration and slowdown. I encourage
you, as the team lead, to invest the effort to climb the back half of the curve.
Use Git exclusively on the command line to become familiar with what is actually
happening. Learn about the reflog, interactive rebases, bisect, and the various
built-in merge strategies. Armed with this knowledge, you can bypass an entire
class of productivity-draining problems and train your team to become Git experts
as well.
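As one example of what that deeper knowledge buys you, `git bisect` finds the commit that introduced a bug by binary search over history. The sketch below models the idea in Python on a simplified, linear list of commits (real Git also handles merge history). The commit names and the `is_bad` check are hypothetical stand-ins for checking out a commit and running a test against it.

```python
"""A simplified model of the binary search that `git bisect` performs."""
from collections.abc import Callable


def first_bad_commit(commits: list[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and once a commit is bad all later ones are too.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the bug was introduced at or before mid
        else:
            lo = mid  # the bug was introduced after mid
    return commits[hi]
```

Each check halves the remaining range, so 100 commits need at most seven checks instead of a commit-by-commit walk.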
In general, experts recommend implementing a robust peer review process for all
code changes. (As I write this in early 2023, there is a growing movement
challenging this recommendation, or at the least adding nuance to it, which I'll
discuss in the next section.) Most peer review is done with what is called,
depending on your code hosting solution, a pull request, code review, or merge
request. Here are some suggestions for keeping code review productive and
efficient:
* Keep reviews small! Set a maximum size for code reviews, something like ten files
and 200 lines. Anything larger should be broken out into multiple stacked/incremental
reviews. (A stacked review is a code review that is based on or dependent on a
prior review. When completed, the reviews are merged sequentially to add up to a
complete change.)
* Establish the goals for code review upfront with your team and bake them into
your culture. Code review is not for style or petty semantics; that's what your
auto code formatter/linter and static analysis are for. Code review's purpose is to
ensure clarity, identify architectural concerns, flag defects and deviations from
patterns, note edge cases, and guarantee adherence to business rules.
* Require the author to make the reviewer's job easy. Authors should include a
description of the change, a link to relevant requirements and tickets, and a video
walkthrough (using a tool like loom.com) of the code and the code working as
intended.
* Encourage the author of any given code review to do a self-review before asking
others to review. A few well-placed comments from the author to guide readers can
save a lot of time.
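A size limit is easy to enforce automatically. Here is a minimal sketch that checks the output of `git diff --numstat` (one `added<TAB>deleted<TAB>path` row per file, with `-` in place of counts for binary files) against the ten-file, 200-line guideline. How you wire it into CI, and which base branch you diff against, are assumptions left to you.

```python
"""Check a change against a 'ten files, 200 lines' review-size guideline."""

MAX_FILES = 10
MAX_LINES = 200


def review_size(numstat: str) -> tuple[int, int]:
    """Return (files_changed, lines_changed) from `git diff --numstat` output."""
    files = lines = 0
    for row in numstat.strip().splitlines():
        added, deleted, _path = row.split("\t", 2)
        files += 1
        # Binary files are reported as "-\t-\tpath": count the file, not lines.
        if added != "-":
            lines += int(added) + int(deleted)
    return files, lines


def is_reviewable(numstat: str) -> bool:
    files, lines = review_size(numstat)
    return files <= MAX_FILES and lines <= MAX_LINES
```

For example, you might feed it the output of `git diff --numstat main...HEAD`, assuming your trunk branch is named `main`.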
The common wisdom is that every code change should be reviewed by two people before
being shipped to customers. As with everything, there are tradeoffs. Manual code
review is not free, nor is it a guarantee of software quality. Given that manual
code review comes at a cost, it's worth thinking about when that cost provides the
highest return and using code review as a tool for the highest-ROI scenarios. This
general idea was popularized by a 2021 blog post by Rouan Wilsenach titled
Ship/Show/Ask (ctohb.com/ssa).
Let's examine the cost of code review. A code review requires two people (call them
the Author and the Reviewer) to experience a number of context switches. A common
asynchronous code review pattern might be as follows:
**Context Switch #1:** Author stops coding on Project 1, sets up code review, and
tags Reviewer. Author starts working on Project 2.
**Context Switch #2:** Reviewer gets a notification, stops their work on Project 3,
and begins review of Project 1. Reviewer leaves feedback for Author, then resumes
work on Project 3.
**Context Switch #3:** Author gets a notification, stops work on Project 2,
addresses Reviewer's feedback on Project 1, and requests another review. Author
resumes work on Project 2.
**Context Switch #4:** Reviewer stops work on Project 3 and, in the best-case
scenario, is now satisfied with the changes in Project 1 and approves the code
review. Reviewer resumes work on Project 3. Worst case, Author and Reviewer must
repeat Context Switches #3 and #4 several times.
There are ways to minimize these context switches, but they too involve tradeoffs.
A common alternative is to do all code reviews as a synchronous pair programming
exercise; however, that strategy trades context switches for synchronous meeting
time, which is still a drag on productivity. No matter how you slice it, human code
review is expensive. Given that cost, changes like the following can often ship
without waiting on a human review:
- Copy/translation updates
- Minor UI changes, ideally submitted with visual evidence of the change
- Test-only changes
- New code that is explicitly not yet used or is disabled behind a feature toggle
- Code that no customer or user is able to access (e.g., hidden pages)
- Code changes that come with tests and involve augmentation of existing patterns
and functionality
- Code that has limited or no real-world usage (e.g., an undeployed product)
- Refactors that can be proven correct via reliable testing
Although I believe this system improves overall team efficiency, I concede that
it's not an option for everyone. Many compliance regimes (such as PCI or SOC 2)
require a policy of 100 percent human code review. The best you can do in that
scenario is comply and perhaps carve out products or feature areas that are not
governed by the compliance framework to experiment with a more nuanced and
efficient process.
There are many ways to deal with source control branching, though the industry as a
whole is building momentum around the concept of trunk-based development. As this
seems to be the most effective and commonly used pattern at this time, it's what
we'll discuss here. If you're seriously considering a different pattern, you'll
find plentiful resources online discussing methodologies and best practices for
alternative approaches.
There are many blog posts with helpful graphics covering git branching models, such
as this one from Reviewpad: ctohb.com/branching. If the following description
doesn't make sense to you, I urge you to consult any of these posts and their
associated visuals.
Trunk-based development, and its slightly more sophisticated cousin, GitHub Flow,
are models of managing source code that aim to minimize the number and duration of
branches. The exact implementation of GitHub Flow and trunk-based development will
vary, but what they have in common is that there is a single branch whose name
varies and doesn't matter much. Here we'll call it production. Production is always
deployable. In fact, I recommend that you set up automation so every commit to
production can actually be deployed to production. Work can then be done in feature
branches off of production, reviewed in the feature branch, and merged when ready.
That's it: one long-lived branch and many short-lived (and ideally small) feature
branches. To work well, this model depends on a few prerequisites:
* Continuous Integration that runs a robust test suite to ensure that feature
branches are safe to merge.
* The ability to rapidly deploy, with zero downtime, code changes to the production
environment. Similarly, an ability to rapidly revert individual changes in response
to an incident.
* A culture that is disciplined about small and short-lived feature branches. The
GitHub Flow model loses its efficiency and simplicity if feature branches become
large, long-lived, and unwieldy. As discussed in section 3.3.5, Feature Branch
Environments, small commits, small branches, and small pull requests are key
drivers of productivity.
The key to maintaining a smooth system of branches and merges with your team is to
keep branches short-lived. Nearly all of the problems associated with code merging
come from code branches being open too long or the branch containing too large a
diff (ctohb.com/diffs). In general, a short-lived branch should be open just a few
days, or two weeks at the absolute most.
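Branch age is also easy to monitor. The sketch below classifies branches against that guideline, treating "a few days" as three days (an assumption) and two weeks as the hard ceiling. Git does not record when a branch was created, so the `opened_at` timestamps are assumed to come from elsewhere, such as pull request creation times on your Git host.

```python
"""Flag feature branches that are outliving the short-lived guideline."""
from datetime import datetime, timedelta

TARGET_AGE = timedelta(days=3)  # "just a few days": assumed value
MAX_AGE = timedelta(days=14)    # "two weeks at the absolute most"


def branch_status(opened_at: datetime, now: datetime) -> str:
    """Classify one branch by how long it has been open."""
    age = now - opened_at
    if age > MAX_AGE:
        return "stale"
    if age > TARGET_AGE:
        return "aging"
    return "ok"


def report(branches: dict[str, datetime], now: datetime) -> dict[str, list[str]]:
    """Group branch names by status (names sorted for stable output).

    Use either all naive or all timezone-aware datetimes; mixing them raises.
    """
    groups: dict[str, list[str]] = {"ok": [], "aging": [], "stale": []}
    for name in sorted(branches):
        groups[branch_status(branches[name], now)].append(name)
    return groups
```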
With a bit of thought and practice, most implementations can be broken down into
independently mergeable pieces. This is a skill that, with your guidance, teams can
develop over time. Keeping branches short-lived and small:
* Limits the amount of time for new code from other branches to be merged into
trunk, thus limiting the scope for code conflicts. Smaller branches also inherently
have less surface area for conflicts.
* Keeps the feature branch code relatively small, thus making it easier for
reviewers to read and limiting the scope for breakages.
* Encourages faster feedback in reviews, and allows for course corrections sooner
in the process of implementing features.
## Production Escalations
Before you implement an escalator and set up a rotation, make sure the engineers on
your team have opted in to being on rotation, and that everyone knows and
understands expectations for creating exceptions (e.g., trading an on-call window
with somebody else during a vacation).
You'll also want to ensure that you have adequate documentation in place, and that
everyone understands the standard procedures for what to do when receiving a page.
Some considerations for establishing these procedures:
* Note where the recipient should post an acknowledgment of receipt of the page
(maybe in the escalator tool itself or a shared group chat dedicated specifically
to handling escalations).
* Enable easy access to the playbooks that are used to help diagnose particular
kinds of problems.
* Determine whether to, and where to, set up any kind of site down notice (e.g., a
company status page needs updating).
* Decide where and how often to post updates on the status of the investigation,
impact estimate, and restoration estimate.
* Determine what to do once an incident is closed, such as scheduling a root cause
analysis exercise and ensuring the particular incident does not recur.
Any time there is a system issue that has measurable user impact, your team should
perform some level of root cause analysis (RCA). The goal of
an RCA is to understand where your systems had a failure that allowed an impactful
defect to make it to production and to end users.
To be crystal clear, the root cause analysis *must not* be about identifying fault
or assigning blame. That needs to be true in every part of the RCA process and
embedded into the culture of your team. The RCA attacks systemic problems (not
human errors) in your system that allow a failure to occur.
Without that safety and willingness for team members to be forthright with their
feedback and documentation, you'll miss out on key opportunities to improve the
system.
Your team should produce documentation in some form for every RCA. Depending on how
often issues occur with your system, and the nature of those issues, you may wish
to create a classification system for RCAs, with low-impact incidents getting a
lighter-weight RCA process than high-impact incidents. It should be acknowledged
that a thorough RCA on a high-impact incident is an expensive effort, taking
considerable time and thoughtfulness, and that it may prove too heavy-handed for
trivial defects.
That said, for most companies it's better to err on the side of overspending in
this area and ensuring greater reliability. You should start with a thorough RCA on
everything, and transition to a stratified RCA system once you've got a good
understanding of the landscape and impact of the kinds of issues your team will
face.
For issues that merit a full, thoughtful analysis, here is a template that will get
you started and asking your team the right questions: ctohb.com/rca. It is good
practice (and, in fact, a requirement of most compliance frameworks) to create a new
document like this for every incident and to organize them in an internal company
document store for later reference.
The RCA lead should complete an initial draft of the RCA and circulate it to relevant
peers before scheduling a time as a group to explore and try to improve the analysis
and future prevention steps.
The meeting attendees should read the RCA draft in advance and come prepared to
explore the nuts and bolts of the incident and ideate on future prevention steps.
The RCA lead need not necessarily be the person who responded to the incident. The
ideal RCA lead should be someone who is very familiar with the systems involved and
can ask insightful questions about where tools and processes failed and generate
ideas for improvement.
Note that we're not throwing anyone who made a human error under the bus. That
person may be the RCA lead if they fit the prior criteria, but their error does not
on its own make them the right person to lead the RCA. They should certainly
contribute and take the opportunity to learn through the process. But again, they
are not punished for their mistake as part of the process. Authoring an RCA is not
a punishment; it's an important responsibility and element of system maintenance.
A good RCA process will often identify many work items for the team to improve the
system and make future incidents less likely. The natural next question is: do we
do them now? For the engineers involved, the answer is likely yes; for a manager
concerned about hitting deadlines and a roadmap, the answer will be less clear.
There is no one right answer to the question, but here is some general guidance:
Never let a good crisis go to waste. Motivation to remediate issues will be at its
peak around the incident and the RCA meeting, and highly motivated engineers are
often most efficient. It's also easy to underestimate the overall cost to your team
of system reliability issues and thus underprioritize reliability improvements.
The fact that a production incident occurred should remind you and your team that
these investments are critical to limiting distractions and enabling teams to focus
on productive feature work and delivering consistent high velocity.
The level of effort for many remedial issues is likely to vary widely. Some typical
tickets might be "add more logging" or "change a setting in our CI provider to ensure
PRs with failing builds cannot be merged." These types of trivial tickets cost more
to maintain and groom in a backlog than they'd take just to do in the moment, so
just do them. The chances they are the wrong thing to do are pretty low, and if
negative consequences result, they can be easily reversed.
For high-effort remediation steps, I encourage you to triage those and put them
through your regular planning process. Often, high-effort remediation steps can be
simplified with the benefit of time and planning. Said another way, the solution
identified as the right way on day one may not be the ideal solution, and only
by putting the issue through the regular paces of technical scrutiny can a better,
perhaps less costly, solution emerge.
## IT
IT usually comprises tools like company hardware (desktops, laptops, and phones),
VPNs, email, antivirus and monitoring software, etc. As a startup in the modern
world, whether you're an in-person or remote team, if you make a few wise
decisions, you should not need to spend very much time or capital on IT.
Some key decisions that will help you minimize IT cost at most small tech
companies:
Use a cloud-based system for company email, data, and documents. Most startups are
using Google Workspace, but if your team members (and prospective future hires) are
more comfortable with an alternative, go with that. There's no benefit at this
stage in setting up your own in-house mail server, document storage, data access,
networking, etc.
Even following best practices to minimize IT effort, you'll still have some IT
tasks you cannot avoid, primarily around activating and deactivating user accounts
and password recovery for employees. I encourage you to document for and train
other coworkers, perhaps in HR, in how to do these tasks so they do not interrupt
you or the engineering team on a regular basis.
In this section, I will provide a brief overview of the subject of security and
compliance for startups. You can and should put in the effort to find in-depth
resources beyond this book on these topics.
Especially with security, it's important to be precise and exact with language.
Some definitions of commonly misused terms:
**Authentication, or AuthN**: Validating that a user or client is who they say they
are. Your login system performs user authentication.
Startups are often defined by the extent to which they are resource-constrained. As
a result, security posture and compliance are often the first things deprioritized
on the to-do list, as they are less likely to represent an existential threat to
the business than other pressing concerns. If you have no users or revenue, what is
there for a hacker to steal?
Taking security into account can also become a drag on productivity or an expensive
task, especially if your mission is to secure a system that already exists. But if
you're starting from day one, you have the opportunity to make good decisions at
the start that create a strong security posture with minimal additional cost.
Some ways to incorporate security at your startup that won't cost you much:
* Establish security as a priority in the mindset of your team in your onboarding
and training materials.
* Enroll all engineers in onboarding and recurring basic security training, such as
the OWASP Top Ten or gamified security training that takes a few minutes a month,
to keep security top of mind.
* Don't waste time building a login page yourself; in 2023, there's really no reason
to. Tools like Auth0, SuperTokens, and AWS Cognito provide secure user signup,
login, social login, forgotten password management, email authentication, two-
factor authentication, and session management. Some of these tools also offer
robust authorization systems. Dealing with auth is a substantial project; it's very
complex and mistakes are expensive. There's no reason your startup needs to solve
that problem.
* Don't be lazy about IT security. Regardless of whether you're using Dropbox, Box,
Google Drive, SharePoint, etc., take a few minutes and set policies to help avoid
human error, such as defaulting sharing permissions to internal only. Set up
regular data-sharing reports and appoint an employee to do a quarterly audit of
permissions settings on any particularly sensitive documents or spreadsheets.
* Use an enterprise password management solution, such as 1Password, and ensure all
employees are using robust passwords for important tools. Similarly, use Single
Sign-On (SSO) as often as possible and ensure your SSO provider is configured with
high security (at least requiring Multi-Factor Authentication).
* Don't commit secrets in your codebase. Leverage a secure secret manager such as
Google Cloud Secret Manager or AWS Secrets Manager, and commit the name/location of
a secret in code and resolve that name to a value in production, either at bootup
time using a tool like Berglas or Whisper, or at runtime directly with the secret
manager APIs.
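The boot-time pattern in the last bullet can be sketched in a few lines of Python. The `secret://` prefix and the `fetch` callable are hypothetical conventions for this example; in production, `fetch` would wrap your secret manager's client library, or a tool like Berglas would handle this resolution step for you. Only the reference (the secret's name) is committed; the value never appears in code.

```python
"""Resolve secret references in configuration at boot time (hypothetical scheme)."""
from collections.abc import Callable, Mapping

SECRET_PREFIX = "secret://"  # hypothetical marker for "this is a secret's name"


def resolve_secrets(env: Mapping[str, str], fetch: Callable[[str], str]) -> dict[str, str]:
    """Return a copy of env with every secret reference replaced by its value."""
    resolved = {}
    for key, value in env.items():
        if value.startswith(SECRET_PREFIX):
            # Only the name lives in config; the value comes from the manager.
            resolved[key] = fetch(value[len(SECRET_PREFIX):])
        else:
            resolved[key] = value
    return resolved
```

If `fetch` raises for an unknown name, the service fails at startup rather than running with a missing credential, which is usually the behavior you want.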
### Compliance
Whether it's due to the industry you are in, the size of your business, or the
nature of your customers, most startups need to comply with at least one formal
compliance framework. If your users are in Europe, then you need to comply with
GDPR. If you're taking in user data, it's wise to understand the CCPA. If you're
working with enterprise clients, you'll be asked for your SOC 2 or ISO 27001
certification. In healthcare, you've got HIPAA, and if you're in payments, you've
likely heard of PCI DSS.
For a startup, staying in compliance with any or all of these frameworks can be
unacceptably expensive. Here are some tips for staying compliant and anticipating
the cost:
* Don't try to get a compliance certificate at the last minute. Preparing for and
completing an audit, such as for PCI DSS or SOC 2, is a lengthy process that takes
six to twelve months for most startups. Starting early and maintaining compliance
is cheaper than starting late and doing rework.
# Conclusion
You've put together a great hiring process, the team is happy, you're running
sprints like a pro, and your architecture is withstanding the growing demands of
the business. That feels good, but how do you know if it's enough? How do you
measure your own success and performance as a technical leader or CTO?
One way to look at defining greatness in this role might be from a CFO's
perspective: how efficiently can a CTO deploy an R&D budget and convert that into
engineering and product output?
Or perhaps one might look at it from the CEO's viewpoint: how quickly can the CTO's
team deliver on the business's objectives?
Or, given how important people leadership is to excelling in this role, we could
view it through a humanistic lens: is your team doing their best work? After all, a
great CTO's mission is to build an organizational culture that allows individual
engineers to do their best work and achieve the impossible with technology.
Or, rather than trying to define a single objective, maybe the best definition of
CTO greatness is a sum of all the skills that a CTO might exercise on a daily
basis. Perhaps great is when you take the sum of architecture, performance
management, vendor management, executive leadership, cultural contributions, public
evangelism, mentorship, and DevOps, put it through a formula, and you end up with a
number bigger than 42.
Try as we might, it seems that great leadership, even great *technical* leadership,
isn't something we can precisely quantify or measure. Smart minds may never agree
on a common description of greatness, but we can all agree that the role is
diverse and ever-changing, requiring constant learning and adaptation.
There are few universal truths in engineering leadership, but one of them is that
becoming a good engineering leader is a never-ending journey of self-improvement,
discovery, and growth. Proceeding down this path requires humility, willingness to
make mistakes, and, above all, curiosity and a desire to learn.
I hope this handbook has been a helpful reference as you face the challenges of
your leadership journey. It covers many of the challenges that I have faced over
the years, as well as those of the many wonderful leaders I've had the pleasure of
interacting with.
I've done my best to provide some structure for meeting those challenges, though
every situation is unique; ultimately, the path you take is yours to devise and
the results are yours to own.
At some point in life, one gets asked, "What advice would you give to the younger
version of yourself?" This handbook is my answer to that question.
I hope it helps you in your journey to build powerful technology, motivated and
empowered teams, and successful businesses, and, most of all, have fun and do some
good for the world.
# Book References
*Scrum: The Art of Doing Twice the Work in Half the Time* by Jeff Sutherland
## Digital References
**FrequencyReducesDifficulty** (ctohb.com/fowler):
https://martinfowler.com/bliki/FrequencyReducesDifficulty.html
# Glossary
**Agile ceremony**: Agile ceremonies are meetings where a development team comes
together at various stages of the development process to plan future work,
communicate ongoing work, or review and reflect on past work.
**Boy Scout Rule**: Leave things better than you found them. As applied to a
technical team, whenever you work in an area of code, always make even a small
improvement, maybe to tests, or documentation, or otherwise improve clarity,
readability or maintainability.
**Context switching**: Changing from one task to another. For engineers, that means
setting aside the problem being worked on to start work on another. The act of
switching is generally time-consuming and less efficient than working on one
problem at a time.
**Direct report**: Direct reports are employees who report directly to someone who
is above them in the organization chart, often a manager, supervisor, or team
leader.
**Key Performance Indicator (KPI)**: KPIs are the critical quantifiable indicators
of progress toward an intended result. Sometimes referred to as input metrics.
**Objectives and Key Results (OKRs)**: Objectives and key results is a goal-setting
framework, originating at Intel in the 1970s, used by individuals, teams, and
organizations to define measurable goals and track their outcomes.
**Root Cause Analysis (RCA)**: An approach, and generally a document, that attempts
to dig below the surface to truly understand why an event occurred. In software
projects, it is generally used first as an investigative tool and then as reference
documentation explaining why an incident occurred.
**Standup meeting**: (aka Daily Scrum) A regular meeting that is part of the
scrum/agile ceremonies. Generally kept short (under 30 minutes) to facilitate
communication, updates, conflict resolution, and decision-making within a team.
**Straw Man Model**: A first draft proposition that can be put together rapidly
with incomplete data. Often used as a starting place proposal with a team to
accelerate the process of collecting feedback and getting to a solution.
Zach Goldberg graduated from the University of Pennsylvania magna cum laude with a
degree in Computer Science and Engineering. He's been the CTO of six startups:
WiFast, Sticks and Brains, AutoLotto, Trellis Technologies, GrowFlow
(acq. Dama Financial, 2022), and Towards Equilibrium Inc, as well as an
Entrepreneur-in-Residence at Tencent and an Associate Product Manager at Google.
After Dama acquired GrowFlow in 2022, Zach sat down and poured all of his
experience into this book. Learn more about Zach's work at zachgoldberg.com.