Look Inside - The Software Engineers Guidebook
However, this changed a few years into my career when I was passed over for a promotion to
a senior engineer role which I thought I was ready for. Not only that but when I asked my
manager how I could get to that next level, they didn’t have any specific feedback. It was then
that I decided that if I ever did become a manager, I’d always offer team members useful
advice on how to grow.
It was when I was working at the ride-hailing app Uber that I became an engineering
manager. By then I was a seasoned engineer, but I still remembered my earlier promise to
myself. So, I did my best to help people on my team improve professionally and get the
promotions they deserved, and to give clear, actionable feedback when I thought colleagues
weren’t ready for the next level just yet.
As my team grew and I took on skip-level reports, I had less and less time to mentor
teammates in-depth. I also started to see patterns in the feedback I gave, so began to publish
blog posts of the advice I found myself giving repeatedly; about writing well, and doing good
code reviews. These posts were warmly received, and a lot more people than I expected read
and shared them with colleagues. This is when I began writing this book.
By year two of the writing process, I had a draft that could be ready to publish. However, at
that time I launched The Pragmatic Engineer Newsletter. The focus of this newsletter is
keeping the pulse of today’s tech market, plus regular deep dives into how well-known,
international companies operate, software engineering trends, and occasional interviews with
interesting tech people. Writing the newsletter made me realize just how many “gaps” were in
the book draft. The past two years have been spent rewriting and honing its contents, one
chapter at a time.
After four years of writing, I can say with conviction that “The Software Engineer’s Guidebook”
and The Pragmatic Engineer Newsletter are complementary resources. This is despite the fact
there is very little overlap in their contents.
Writing this book helped me kick off the newsletter because it was obvious there are plenty of
timely software engineering topics to write about, which would make little sense to cover in a
book with a longer lifespan than a weekly newsletter. The newsletter has helped me improve
the book; I’ve learned lots about interesting trends and new tools that feel like they are here
to stay for a decade or longer, such as AI coding tools, cloud development environments, and
developer portals. These technologies are referenced in this book in much less detail than
you will find in the newsletter.
I hope you discover useful ideas in this book, which serve you well for years to come.
Introduction
This is the book I wish I could have read early in my career as a software developer;
especially when I joined a larger tech company for a healthy pay rise, and found a very
different engineering culture with surprisingly little guidance for navigating my new
environment.
This book follows the structure of a “typical” career path for a software engineer, from
starting out as a fresh-faced software developer, through being a role model senior/lead, all
the way to the staff/principal/distinguished level. It summarizes what I’ve learned as a
developer and how I’ve approached coaching engineers at different stages of their careers.
We cover “soft” skills which become increasingly important as your seniority increases, and
the “hard” parts of the job, like software engineering concepts and approaches which help
you grow professionally.
The names of levels and their expectations can – and do! – vary across companies. The
higher the “tier” of a business, the more tends to be expected of engineers, compared to
lower-tier places. For example, the “senior engineer” level has notoriously high expectations at
Google (L5 level) and Meta (E5 level), compared to lower-tier companies. If you work at a
higher-tier business, it may be useful to read the chapters about higher levels, and not only
the level you’re currently interested in.
Naming and levels vary, but the principles of what makes a great engineer who is impactful at
the individual, team, and organizational levels are remarkably constant. No matter where you
are in your career, I hope this book provides a fresh perspective and new ideas on how to
grow as an engineer.
● Part 1: Developer Career Fundamentals
● Part 2: The Competent Software Developer
● Part 3: The Well-Rounded Senior Engineer
● Part 4: The Pragmatic Tech Lead
● Part 5: Role Model Staff and Principal Engineers
● Part 6: Conclusion
Parts 1 and 6 apply to all engineering levels, from entry-level software developer to
principal-and-above engineer. Parts 2, 3, 4, and 5 cover increasingly senior engineering levels
and group together topics in chapters, such as “Software Engineering,” “Collaboration,”
“Getting Things Done,” etc.
This is a reference book you can return to as you grow in your career. I suggest focusing on
topics you struggle with, or the career level you are aiming for. Keep in mind that expectations
can vary greatly between companies.
In this book, I’ve aligned the topics and leveling definitions to expectations at Big Tech and
scaleups. However, some topics which we dive deeper into later in the book are also useful at
lower career levels. For example, in Part 5: “Reliable Software Systems,” we cover
logging, monitoring, and oncall in depth, but it’s useful – and often necessary! – to know
about these practices below the staff engineer level. I suggest using the table of contents by topic,
as well as by level, when deciding which chapters to prioritize.
Software engineering starts with coding, and ends with practices which guarantee the
long-term maintainability and extensibility of the systems you build. In this chapter, we cover
areas which well-rounded, senior engineers are competent in:
However, an effective engineer doesn’t stop at being proficient in a few technologies; they
continue to broaden their knowledge of frameworks, languages and platforms.
When you know a programming language, learning another one is much easier. This is
because most languages are pretty similar – at least on the surface. For example, if you know
JavaScript, then learning TypeScript begins easily enough. Likewise, knowing Swift means
you can understand a lot of Java, Kotlin or C#, just by reading them.
Of course, each language has its own syntax, idiosyncrasies, strengths and weaknesses. You
discover all these details by using the language, and comparing it to others you already know
well enough.
Your first – or even second – programming language is most likely an imperative one.
Learning additional imperative languages is useful, but picking a different type of language
will help you grow more as a professional.
Imperative, declarative and functional languages each require different ways of thinking. It
can be challenging to switch from an imperative language to a functional or declarative one,
but you expand your understanding and “toolkit” by doing so.
After you master a language from each category, you will have little trouble picking up more
languages. This is because there are more fundamental differences between an imperative and
a functional language, like Go and Elixir, than between two imperative or two functional
languages, such as Go and Ruby, or Elixir and Haskell.
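To make the difference in thinking concrete, here is a minimal sketch of the same task written twice in Python: once as an imperative loop that mutates a running total, and once in a functional style that composes filter, map and reduce without mutation. Python is used purely for illustration (it is not one of the languages named above), and the order data and function names are made up for this example.

```python
from functools import reduce


def total_paid_imperative(orders):
    """Imperative style: describe each step and mutate a running total."""
    total = 0
    for order in orders:
        if order["paid"]:
            total += order["amount"]
    return total


def total_paid_functional(orders):
    """Functional style: compose filter, map and reduce; nothing is mutated."""
    paid_orders = filter(lambda order: order["paid"], orders)
    amounts = map(lambda order: order["amount"], paid_orders)
    return reduce(lambda acc, amount: acc + amount, amounts, 0)


# Hypothetical sample data, for illustration only.
sample_orders = [
    {"amount": 10, "paid": True},
    {"amount": 25, "paid": False},
    {"amount": 20, "paid": True},
]

print(total_paid_imperative(sample_orders))
print(total_paid_functional(sample_orders))
```

Both functions return the same result; what changes is whether you think in terms of steps and state, or in terms of transformations applied to data.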
● Backend
● Frontend
● Mobile
● Embedded platforms
When your team is building a new feature or solving a problem, there’s a good chance work
will take place across platforms. For example, shipping a new payment flow will surely mean
changes on the backend, the frontend, and perhaps even the mobile side. Debugging in the
mobile app will mean investigating whether the issue derives from the mobile business logic,
the backend, or perhaps at the intersection of backend APIs and the mobile business logic
parsing the API response.
If you have no idea what happens on neighboring stacks, you’ll have trouble debugging more
complex, full stack issues, and leading projects to build and ship full stack features.
A well-rounded senior engineer can take any problem and figure out how to break it down
between different platforms. To do this, expertise in your domain is needed, along with
enough competence in other domains.
● Get access to other platforms’ codebases. For example, if the team you work on
owns mobile, web, and backend codebases, then get access to those which aren’t
your “primary” platforms. If you’re a backend engineer, check out the web and mobile
codebases and set them up for compiling, running tests and deploying them locally on
your machine.
● Read code reviews by team members on other platforms. Follow along with code
reviews, by reviewing those on other platforms, or by asking to be added as a
non-blocking reviewer. Reading code is much easier than writing it, and most code
changes are related to business logic, so you should have little trouble understanding
the intentions of changes. You might even be able to spot business logic issues, or
missing business logic test cases!
● Volunteer for small tasks on the other platform. The best way to get more familiar
with another platform is to work with it. Pick up a non-urgent, unimportant task you can
complete at your own pace. Ask advice from other engineers on the team.
● Pair with an engineer working on another stack. Pair programming is an efficient way
to pick up a new stack. Ask to pair with someone who is more experienced on the
stack you’d like to pick up; you’ll speed up the learning process. You could start by
shadowing this person – and as you become more hands on, ask to lead the session
and for the other person to give feedback on your approach.
● Do an “exchange month” of working on another platform. An even better way to
learn more intensively is to switch platforms for a period of time. This could be a few
weeks, or months. The downside is that your velocity will drop in the short term, as
you’ll be learning the basics of another platform. However, in the mid to long term,
your velocity will increase as you’ll have the expertise and tools to unblock yourself.
AI helpers can make the transition quicker
AI helpers can aid the transition between languages. With tools like ChatGPT, Bard, GitHub
Copilot, and other AI assistants, it’s much easier to pick up new programming languages.
These assistants can do things like:
● Translate a piece of code from one language to another
● Summarize how functions and variables are declared in a language
● Summarize differences between two languages
Keep in mind that many AI assistants suffer from hallucination: they sometimes make up
things that aren’t true. Therefore, it’s necessary to verify their output. But for the purpose of
getting familiar with a new language, AI assistants are helpful and can speed up the learning
process.
2. DEBUGGING
The difference between a senior and a non-senior engineer is pretty clear in debugging and
tracking down difficult bugs. More experienced engineers tend to debug faster, and pinpoint
root causes of more challenging problems – seemingly with ease. They also have a better
sense of where the issue might come from, and where to get started in debugging and
resolving it. How do they do this?
Part of it is practice and expertise. The longer you write code, the more often you come
across unexpected edge cases and bugs, and so you start to build a “toolkit” of the potential
root causes of problems.
Over time, you also expand your debugging toolkit. In Part 2: “Software development,” we
touch on how to get better at debugging, covering:
● Get to know your debugging tools
● Know how to debug without tools
● Familiarize yourself with advanced debugging tools
The ability to debug efficiently tends to set experienced and less experienced engineers
apart. Below are more approaches for improving at debugging.
Finding the right dashboards and logging portals can be especially challenging at companies
where teams own many services, and each service logs in a different way, records
information in different systems, or uses a different logging format.
Upon joining a company, make it a priority to learn where the production logs are stored, and
where to find systems’ health dashboards. These might live in systems like Datadog,
Sentry, Splunk, New Relic, or Sumo Logic, or within in-house systems built on top of the likes
of Prometheus, ClickHouse, Grafana, or other custom solutions. And they might be in a mix of
places. Figure out where they are, get access, and learn how to query them. Do this for
systems your team owns, and also related systems which you interact with.
Draw up architecture diagrams based on reading the code, and ask people on your team to
confirm if your understanding is correct. Get to the point where you know which part of the
code owns what functionality.
With large codebases, it’s good to understand their structure and how to find relevant
parts. At larger companies, codebases of well over 1M lines, built by hundreds
of engineers, are common. It’s unrealistic to deeply understand a codebase of this size, but it is reasonable
to aim for a broad understanding, so you can go deep into the parts of it you need to work on.
At companies which use monorepos, get a sense of their structure and what different parts of
the monorepo are responsible for. How are various parts of the system built? How are tests
run?
At companies using standalone repositories, seek access to these. Aim to understand how
the systems relating to your team work at a high level. It’s a good exercise to check some of these
out, build them, run tests, and run the service or feature locally.
Find out how to search the whole codebase, and learn useful shortcuts. Most companies
have some kind of “global code search.” This might be a custom, in-house solution, or a
vendor like GitHub’s code search, or Sourcegraph. Find out how to use the global code
search tool and which features it supports. For example, how can you search a specific folder
of the code? How can you search for test cases? What about searching only the codebase
which your team owns?
Even at large companies where engineers can access most of the codebase, there are some
parts of the codebase which may be off limits. This is often for compliance, regulatory or
confidentiality reasons. In most cases, it should make no real difference to your day-to-day
work. But if it slows you down, you could ask for access.
If you work at a company with a dedicated infrastructure team, it can be tempting to skip the
learning process and turn to the infra team, when you suspect an infrastructure issue.
However, this approach will ultimately slow you down. Besides, learning how infrastructure
works under the hood is not only interesting in itself; this depth of understanding is table
stakes for well-rounded senior engineers.
Debugging outages requires learning to access and analyze production logs, locating the
code responsible for certain business logic, making changes to the code, validating changes,
and rolling them out. All of this happens in urgent situations when timely action matters.
There are ways to improve debugging skills for outages other than waiting for a bug to strike
your system. Check out postmortems of former outages, if your company publishes them. As
you read, try to “debug” by locating the logs which pinpoint issues, and finding the code
behind the outage. Researching historical outages is a great way to learn about new
dashboards and systems you don’t know well, and to discover new outage mitigation steps.
This is the end of the chapter excerpt. In the book, the chapter continues with the sections:
● 3. Tech debt
● 4. Documentation
● 5. Scaling best practices across a team
Part IV: Stakeholder management (Chapter 18)
This excerpt covers 2 out of the 6 sections from Chapter 18 in the book.
Stakeholders are people and groups with an interest in a project’s outcome. Internally, they
may be product folks, the legal team, engineering teams, or any other business unit.
Stakeholders can also be external to the company, in the form of users, customers, vendors,
regulatory bodies, and others.
The best time to figure out the key stakeholders in your project is as soon as possible. The
worst time is when you are ready to ship, as an important-enough person could then appear
seemingly from nowhere and take a proper look at your project for the first time, and declare
major changes are needed. In this case, it would have been better to consult this key
stakeholder earlier.
In this chapter, we cover ways of identifying stakeholders and working with them.
I worked on a project with several teams involved, in which the project lead sent weekly,
pages-long status updates to all team members and posted updates on chat. Yet it felt like
everyone was pulling in different directions, and it was unclear what the real focus was,
beyond finishing our assigned engineering task. In the end, the project seemed like a failure
and left a sour taste in everyone’s mouths.
On another, similarly complex project, the goal was much clearer and updates were sparser,
but the project felt more united. And when we shipped, the business stakeholders surprised
the engineering team with a bottle of champagne as a thank you. The difference between this
project and the previous one? The project lead communicated much more with product folks
and business stakeholders.
Bucketing stakeholders into one of the following categories can be a useful mental model. For
this, visualize a flowing river, with teams building dams in different places, both downstream
and upstream, from your team.
Upstream and downstream dependencies, visualized.
● Upstream dependencies are teams whose work you depend on. They must do a
specific task in order for your team to do its work, and for the project to get done.
● Downstream dependencies are teams which depend on your work. Downstream
teams come after yours, meaning your work must be done before they can complete
their part of a project.
● Strategic stakeholders are people or teams you want to keep in the loop, who can
often help unblock upstream dependencies.
This categorization helps make it clearer which stakeholders to communicate with in certain
situations. For example:
● When making a change to one of your APIs, communicate this change to downstream
dependencies which depend on this API.
● When you need to use an API which another team owns, they are an upstream
dependency. Reach out to them and confirm that the API will not have any major
changes, and that they’re aware you’re building a new feature on top of it.
● A country marketing manager could have a special interest in your project because
they want to launch a campaign when the feature rolls out. They’re a strategic
stakeholder, so add this person to update emails, and keep them in the loop in case of
any delays.
In another case, I observed an engineering team work for a month on a project, only for the
legal department to intervene and block it, unexpectedly. Legal had not been in the loop,
even though they should have been. They reviewed the proposed changes, said the project
was too risky to ship and wouldn’t budge from this judgment, so the project was canceled.
The engineering team would have saved themselves much wasted work had they involved
the legal team earlier.
So, how do you find out who your stakeholders are? This question is especially relevant at
large companies with dozens – or hundreds – of engineering teams, and large numbers of
product/design/data and business folks. There’s the hard way, as detailed above, and there’s
the easy way:
Just ask! Consult people who definitely are stakeholders about who else could be a
stakeholder. For example:
This is the end of the chapter excerpt. In the book, the chapter continues with the sections:
● 3. Figuring out who your stakeholders are
● 4. Keeping them in the loop
● 5. Problematic stakeholders
● 6. Learning from stakeholders
Part V: Reliable Software Systems (Chapter 24)
This excerpt covers 2 out of the 7 sections from Chapter 24 in the book.
There’s a fair chance your organization implicitly or explicitly expects staff+ engineers to lead
efforts to make systems more reliable.
In this chapter, we cover common approaches for building and maintaining reliable systems,
including:
1. Owning reliability
2. Logging
3. Monitoring
4. Alerting
5. Oncall
6. Incident management
7. Building resilient systems
An OKR is often a helpful way to improve the reliability of systems. For example, you can
capture objectives to make systems more reliable, performant, and efficient. Then you can
define measurable key performance indicators (KPIs), such as:
You almost always need to partner with engineering managers to move the needle on
reliability. At the end of the day, engineering managers are responsible and accountable for
the performance of their teams and reliability of their systems. However, as a staff+ engineer,
you possess the skills to recognize when reliability is a problem, and to employ various
approaches to improve this. You can – and should! – bring data to engineering managers to
highlight why it’s important to invest in reliability, and what the return of this investment would
be.
We covered more on OKRs and KPIs in Part 5: “Understanding the Business.”
2. LOGGING
Before we dive into logging approaches, let’s put the record straight about why it matters.
Logs are meant to help an engineering team debug production issues, by capturing information
which would otherwise be missing, for future reference during troubleshooting.
Which logging strategy can help your team debug its production issues? Well, this depends
on your application, platform, and business environment.
There’s a logging toolset that can be helpful when deciding how and what to log:
● Log levels. Most logging tools provide ways to log at various levels, such as
“debug,” “info,” “warning,” and “error.” These are levels that can be used when filtering
logs. How they’re used depends on your environment and team practices.
● Log structure. Which details do logs capture, are local variables logged, do logs
capture timestamps – down to milliseconds or nanoseconds – to make it easy to tell
which one of two logging events happened first? Do these timestamps include
timezones?
● Automated logging. Which parts of the system log automatically, so logging isn’t
dependent on an engineer remembering to do it?
● Log retention. How long are logs retained on client devices, and for how long are they
retained on the backend? Retaining logs for longer can be useful, but takes up space and could
end up costing more in data storage.
● Toggling logging levels. For applications, it’s common practice to have “debug builds”
where all log levels are outputted, but only warning or error log levels are logged on a
production build. The details depend on platform-level implementation and team
practices.
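As an illustration of log levels, timestamps with timezones, and toggling levels per build, here is a minimal sketch using Python’s standard logging module. The logger name “payments” and the APP_DEBUG_BUILD environment variable are assumptions made up for this example; your platform will have its own way to tell a debug build from a production one.

```python
import logging
import os
from datetime import datetime, timezone


class UtcMillisFormatter(logging.Formatter):
    """Renders timestamps as ISO 8601 in UTC, with millisecond precision."""

    def formatTime(self, record, datefmt=None):
        moment = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return moment.isoformat(timespec="milliseconds")


def build_logger(name: str, debug_build: bool) -> logging.Logger:
    """Debug builds output every level; production builds only WARNING and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug_build else logging.WARNING)
    logger.handlers.clear()  # avoid duplicate output if called more than once
    handler = logging.StreamHandler()
    handler.setFormatter(
        UtcMillisFormatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# APP_DEBUG_BUILD is a hypothetical flag, used here only for illustration.
is_debug_build = os.environ.get("APP_DEBUG_BUILD") == "1"
logger = build_logger("payments", debug_build=is_debug_build)

logger.debug("cart contents loaded")        # emitted in debug builds only
logger.warning("payment retry scheduled")   # emitted in both build types
```

Keeping the level decision in one place means the rest of the code can log freely at any level, and the build type decides what actually gets written.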
Putting a short logging guide together for the team is a matter of talking with a few engineers,
and empowering a team member to make a proposal – or doing it yourself. For logging
basics, agreeing on something is better than nothing, as long as the team knows it owns this
guide and can change it.
Events To Log
● Authentication/authorization decisions (including logoff)
● System access, data access
● System/application changes (especially privilege changes)
● Data changes: add/edit/delete
● Invalid input (possible badness/threats)
● Resources (RAM, Disk, CPU, Bandwidth, any other hard or soft limits)
● Health/availability: startups/shutdowns, faults/errors, delays, backups success/failure
● Timestamp & timezone (when)
● System, application, or component (where); IPs and contemporaneous DNS lookups
of involved parties; names/roles of systems involved (what servers are we talking to?),
name/role of local application (what is this server?)
● User (who)
● Action (what)
● Status (result)
● Priority (severity, importance, rank, level, etc)
● Reason
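The details listed above (when, where, who, what, status, priority, reason) map naturally onto a structured log entry. Below is a minimal sketch in Python which serializes one event as a single JSON line. The field names, the LogEvent class and the sample values are assumptions for illustration, and not a format prescribed by any particular logging tool.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LogEvent:
    timestamp: str  # when, including the timezone offset
    component: str  # where
    user: str       # who
    action: str     # what
    status: str     # result
    priority: str   # severity
    reason: str     # why it happened


def make_event(component: str, user: str, action: str, status: str,
               priority: str, reason: str,
               now: Optional[datetime] = None) -> LogEvent:
    """Builds an event stamped with a timezone-aware UTC time."""
    moment = now or datetime.now(timezone.utc)
    return LogEvent(
        timestamp=moment.isoformat(timespec="milliseconds"),
        component=component,
        user=user,
        action=action,
        status=status,
        priority=priority,
        reason=reason,
    )


def to_json_line(event: LogEvent) -> str:
    """One JSON object per line is easy to query in most log systems."""
    return json.dumps(asdict(event), sort_keys=True)


# Hypothetical example values.
event = make_event("auth-service", "user-123", "login", "denied",
                   "warning", "invalid password")
print(to_json_line(event))
```

Agreeing on a fixed set of field names like this is often most of what a team’s logging guide needs to settle.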
But why put another framework in place, just for logging? Creating a simple interface helps
abstract the underlying vendor in use, which could be especially relevant at larger companies
where vendors change and it’s helpful to make migrations far easier. It can also help analyze
logging usage in future. Of course, don’t build a new framework for its own sake; do it when it
solves the problem of ad-hoc, inconsistent logging, and unclear guidelines for which
frameworks to use.
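A simple interface of this kind can be very small. The sketch below, in Python, shows one possible shape: application code depends only on a LogSink protocol, and each vendor gets its own adapter behind it. All names here (LogSink, ConsoleSink, InMemorySink, charge_customer) are hypothetical and made up for this example; a real adapter would call whichever vendor SDK your company uses.

```python
from typing import Protocol


class LogSink(Protocol):
    """The only logging surface that application code is meant to call."""

    def log(self, level: str, message: str, **fields: object) -> None:
        ...


class ConsoleSink:
    """Simplest possible adapter: writes to standard output."""

    def log(self, level: str, message: str, **fields: object) -> None:
        print(level.upper(), message, fields)


class InMemorySink:
    """Stand-in for a vendor adapter; also handy in unit tests."""

    def __init__(self) -> None:
        self.records = []

    def log(self, level: str, message: str, **fields: object) -> None:
        self.records.append((level, message, fields))


def charge_customer(sink: LogSink, customer_id: str, amount_cents: int) -> None:
    """Business logic knows about LogSink, not about any vendor."""
    sink.log("info", "charge started",
             customer_id=customer_id, amount_cents=amount_cents)


charge_customer(ConsoleSink(), "customer-1", 500)
```

With this shape, changing vendors means writing one new adapter class, while call sites like charge_customer stay untouched.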
3. MONITORING
How can you tell if a system is healthy or not? The most reliable way is to monitor key
characteristics, and trigger an alert when a metric seems unhealthy.
● p50: the 50th percentile or median value. 50% of data points are below this number,
and 50% are above. This value represents the “average” use case pretty well.
● p95: the 95th percentile. 95% of data points are below this number, and the
worst-performing 5% are above it. This value is particularly important in
performance-monitoring scenarios because the worst-performing 5% of data points could
refer to power users.
● p99: the 99th percentile. Only 1% of customers or requests see longer times than this
number. It could be acceptable for this number to be an outlier in some use cases.
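To see how these percentiles behave, here is a minimal sketch in Python which computes them with the “nearest-rank” method. This is one of several common percentile definitions, and monitoring tools may interpolate differently. The latency values are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    data points at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
latencies_ms = [120, 80, 95, 110, 2000, 90, 100, 105, 85, 130]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

With this sample, the single 2,000 ms request does not move p50, but it does determine p95 and p99. This is why the median describes the typical case, while the higher percentiles surface the slow tail.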
Things to monitor
o what should you monitor? There are plenty of obvious choices which provide health
S
information about a system or app, including:
● Uptime. For what percentage of time is the system or app fully operational?
● CPU, memory, disk space. Monitoring resource usage can provide useful indicators
for when a service or app risks becoming unhealthy.
● Response times. How long does it take a system or app to respond? What is the
median, and what’s the experience of the slowest 5% of requests or users (p95), and
the slowest 1% (p99)?
● Error rates. How frequent are errors, such as exceptions thrown, 4XX responses on
HTTP services, and other error states? What percentage of all requests are errors?
● HTTP status code responses. If there is a spike in error codes like 5XX or 4XX, it
could indicate a problem.
● Latency metrics. What are the p50, p95 and p99 latencies of server responses?
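Two of the metrics above, uptime and error rate, are simple ratios. The sketch below shows one way to compute them in Python. The function names and sample numbers are assumptions for illustration; in practice your monitoring system usually calculates these for you from raw request and health-check data.

```python
def error_rate(status_codes, error_classes=(4, 5)):
    """Share of requests whose HTTP status is in the given classes
    (4XX and 5XX by default). Returns a value between 0.0 and 1.0."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code // 100 in error_classes)
    return errors / len(status_codes)


def uptime_percentage(total_seconds, downtime_seconds):
    """Percentage of the period in which the system was fully operational."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100 * (total_seconds - downtime_seconds) / total_seconds


# Hypothetical sample: four requests, and 864 seconds of downtime in one day.
print(error_rate([200, 200, 404, 500]))
print(error_rate([200, 200, 404, 500], error_classes=(5,)))
print(uptime_percentage(86_400, 864))
```

Passing error_classes=(5,) counts only server-side errors, which can be a useful split because a spike in 4XX responses often points at clients, while 5XX responses point at the service itself.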
For web apps and mobile apps, additional metrics are worth monitoring:
● Page load time. How long does the webpage take to load? How does this compare
across p50, p75 and p95?
● Core Web Vitals metrics. Google released “Web Vitals” in 2020, which are quality
signals to deliver a great user experience. These metrics can capture a more detailed
picture of web performance. The core signals are Largest Contentful Paint (LCP), First
Input Delay (FID), and Cumulative Layout Shift (CLS).
● Start-up time. How long does it take for the app to start? The longer this takes, the
more likely customer churn is.
● Crash rate. What percentage of sessions end with the app crashing?
● App bundle size. How does this change over time? This is important for apps because
a larger size could mean fewer users install it.
Business metrics tell the “real” story of how healthy apps or services are. The metrics above
are more generic and infrastructural; they indicate fundamental problems. However, the
above metrics can look good, and a service or app can still be unhealthy.
This is the end of the chapter excerpt. In the book, the chapter continues with the sections:
● 3. Monitoring (the second part of the section)
● 4. Alerting
● 5. Oncall
● 6. Incident management
● 7. Building resilient systems
Index
The index of the book gives a sense of the types of topics the book goes into. Below are the
first 3 index pages from the 7-page index.
Get the book here.