Look Inside - The Software Engineers Guidebook

This document provides an excerpt from a book titled "The Software Engineer's Guidebook" which aims to provide guidance to software engineers at various stages of their career. The excerpt summarizes two sections from Chapter 13 which discusses software engineering topics that well-rounded senior engineers should be competent in, including programming languages, platforms and domains, debugging, technical debt, documentation, and scaling best practices across a team. It emphasizes the importance for engineers to continue broadening their knowledge of technologies beyond just a few proficiencies and to learn an imperative, declarative and functional programming language in depth.

Look Inside: The Software Engineer’s Guidebook

Table of contents. The book is 413 pages long and consists of 27 chapters:

Part I: Developer Career Fundamentals
1. Career paths
2. Owning your career
3. Performance reviews
4. Promotions
5. Thriving in different environments
6. Switching jobs

Part II: The Competent Software Developer
7. Getting things done
8. Coding
9. Software development
10. Tools of the productive engineer

Part III: The Well-Rounded Senior Engineer
11. Getting things done
12. Collaboration and teamwork
13. Software engineering
14. Testing
15. Software architecture

Part IV: The Pragmatic Tech Lead
16. Project management
17. Shipping in production
18. Stakeholder management
19. Team structure
20. Team dynamics

Part V: Role-Model Staff and Principal Engineers
21. Understanding the business
22. Collaboration
23. Software engineering
24. Reliable software engineering
25. Software architecture

Part VI: Conclusion
26. Lifelong learning
27. Further reading



‭Preface‬
I’ve been a software engineer for about 10 years, and a manager for five more. During my first
few years as a developer, I received little to no professional guidance. But I didn’t mind, as I
assumed hard work would eventually lead to progress.

However, this changed a few years into my career when I was passed over for a promotion to
a senior engineer role which I thought I was ready for. Not only that, but when I asked my
manager how I could get to that next level, they didn’t have any specific feedback. It was then
that I decided that if I ever did become a manager, I’d always offer team members useful
advice on how to grow.

It was when I was working at the ride-hailing app Uber that I became an engineering
manager. By then I was a seasoned engineer, but I still remembered my earlier promise to
myself. So, I did my best to support people on my team to improve professionally and get the
promotions they deserved, and to give clear, actionable feedback when I thought colleagues
weren’t ready for the next level just yet.

As my team grew and I took on skip-level reports, I had less and less time to mentor
teammates in-depth. I also started to see patterns in the feedback I gave, so I began to publish
blog posts of the advice I found myself giving repeatedly, about writing well and doing good
code reviews. These posts were warmly received, and a lot more people than I expected read
and shared them with colleagues. This is when I began writing this book.

By year two of the writing process, I had a draft that could be ready to publish. However, at
that time I launched The Pragmatic Engineer Newsletter. The focus of this newsletter is
keeping the pulse of today’s tech market, plus regular deep dives into how well-known,
international companies operate, software engineering trends, and occasional interviews with
interesting tech people. Writing the newsletter made me realize just how many “gaps” were in
the book draft. The past two years have been spent rewriting and honing its contents, one
chapter at a time.

After four years of writing, I can say with conviction that “The Software Engineer’s Guidebook”
and The Pragmatic Engineer Newsletter are complementary resources. This is despite the fact
there is very little overlap in their contents.

Writing this book helped me kick off the newsletter because it was obvious there are plenty of
timely software engineering topics to write about, which would make little sense to cover in a
book with a longer lifespan than a weekly newsletter. The newsletter has helped me improve
the book; I’ve learned lots about interesting trends and new tools that feel like they are here
to stay for a decade or longer, such as AI coding tools, cloud development environments, and
developer portals. These technologies are referenced in this book in much less detail than
you will find in the newsletter.

I hope you discover useful ideas in this book, which serve you well for years to come.

Introduction
This is the book I wish I could have read early in my career as a software developer;
especially when I joined a larger tech company for a healthy pay rise, and found a very
different engineering culture with surprisingly little guidance for navigating my new
environment.

This book follows the structure of a “typical” career path for a software engineer, from
starting out as a fresh-faced software developer, through being a role model senior/lead, all
the way to the staff/principal/distinguished level. It summarizes what I’ve learned as a
developer and how I’ve approached coaching engineers at different stages of their careers.

We cover “soft” skills which become increasingly important as your seniority increases, and
the “hard” parts of the job, like software engineering concepts and approaches which help
you grow professionally.

The names of levels and their expectations can – and do! – vary across companies. The
higher the “tier” of a business, the more tends to be expected of engineers, compared to
lower-tier places. For example, the “senior engineer” level has notoriously high expectations at
Google (L5 level) and Meta (E5 level), compared to lower-tier companies. If you work at a
higher-tier business, it may be useful to read the chapters about higher levels, and not only
the level you’re currently interested in.

Naming and levels vary, but the principles of what makes a great engineer who is impactful at
the individual, team, and organizational levels are remarkably constant. No matter where you
are in your career, I hope this book provides a fresh perspective and new ideas on how to
grow as an engineer.

‭How to read this book‬


It is composed of six standalone parts, each made up of several chapters:

● Part 1: Developer Career Fundamentals
● Part 2: The Competent Software Developer
● Part 3: The Well-Rounded Senior Engineer
● Part 4: The Pragmatic Tech Lead
● Part 5: Role Model Staff and Principal Engineers
● Part 6: Conclusion

Parts 1 and 6 apply to all engineering levels, from entry-level software developer to
principal-and-above engineer. Parts 2, 3, 4, and 5 cover increasingly senior engineering levels
and group together topics in chapters, such as “Software Engineering,” “Collaboration,”
“Getting Things Done,” etc.

This is a reference book you can return to as you grow in your career. I suggest focusing on
topics you struggle with, or the career level you are aiming for. Keep in mind that expectations
can vary greatly between companies.

In this book, I’ve aligned the topics and leveling definitions to expectations at Big Tech and
scaleups. However, some topics which we dive deeper into later in the book are also useful at
lower career levels. For example, in Part 5: “Reliable Software Systems,” we cover
logging, monitoring, and oncall in-depth, but it’s useful – and often necessary! – to know
about these practices below the staff engineer level. I suggest using the table of contents by
topic, as well as by level, when deciding which chapters to prioritize.

‭And now, let’s jump in…‬

‭Part III: Software Engineering (Chapter 13)‬


This excerpt covers 2 out of the 5 sections from Chapter 13 in the book.

Software engineering starts with coding, and ends with practices which guarantee the
long-term maintainability and extensibility of the systems you build. In this chapter, we cover
areas which well-rounded, senior engineers are competent in:

1. Languages, platforms and domains
2. Debugging
3. Tech debt
4. Documentation
5. Scaling best practices across a team

‭1. LANGUAGES, PLATFORMS AND DOMAINS‬


It’s expected that a well-rounded senior engineer has a solid grasp of a few programming
languages and a few platforms (such as frontend, backend, iOS, Android, native desktop,
embedded, and so on), and mastery of at least one. We cover more on how to
master a language in Part 2, “Software Development.”

However, an effective engineer doesn’t stop at being proficient in a few technologies; they
continue to broaden their knowledge of frameworks, languages and platforms.

When you know a programming language, learning another one is much easier. This is
because most languages are pretty similar – at least on the surface. For example, if you know
JavaScript, then learning TypeScript begins easily enough. Likewise, knowing Swift means
you can understand a lot of Java, Kotlin or C#, just by reading them.

Of course, each language has its own syntax, idiosyncrasies, strengths and weaknesses. You
discover all these details by using the language, and comparing it to others you already know
well enough.

‭Learn an imperative, a declarative and a functional language in depth‬


There are three distinct types of programming language:
1. Imperative: the most common type of programming language, wherein the computer
is given step-by-step instructions on what to do, as a set of commands. For example:
“If X, then do this. Or else, do that.” C, C++, Go, Java, JavaScript, Swift, PHP, Python,
Ruby, Rust, TypeScript, and most object-oriented languages, are all imperative.
2. Declarative programming specifies the expected outcome of the program, but doesn’t
give instructions on how to achieve this. SQL, HTML and Prolog are examples.
3. Functional languages are a subset of declarative languages which are distinct enough
to merit their own category. These treat functions as first-class, meaning functions can
be passed as arguments to other functions, or returned as values. Examples include
Haskell, Lisp, Erlang, Elixir and F#. Functional languages tend to provide immutable
states and pure functions with no side effects.
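
To make the three categories more concrete, here is a minimal sketch of one small task, summing
the order amounts at or above a threshold, written in each style. Python is used purely for
illustration; the `orders` data, the function names, and the in-memory SQLite table are all
assumptions invented for this example. The declarative version embeds SQL, one of the
declarative languages named above.

```python
import sqlite3
from functools import reduce

orders = [120, 45, 300, 80]  # hypothetical order amounts


def total_large_orders_imperative(amounts, threshold):
    """Imperative: spell out each step and mutate a running total."""
    total = 0
    for amount in amounts:
        if amount >= threshold:
            total += amount
    return total


def total_large_orders_declarative(amounts, threshold):
    """Declarative: state the result wanted in SQL; the engine decides how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(a,) for a in amounts])
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE amount >= ?",
        (threshold,),
    ).fetchone()
    conn.close()
    return total


def total_large_orders_functional(amounts, threshold):
    """Functional: pass functions to other functions, with no mutable state."""
    large = filter(lambda amount: amount >= threshold, amounts)
    return reduce(lambda acc, amount: acc + amount, large, 0)


print(total_large_orders_imperative(orders, 100))   # 420
print(total_large_orders_declarative(orders, 100))  # 420
print(total_large_orders_functional(orders, 100))   # 420
```

All three functions return the same result; what differs is whether the code describes the
steps, the desired outcome, or a composition of functions.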

Your first – or even second – programming language is most likely an imperative one.
Learning additional imperative languages is useful, but picking a different type of language
will help you grow more as a professional.

Imperative, declarative and functional languages each require different ways of thinking. It
can be challenging to switch from an imperative language to a functional or declarative one,
but you expand your understanding and “toolkit” by doing so.

For example, functional programming is widely applied within imperative languages, because
following a functional model promotes immutable state. A good case is the Reactive
programming pattern, which takes functional programming ideas and offers a more functional
pattern to languages like Java (RxJava), Swift (RxSwift), C# (Rx.NET), Scala (RxScala) and
others.

After you master a language from each category, you will have little trouble picking up more
languages. This is because there are more fundamental differences between an imperative and
a functional language, like Go and Elixir, than between two imperative or two functional
languages, such as Go and Ruby, or Elixir and Haskell.

‭Get familiar with software development platforms‬


It’s common for a software engineer to specialize in a platform, like:

● Backend
● Frontend
● Mobile
● Embedded platforms

When your team is building a new feature or solving a problem, there’s a good chance work
will take place across platforms. For example, shipping a new payment flow will surely mean
changes on the backend, the frontend, and perhaps even the mobile side. Debugging in the
mobile app will mean investigating whether the issue derives from the mobile business logic,
the backend, or perhaps the intersection of backend APIs and the mobile business logic
parsing the API response.

If you have no idea what happens on neighboring stacks, you’ll have trouble debugging more
complex, full stack issues, and leading projects to build and ship full stack features.

‭Become more full stack‬


“Full stack engineering” is increasingly a baseline expectation of senior engineers across the
tech industry. This is because product folks and business stakeholders don’t really care about
the distinction between embedded, backend, and frontend/web. From their point of view, the
distinction is an engineering decision.

A well-rounded senior engineer can take any problem and figure out how to break it down
between different platforms. To do this, expertise in your domain is needed, along with
enough competence in other domains.

So, how do you build this understanding? There are plenty of approaches:

● Get access to other platforms’ codebases. For example, if the team you work on
owns mobile, web, and backend codebases, then get access to those which aren’t
your “primary” platforms. If you’re a backend engineer, check out the web and mobile
codebases and set them up for compiling, running tests and deploying them locally on
your machine.
● Read code reviews by team members on other platforms. Follow along with code
reviews, by reviewing those on other platforms, or by asking to be added as a
non-blocking reviewer. Reading code is much easier than writing it, and most code
changes are related to business logic, so you should have little trouble understanding
the intentions of changes. You might even be able to spot business logic issues, or
missing business logic test cases!
● Volunteer for small tasks on the other platform. The best way to get more familiar
with another platform is to work with it. Pick up a non-urgent, unimportant task you can
complete at your own pace. Ask advice from other engineers on the team.
● Pair with an engineer working on another stack. Pair programming is an efficient way
to pick up a new stack. Ask to pair with someone who is more experienced on the
stack you’d like to pick up; you’ll speed up the learning process. You could start by
shadowing this person – and as you become more hands on, ask to lead the session
and for the other person to give feedback on your approach.
● Do an “exchange month” of working on another platform. An even better way to
learn more intensively is to switch platforms for a period of time. This could be a few
weeks, or months. The downside is that your velocity will drop in the short term, as
you’ll be learning the basics of another platform. However, in the mid to long term,
your velocity will increase as you’ll have the expertise and tools to unblock yourself.

AI helpers can make the transition quicker

AI helpers can aid the transition between languages. With tools like ChatGPT, Bard, GitHub
Copilot, and other AI assistants, it’s much easier to pick up new programming languages.
These assistants can do things like:

● Translate a piece of code from one language to another
● Summarize how functions and variables are declared in a language
● Summarize differences between two languages

Keep in mind that many AI assistants suffer from hallucination: they sometimes make up
things that aren’t true. Therefore, it’s necessary to verify their output. But for the purpose of
getting familiar with a new language, AI assistants are helpful and can speed up the learning
process.

2. DEBUGGING

The difference between a senior and a non-senior engineer is pretty clear in debugging and
tracking down difficult bugs. More experienced engineers tend to debug faster, and pinpoint
root causes of more challenging problems – seemingly with ease. They also have a better
sense of where the issue might come from, and where to get started in debugging and
resolving it. How do they do this?

Part of it is practice and expertise. The longer you write code, the more often you come
across unexpected edge cases and bugs, and so you start to build a “toolkit” of the potential
root causes of problems.

Over time, you also expand your debugging toolkit. In Part 2: “Software development,” we
touch on how to get better at debugging, covering:

● Get to know your debugging tools
● Know how to debug without tools
● Familiarize yourself with advanced debugging tools

The ability to debug efficiently tends to set experienced and less experienced engineers
apart. Below are more approaches for improving at debugging.

‭Know which dashboards and logging systems to look at‬


Especially at larger tech companies, your ability to debug production issues is heavily
dependent on knowing where to find production logs and production metrics, and how to query
them. Even so, it usually takes months for senior engineers to appreciate the
importance of locating these systems.

Finding the right dashboards and logging portals can be especially challenging at companies
where teams own many services, and each uses different ways of logging things, records
information in various systems, or uses different logging formats.

Upon joining a company, make it a priority to learn where the production logs are stored, and
where to find systems’ health dashboards. These might be living in systems like Datadog,
Sentry, Splunk, New Relic, or Sumo Logic. Or within in-house systems built on top of the likes
of Prometheus, ClickHouse, Grafana, or other custom solutions. And they might be in a mix of
places. Figure out where they are, get access, and learn how to query them. Do this for
systems your team owns, and also related systems which you interact with.
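
How you query logs depends entirely on the system in use, and vendors such as Datadog or
Splunk have their own query languages, so the following is only a sketch of the underlying
idea: narrowing structured log lines down to one service and one severity. The JSON field names
(`service`, `level`, `msg`), the sample entries, and the `query_logs` helper are hypothetical
and not taken from any of the tools above.

```python
import json

# Hypothetical JSON-lines log export; real field names vary by company.
SAMPLE_LOGS = """\
{"ts": "2024-01-01T10:00:00Z", "service": "payments", "level": "INFO", "msg": "charge ok"}
{"ts": "2024-01-01T10:00:05Z", "service": "payments", "level": "ERROR", "msg": "card declined"}
{"ts": "2024-01-01T10:00:07Z", "service": "search", "level": "ERROR", "msg": "timeout"}
not a json line
"""


def query_logs(raw, service, level):
    """Return parsed log entries matching the given service and level."""
    matches = []
    for line in raw.splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if not isinstance(entry, dict):
            continue
        if entry.get("service") == service and entry.get("level") == level:
            matches.append(entry)
    return matches


for entry in query_logs(SAMPLE_LOGS, "payments", "ERROR"):
    print(entry["ts"], entry["msg"])  # 2024-01-01T10:00:05Z card declined
```

Whatever the tool, the same two questions apply: which fields can you filter on, and where do
the logs of each service end up.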

‭Make debugging easier for others‬


As a senior engineer, you should know which dashboards and logging systems to look at. But
if they are not in place, then you’re in a position to put them in place, and make them easy to
use.

We cover more on this topic in Part 5: “Reliable Software Engineering.”

‭Understand the codebase‬


Understand smaller codebases inside out. When working with a decent sized codebase –
typically no larger than 100,000 lines and written by no more than 20 people – there’s no
excuse for not understanding exactly where everything is located. Look through the structure
of the codebase, read a lot of code, and map out how the different parts of the code are
connected.

Draw up architecture diagrams based on reading the code, and ask people on your team to
confirm if your understanding is correct. Get to the point where you know which part of the
code owns what functionality.

With large codebases, it’s good to understand their structure and how to find relevant
parts. At larger companies, codebases of well over 1M lines, built by hundreds of engineers,
are common. It’s unrealistic to deeply understand a codebase of this size, but it is reasonable
to aim for a broad understanding, so you can go deep into the parts of it you need to work on.

At companies which use monorepos, get a sense of their structure and what different parts of
the monorepo are responsible for. How are various parts of the system built? How are tests
run?

At companies using standalone repositories, seek access to these. Aim to understand, at a
high level, how the systems relating to your team work. It’s a good exercise to check some of
these out, build them, run tests, and run the service or feature locally.

Find out how to search the whole codebase, and learn useful shortcuts. Most companies
have some kind of “global code search.” This might be a custom, in-house solution, or a
vendor tool like GitHub’s code search, or Sourcegraph. Find out how to use the global code
search tool and which features it supports. For example, how can you search a specific folder
of the code? How can you search for test cases? What about searching only the codebase
which your team owns?

Even at large companies where engineers can access most of the codebase, there are some
parts of the codebase which may be off limits. This is often for compliance, regulatory or
confidentiality reasons. In most cases, it should make no real difference to your day-to-day
work. But if it slows you down, you could ask for access.

‭Know enough about the infrastructure‬


Some production issues are caused by infrastructure problems. Figure out how services are
deployed into production, how secrets are stored, and how certificates are set up. Look into
how the infrastructure is managed, and where infrastructure configurations are stored.

If you work at a company with a dedicated infrastructure team, it can be tempting to skip the
learning process and turn to the infra team when you suspect an infrastructure issue.
However, this approach will ultimately slow you down. Besides, learning how infrastructure
works under the hood is not only interesting in itself; this depth of understanding is table
stakes for well-rounded senior engineers.

‭Learn through outages‬


Debug outages as they happen, and reread old outage investigations. A great way to improve
your debugging skills is to debug when it really matters, as outages happen. If your team has
an outage, offer to help investigate and find what caused it, so the cause can be mitigated.

Debugging outages requires learning to access and analyze production logs, locating the
code responsible for certain business logic, making changes to the code, validating changes,
and rolling them out. All of this happens in urgent situations when timely action matters.

There are ways to improve debugging skills for outages other than waiting for a bug to strike
your system. Check out postmortems of former outages, if your company publishes them. As
you read, try to “debug” by locating the logs which pinpoint issues, and finding the code
behind the outage. Researching historical outages is a great way to learn about new
dashboards and systems you don’t know well, and to discover new outage mitigation steps.

This is the end of the chapter excerpt. In the book, the chapter continues with the sections:

● 3. Tech debt
● 4. Documentation
● 5. Scaling best practices across a team

Part IV: Stakeholder management (Chapter 18)

This excerpt covers 2 out of the 6 sections from Chapter 18 in the book.

Stakeholders are people and groups with an interest in a project’s outcome. Internally, they
may be product folks, the legal team, engineering teams, or any other business unit.
Stakeholders can also be external to the company, in the form of users, customers, vendors,
regulatory bodies, and others.

The best time to figure out the key stakeholders in your project is as soon as possible. The
worst time is when you are ready to ship, as an important-enough person could then appear
seemingly from nowhere, take a proper look at your project for the first time, and declare
major changes are needed. In this case, it would have been better to consult this key
stakeholder earlier.

In this chapter, we cover ways of identifying stakeholders and working with them.

1. The real goal of stakeholder management
2. Types of stakeholders
3. Figuring out who your stakeholders are
4. Keeping them in the loop
5. Problematic stakeholders
6. Learning from stakeholders

‭1. THE REAL GOAL OF STAKEHOLDER MANAGEMENT‬


As a tech lead, why do you need to manage stakeholders? Why identify them and give them
detailed enough, frequent enough updates? Is it to maintain a good relationship with them?
This is nice to have, but isn’t the real goal.

The point of stakeholder management is for the project to succeed by keeping everyone
on the same page. So many projects fail because the people involved have different ideas on
what to do and how to do it. This means that when engineering announces a project is done,
business stakeholders often say that what has been built is not what the business needs.

Stakeholder management involves various approaches to ensure everyone with a meaningful
say in the project knows what’s happening, knows about new risks and changes to the
project, and is aware of – and does not object to – key responses to changes in a project. It is
just a tool to help a project succeed, and to ensure everyone agrees what success looks like.

I worked on a project with several teams involved, in which the project lead sent weekly,
pages-long status updates to all team members and posted updates on chat. Yet it felt like
everyone was pulling in different directions, and it was unclear what the real focus was,
beyond finishing our assigned engineering task. In the end, the project seemed like a failure
and left a sour taste in everyone’s mouths.

On another, similarly complex project, the goal was much clearer and updates were sparser,
but the project felt more united. And when we shipped, the business stakeholders surprised
the engineering team with a bottle of champagne as a thank you. The difference between this
project and the previous one? The project lead communicated much more with product folks
and business stakeholders.

Good stakeholder management is highly collaborative. For the latter, successful project,
the tech lead did far less “formal” stakeholder management in terms of emails and written
updates. What they did was talk with business stakeholders in person and on video. And the
tech lead became familiar enough with the business domain that they used good judgment in
seeking input from the business when an engineering risk meant potentially changing the
scope of the work.

‭2. TYPES OF STAKEHOLDERS‬


For most engineering projects, stakeholders typically fall into these groups:

● Customers: users of a project’s outputs. For engineering teams building B2C
(business to consumer) products, these are the end users. For B2B (business to
business) projects, they’re the product’s users. And for internal-facing projects –
frequently built by platform teams – these are internal teams.
● Business stakeholders: internal, non-tech groups at a company with a stake in a project,
such as Legal, Marketing, Customer Support, Finance, Operations, and others. My take
is that a good kickoff is needed so that engineering is aware of who all the business
stakeholders are, which we cover in the “Project Management” section. Why? If a
business stakeholder is out of the loop, it can delay a project.
● External stakeholders: teams at other companies with an interest in a project, such as
vendors, or other engineering teams at a partner organization.
● Product stakeholders: product managers, design, data science, and other groups in
tech which work closely with product managers. Most of them collaborate with
business stakeholders and the engineering team.
● Engineering stakeholders: internal engineering teams which are upstream or
downstream dependencies for a project, defined below.

‭Categorize stakeholders by dependency‬

Bucketing stakeholders into one of the following categories can be a useful mental model. For
this, visualize a flowing river, with teams building dams in different places, both downstream
and upstream from your team.

[Figure: upstream and downstream dependencies, visualized.]

● Upstream dependencies are teams whose work you depend on. They must do a
specific task in order for your team to do its work, and for the project to get done.
● Downstream dependencies are teams which depend on your work. Downstream
teams come after yours, meaning your work must be done before they can complete
their part of a project.
● Strategic stakeholders are people or teams you want to keep in the loop, who can
often help unblock upstream dependencies.

This categorization helps make it clearer which stakeholders to communicate with in certain
situations. For example:

● When making a change to one of your APIs, communicate this change to downstream
dependencies which depend on this API.
● When you need to use an API which another team owns, they are an upstream
dependency. Reach out to them and confirm that the API will not have any major
changes, and that they’re aware you’re building a new feature on top of it.
● A country marketing manager could have a special interest in your project because
they want to launch a campaign when the feature rolls out. They’re a strategic
stakeholder, so add this person to update emails, and keep them in the loop in case of
any delays.
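
One way to picture the upstream and downstream categories is as a small dependency map. The
sketch below is only an illustration: the team names, the `UPSTREAM` dictionary, and the helper
functions are hypothetical, and following the chain to indirect dependents is an assumption
added here rather than something the text prescribes.

```python
# Hypothetical map: each team -> the teams it depends on (its upstream dependencies).
UPSTREAM = {
    "checkout": ["payments-api", "identity"],
    "mobile-app": ["checkout"],
    "payments-api": [],
    "identity": [],
}


def upstream_of(team, upstream=UPSTREAM):
    """Teams whose work `team` depends on."""
    return sorted(upstream.get(team, []))


def downstream_of(team, upstream=UPSTREAM):
    """Teams which directly depend on the work of `team`."""
    return sorted(t for t, deps in upstream.items() if team in deps)


def teams_to_notify(team, upstream=UPSTREAM):
    """Direct and indirect downstream teams to tell about a change by `team`."""
    seen = set()
    queue = [team]
    while queue:
        current = queue.pop(0)
        for dependent in downstream_of(current, upstream):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


print(upstream_of("checkout"))          # ['identity', 'payments-api']
print(downstream_of("checkout"))        # ['mobile-app']
print(teams_to_notify("payments-api"))  # ['checkout', 'mobile-app']
```

Strategic stakeholders do not appear in a map like this, because they are defined by their
interest in the project rather than by a technical dependency.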

‭3. FIGURE OUT WHO YOUR STAKEHOLDERS ARE‬


‭ s a tech lead, knowing your key stakeholders is vital because not knowing can easily harm a‬
A
‭project in the forms of wasted work and delays. I have personal experience of this. On one‬
‭ roject, my team forgot to share the engineering plan – the RFC – with an engineering team‬
p
‭whose service we needed to modify. When we got to making this modification, the‬
‭engineering team in question blocked it because our approach made no sense for their‬
‭system. We eventually found a solution, but not knowing the team was a stakeholder delayed‬
‭the project by two weeks.‬

I‭n another case, I observed an engineering team work for a month on a project, only for the‬
‭legal department to intervene and block it, unexpectedly. Legal had not been in the loop,‬
‭even though they should have been. They reviewed the proposed changes, said the project‬
‭was too risky to ship and wouldn’t budge from this judgment, so the project was canceled.‬
‭The engineering team would have saved themselves much wasted work had they involved‬
‭the legal team earlier.‬

So, how do you find out who your stakeholders are? This question is especially relevant at
large companies with dozens – or hundreds – of engineering teams, and large numbers of
product/design/data and business folks. There’s the hard way, as detailed above, and there’s
the easy way:

Just ask! Consult people who definitely are stakeholders about who else could be a
stakeholder. For example:

‭This is the end of the chapter excerpt. In the book, the chapter continues with the sections:‬

‭‬
● 3. Figuring out who your stakeholders are
‭●‬ ‭4. Keeping them in the loop‬
‭●‬ ‭5. Problematic stakeholders‬
‭●‬ ‭6. Learning from stakeholders‬
‭Part V: Reliable Software Systems (Chapter 24)‬
This excerpt covers 2 out of the 7 sections from Chapter 24 in the book.

There’s a fair chance your organization implicitly or explicitly expects staff+ engineers to lead
efforts to make systems more reliable.

In this chapter, we cover common approaches for building and maintaining reliable systems,
‭including:‬

1. Owning reliability
‭2.‬ ‭Logging‬
‭3.‬ ‭Monitoring‬
‭4.‬ ‭Alerting‬
‭5.‬ ‭Oncall‬
‭6.‬ ‭Incident management‬
‭7.‬ ‭Building resilient systems‬

‭1. OWNING RELIABILITY‬


What role do you play in reliability as a staff+ engineer? In Big Tech, it’s often an explicit
expectation that you own reliability within your sphere of influence, be that on your own team
or other teams. This means it’s your responsibility to ensure that reliability is measured and
that plans are in place to improve it, and to advocate for the extra engineering bandwidth
needed to carry out those plans.

An OKR is often a helpful way to improve the reliability of systems. For example, you can
capture objectives to make systems more reliable, performant, and efficient. Then you can
define measurable key performance indicators (KPIs), such as:

● Improve the p95 latency for System X by 10%
● Increase the throughput of System Y by 30%, without changing the hardware
resources
● Decrease the cold start time of System Z by 15%

You almost always need to partner with engineering managers to move the needle on
reliability. At the end of the day, engineering managers are responsible and accountable for
the performance of their teams and reliability of their systems. However, as a staff+ engineer,
you possess the skills to recognize when reliability is a problem, and to employ various
approaches to improve it. You can – and should! – bring data to engineering managers to
highlight why it’s important to invest in reliability, and what the return on this investment would
be.

‭We covered more on OKRs and KPIs in Part 5: “Understanding the Business.”‬
‭2. LOGGING‬
Before we dive into logging approaches, let’s set the record straight about why it matters.
Logs are meant to help an engineering team debug production issues, by capturing
information that would otherwise be missing but is necessary for future troubleshooting.

Which logging strategy can help your team debug its production issues? Well, this depends
on your application, platform, and business environment.

‭There’s a logging toolset that can be helpful when deciding how and what to log:‬

● Log levels. Most logging tools provide ways to log at various levels, such as
“debug,” “info,” “warning,” and “error.” These levels can then be used when filtering
logs. How they’re used depends on your environment and team practices.
● Log structure. Which details do logs capture? Are local variables logged? Do logs
capture timestamps – down to milliseconds or nanoseconds – to make it easy to tell
which one of two logging events happened first? Do these timestamps include
timezones?
● Automated logging. Which parts of the system log automatically, so logging isn’t
dependent on an engineer remembering to do it?
● Log retention. How long are logs retained on client devices, and for how long are they
kept on the backend? Retaining logs for longer can be useful, but takes up space and
could end up costing more in data storage.
● Toggling logging levels. For applications, it’s common practice to have “debug builds”
where all log levels are output, but only warning or error log levels are logged on a
production build. The details depend on platform-level implementation and team
practices.
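
As a minimal sketch of how log levels, timestamp structure, and level toggling can fit together, here is one way to set them up with Python's standard `logging` module. The `APP_BUILD` environment variable, the `example_app` logger name, and the `level_for_build` policy are assumptions made for illustration; your platform will have its own way to distinguish builds.

```python
import logging
import os
from datetime import datetime, timezone


class UtcMillisFormatter(logging.Formatter):
    """Formats timestamps as ISO 8601 in UTC with millisecond precision."""

    def formatTime(self, record, datefmt=None):
        moment = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return moment.isoformat(timespec="milliseconds")


def level_for_build(build: str) -> int:
    """Assumed policy: debug builds log everything, other builds warnings and up."""
    return logging.DEBUG if build == "debug" else logging.WARNING


def configure_logging(build: str) -> logging.Logger:
    """Attaches one handler with the UTC formatter and sets the build's level."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        UtcMillisFormatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger("example_app")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(level_for_build(build))
    return logger


if __name__ == "__main__":
    # APP_BUILD is a hypothetical variable; unset means a production build.
    log = configure_logging(os.environ.get("APP_BUILD", "production"))
    log.debug("cache warmed")             # emitted only on debug builds
    log.warning("retrying payment call")  # emitted on every build
```

Because the timestamp carries both milliseconds and a UTC offset, two events from different machines can be ordered without guessing which timezone each one was written in.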

‭Make your logging practices explicit‬


Consider introducing logging practices if the teams you work with don’t have any. Logging is
an area where engineers often wish they had pushed for agreement on what and how to log,
usually at the moment they’re trying and failing to find information in the logs.

Putting a short logging guide together for the team is a matter of talking with a few engineers,
and empowering a team member to make a proposal – or doing it yourself. For logging
basics, agreeing on something is better than nothing, as long as the team knows it owns this
guide and can change it.

‭A logging guide that’s stood the test of time‬


The guide below is from 2008, by Anton Chuvakin, who was then chief logging evangelist at
LogLogic. This logging guide remains relevant, and so with Anton’s consent, here it is:

‭The best logs:‬

● Tell you exactly what happened: when, where, how
● Are suitable for manual, semi-automated, and automated analysis
‭●‬ ‭Can be analyzed without the application that produced them being to hand‬
‭●‬ ‭Don't slow the system down‬
‭●‬ ‭Can be proven as reliable if used as evidence‬

‭Events To Log‬

‭‬
● Authentication/authorization decisions (including logoff)
‭●‬ ‭System access, data access‬
‭●‬ ‭System/application changes (especially privilege changes)‬
‭●‬ ‭Data changes: add/edit/delete‬
‭●‬ ‭Invalid input (possible badness/threats)‬
‭●‬ ‭Resources (RAM, Disk, CPU, Bandwidth, any other hard or soft limits)‬
‭●‬ ‭Health/availability: startups/shutdowns, faults/errors, delays, backups success/failure‬

‭What To Log – Every Event Should Have:‬

● Timestamp & timezone (when)
‭●‬ ‭System, application, or component (where); IP's and contemporaneous DNS lookups‬
‭of involved parties; names/roles of systems involved (what servers are we talking to?),‬
‭name/role of local application (what is this server?)‬
‭●‬ ‭User (who)‬
‭●‬ ‭Action (what)‬
‭●‬ ‭Status (result)‬
‭●‬ ‭Priority (severity, importance, rank, level, etc)‬
‭●‬ ‭Reason‬
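
One way to make the checklist above hard to forget is to encode it as a type, so an event cannot be created without every field. The sketch below is illustrative only: the `LogEvent` class, the `make_event` and `to_json_line` helpers, and the sample values (`auth-service`, `user-123`) are hypothetical, and it covers the core fields rather than the IP and DNS details.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEvent:
    """One log event carrying the core fields from the checklist."""
    timestamp: str  # when, including the UTC offset
    component: str  # where
    user: str       # who
    action: str     # what
    status: str     # result
    priority: str   # severity
    reason: str     # why the event ended with this status


def make_event(component, user, action, status, priority, reason, now=None):
    """Builds an event, stamping it with a timezone-aware timestamp."""
    moment = now or datetime.now(timezone.utc)
    return LogEvent(moment.isoformat(), component, user, action,
                    status, priority, reason)


def to_json_line(event: LogEvent) -> str:
    """Serializes the event as one JSON line, suitable for automated analysis."""
    return json.dumps(asdict(event), sort_keys=True)


event = make_event(
    component="auth-service",  # hypothetical service name
    user="user-123",           # hypothetical user id
    action="login",
    status="denied",
    priority="warning",
    reason="invalid password",
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(to_json_line(event))
```

Emitting one JSON object per line keeps the log readable by a person while letting scripts parse it without the application that produced it.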

‭Have a framework that makes logging the “right” way easy‬


How does your team do logging? Does everyone invoke logs however they see fit? This
approach makes sense for very small teams with senior engineers, but on larger teams it
tends to result in ad-hoc logging approaches: devs log to the console, or use a third-party
logging vendor, or invoke an in-house logging solution.

A relatively straightforward way to improve consistency is to reach an agreement on the
logging approach – for example, which strategies to use – and then make it really hard to log
the “wrong” way by introducing a lightweight but opinionated logging framework.

But why put another framework in place, just for logging? Creating a simple interface helps
abstract the underlying vendor in use, which could be especially relevant at larger companies
where vendors change and it’s helpful to make migrations far easier. It can also help analyze
logging usage in future. Of course, don’t build a new framework for its own sake; do it when it
solves the problem of ad-hoc, inconsistent logging, and unclear guidelines for which
frameworks to use.
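
To make the idea of a lightweight but opinionated framework concrete, here is a rough sketch of what such an interface could look like. Every name in it (`LogBackend`, `StdlibBackend`, `InMemoryBackend`, `TeamLogger`) is hypothetical; the point is only that application code talks to one small facade, while the vendor sits behind an adapter that can be swapped.

```python
import logging
from typing import Protocol

# Levels the facade allows; anything else is rejected by design.
LEVELS = {"info": logging.INFO, "warning": logging.WARNING, "error": logging.ERROR}


class LogBackend(Protocol):
    """What any vendor adapter must provide (a hypothetical interface)."""

    def write(self, level: str, message: str, fields: dict) -> None: ...


class StdlibBackend:
    """Adapter that forwards events to Python's built-in logging module."""

    def __init__(self, name: str = "app") -> None:
        self._logger = logging.getLogger(name)

    def write(self, level: str, message: str, fields: dict) -> None:
        self._logger.log(LEVELS[level], "%s %s", message, fields)


class InMemoryBackend:
    """Adapter that keeps events in a list, which is handy in tests."""

    def __init__(self) -> None:
        self.events = []

    def write(self, level: str, message: str, fields: dict) -> None:
        self.events.append((level, message, fields))


class TeamLogger:
    """Opinionated facade: every event is tagged with its component."""

    def __init__(self, backend: LogBackend, component: str) -> None:
        self._backend = backend
        self._component = component

    def _emit(self, level: str, message: str, **fields) -> None:
        self._backend.write(level, message, {"component": self._component, **fields})

    def info(self, message: str, **fields) -> None:
        self._emit("info", message, **fields)

    def error(self, message: str, **fields) -> None:
        self._emit("error", message, **fields)


backend = InMemoryBackend()
log = TeamLogger(backend, component="checkout")
log.error("payment failed", order_id="A1")
```

Migrating to a different vendor then means writing one new adapter with a `write` method, rather than touching every call site.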
‭3. MONITORING‬
How can you tell if a system is healthy or not? The most reliable way is to monitor key
characteristics, and trigger an alert when a metric seems unhealthy.

50th, 95th, 99th percentile


Percentiles are a key concept in monitoring and service level agreements (SLAs). When
monitoring things like load times or response times, it’s not enough to look only at average
numbers. Why not? They can mask worst-case scenarios which impact many customers. To
avoid this, consider monitoring the following percentiles:

● p50: the 50th percentile or median value. 50% of data points are below this number,
and 50% are above. This value represents the “average” use case pretty well.
● p95: the 95th percentile. 95% of data points are at or below this value, so it marks the
boundary of the worst-performing 5%. This value is particularly important in
performance-monitoring scenarios because the worst-performing 5% of data points
could refer to power users.
● p99: the 99th percentile. Only 1% of customers or requests see times longer than this
value. It could be acceptable for this number to be an outlier in some use cases.
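
To see how an average can mask the slow tail, here is a small sketch that computes percentiles with the nearest-rank method. This is one of several common percentile definitions, and monitoring tools may interpolate differently. The `percentile` helper and the latency numbers are made up for illustration.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted data
    return ordered[rank - 1]


# Made-up response times in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [100] * 90 + [2000] * 10

print("mean:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

With this sample the mean is 290 ms and the p50 is 100 ms, both of which look healthy, while the p95 and p99 are 2000 ms. One request in ten is twenty times slower than the median, and only the higher percentiles show it.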

‭Things to monitor‬
So what should you monitor? There are plenty of obvious choices which provide health
information about a system or app, including:

● Uptime. For what percentage of time is the system or app fully operational?
‭●‬ ‭CPU, memory, disk space.‬‭Monitoring resource usage‬‭can provide useful indicators‬
‭for when a service or app risks becoming unhealthy.‬
‭●‬ ‭Response times‬‭. How long does it take a system or‬‭app to respond? What is the‬
‭median, and what’s the experience of the slowest 5% of requests or users (p95), and‬
‭the slowest 1% (p99)?‬
‭●‬ ‭Error rates.‬‭How frequent are errors, such as exceptions‬‭thrown, 4XX responses on‬
‭HTTP services, and other error states? What percentage of all requests are errors?‬
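
Two of the metrics above, uptime and error rate, are simple ratios. The sketch below shows one way to compute them; the function names and sample numbers are hypothetical, and counting every 4XX as an error is a simplifying assumption (some teams track only 5XX, since 4XX responses can be caused by clients).

```python
from collections import Counter


def error_rate(status_codes):
    """Share of requests answered with a 4XX or 5XX status, as a percentage."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return 100 * errors / len(status_codes)


def uptime_percent(total_seconds, downtime_seconds):
    """Percentage of the period during which the system was fully operational."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100 * (total_seconds - downtime_seconds) / total_seconds


def status_classes(status_codes):
    """Counts responses per class (2XX, 4XX, 5XX, ...) so spikes stand out."""
    return Counter(f"{code // 100}XX" for code in status_codes)


# Made-up sample: 100 responses, five of which are errors.
codes = [200] * 95 + [404] * 3 + [500] * 2
print("error rate %:", error_rate(codes))
print("by class:", dict(status_classes(codes)))

# Made-up sample: 2,592 seconds of downtime in a 30-day period.
thirty_days = 30 * 24 * 60 * 60
print("uptime %:", uptime_percent(thirty_days, 2592))
```

For this sample the error rate is 5% and the uptime works out to 99.9%, which corresponds to roughly 43 minutes of downtime in a 30-day month.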

‭For backend services:‬

● HTTP status code responses. If there is a spike in error codes like 5XX or 4XX, it
could indicate a problem.
● Latency metrics. What are the p50, p95 and p99 latencies of server responses?

‭For web apps and mobile apps, additional metrics are worth monitoring:‬

● Page load time. How long does the webpage take to load? How does this compare
across p50, p75 and p95?
● Core Web Vitals metrics. Google released “Web Vitals” in 2020, which are quality
signals to deliver a great user experience. These metrics can capture a more detailed
picture of web performance. The core signals are Largest Contentful Paint (LCP), First
Input Delay (FID), and Cumulative Layout Shift (CLS).

‭For a mobile app, additional metrics worth monitoring are:‬

● Start-up time. How long does it take for the app to start? The longer this takes, the
more likely customer churn is.
‭●‬ ‭Crash rate‬‭. What percentage of sessions end with the‬‭app crashing?‬
‭●‬ ‭App bundle size.‬‭How does this change over time? This‬‭is important for apps because‬
‭a larger size could mean fewer users install it.‬

Business metrics tell the “real” story of how healthy apps or services are. The metrics above
are more generic and infrastructural; they indicate fundamental problems. However, the
above metrics can look good, and a service or app can still be unhealthy.

‭Monitoring business metrics‬


To get a full picture of systems’ health, you need to monitor business metrics which are highly
specific to the product. For example at Uber, core business metrics for the Rides products
were lifecycle events:

‭This is the end of the chapter excerpt. In the book, the chapter continues with the sections:‬

‭‬
● 3. Monitoring (the second part of the section)
‭●‬ ‭4. Alerting‬
‭●‬ ‭5. Oncall‬
‭●‬ ‭6. Incident management‬
‭●‬ ‭7. Building resilient systems‬
‭Index‬
The index of the book gives a sense of the topics the book covers. Below are the
first 3 index pages from the 7-page index.
‭Get the book here.‬
