
SRE Fundamentals

TAM Webinar

SEP 2022

Let’s go

1 Purpose & Target

2 Agenda

3 Learning & Certification

4 Q&A
Purpose & Target
SRE Fundamentals
The objective of this TAM Webinar is to share Site Reliability Engineering (SRE) knowledge
with Google Cloud Community.

In this webinar you will learn the principles and practices that allow your systems to be
more scalable, reliable, and efficient. These lessons can be directly applied to your
company.
Agenda
SRE Fundamentals Agenda
14:00 ~ 14:05 { Opening }

14:05 ~ 14:50 { SRE introduction }

14:50 ~ 15:00 { Q&A}


Sep XX, 14:00 (BRT)
Pamella Canova, Technical Account Manager
Introduction to SRE

Pamella Canova
Technical Account Manager
Topics
● What is SRE?
● Key principles of SRE
● Practices of SRE
● How to get started
● Ways to get help
What is SRE?
1
Definition & History
"SRE is what happens when you ask a software engineer to design an operations
team" Benjamin Treynor, SRE VP.

Site reliability engineering (SRE) is a set of principles and practices that incorporates
aspects of software engineering and applies them to infrastructure and operations
problems. The main goals are to create scalable and highly reliable software systems.

Site reliability engineering (SRE) was born at Google in 2003, prior to the DevOps
movement, when the first team of software engineers, led by Ben Treynor Sloss, was
tasked with making Google's already large-scale sites more reliable, efficient, and scalable.
The practices they developed responded so well to Google's needs that other big tech
companies also adopted them and brought new practices to the table.
Software's long-term cost
Software engineering as a discipline focuses on designing and building rather than
operating and maintaining, despite estimates that 40% [1] to 90% [2] of the total costs
are incurred after launch.

[1] Glass, R. (2002). Facts and Fallacies of Software Engineering. Addison-Wesley Professional; p. 115.
[2] Dehaghani, S. M. H., & Hajrahimi, N. (2013). Which Factors Affect Software Projects Maintenance Cost More? Acta Informatica Medica, 21(1), 63–66. http://doi.org/10.5455/AIM.2012.21.63-66



Incentives aren't aligned.
Developers want agility; Operators want stability.
Reducing product lifecycle friction
Product lifecycle: Concept → Business → Development → Operations → Market
Agile solves the friction at the Business/Development interface; DevOps solves the friction at the Development/Operations interface.
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
DevOps 5 key areas:
1. Reduce organizational silos
2. Accept failure as normal
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything
The SRE approach to operations
Use data to guide decision-making. Treat operations like a software engineering problem:
● Hire people motivated and capable of writing automation.
● Use software to accomplish tasks normally done by sysadmins.
● Design more reliable and operable service architectures from the start.
What do SRE teams do?
Site Reliability Engineers develop solutions to design, build, and run large-scale systems scalably, reliably, and efficiently. We guide system architecture by operating at the intersection of software development and systems engineering.
SRE is a job function, a mindset, and a set of engineering approaches to running better production systems. We approach our work with a spirit of constructive pessimism: we hope for the best, but plan for the worst.
class SRE implements DevOps
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
DevOps 5 key areas:
1. Reduce organizational silos
2. Accept failure as normal
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything
Error Budgets
The key principle of SRE
2
How to measure reliability
Naive approach: time-based availability.
Availability = good time / total time
= the fraction of time the service is available and working
Relatively easy to measure for a continuous binary metric (e.g. machine uptime) and intuitive for humans.
Much harder for distributed request/response services:
– Is a server that currently gets no requests up or down?
– If 1 of 3 servers is down, is the service up or down?
How to measure reliability
More sophisticated approach: request-based availability.
Availability = good interactions / total interactions
= the fraction of real users for whom the service is available and working
Handles distributed request/response services well, and answers the awkward cases above:
– Is a server that currently gets no requests up or down?
– If 1 of 3 servers is down, is the service up or down?
A minimal sketch of both calculations follows.
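The two formulas can be compared directly in a few lines of code. This is an illustrative sketch only; the function names and example numbers are assumptions, not part of the slides.

```python
# Minimal sketch: computing availability both ways.

def time_based_availability(good_seconds: float, total_seconds: float) -> float:
    """Naive approach: fraction of time the service was up."""
    return good_seconds / total_seconds

def request_based_availability(good_interactions: int, total_interactions: int) -> float:
    """Request-based approach: fraction of user interactions that succeeded."""
    return good_interactions / total_interactions

if __name__ == "__main__":
    # 30-day window with 43.2 minutes of downtime -> 99.9% time-based availability.
    month = 30 * 24 * 3600
    print(time_based_availability(month - 43.2 * 60, month))        # ~0.999

    # 1,000,000 interactions with 500 failures -> 99.95% request-based availability.
    print(request_based_availability(1_000_000 - 500, 1_000_000))   # 0.9995
```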
Allowed unreliability window

Reliability level   per year        per quarter      per 30 days
90%                 36.5 days       9 days           3 days
95%                 18.25 days      4.5 days         1.5 days
99%                 3.65 days       21.6 hours       7.2 hours
99.5%               1.83 days       10.8 hours       3.6 hours
99.9%               8.76 hours      2.16 hours       43.2 minutes
99.95%              4.38 hours      1.08 hours       21.6 minutes
99.99%              52.6 minutes    12.96 minutes    4.32 minutes
99.999%             5.26 minutes    1.30 minutes     25.9 seconds

Source: https://landing.google.com/sre/sre-book/chapters/availability-table/
Allowed unreliability window
The same budget can be read as "how long can we run at a given error rate". For example, with a 21.6-minute error budget (a 99.95% target over 30 days), the budget lasts:

Error rate   Allowed duration
100%         21.6 minutes
10%          3.6 hours
1%           36 hours
0.1%         15 days
<0.05%       all month

Source: https://landing.google.com/sre/sre-book/chapters/availability-table/
A small calculator along these lines is sketched below.
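The arithmetic behind both tables is simple enough to script. This is a hedged sketch; the function names are made up for illustration.

```python
# Minimal sketch: turning a reliability target into an allowed-unreliability window,
# and into a "how long can we burn at a given error rate" figure.

def allowed_downtime_minutes(target: float, window_days: float) -> float:
    """Error budget = (1 - target) of the window, expressed in minutes."""
    return (1.0 - target) * window_days * 24 * 60

def time_to_exhaust_budget_minutes(target: float, window_days: float, error_rate: float) -> float:
    """How long the service can run at `error_rate` before the budget is gone."""
    return allowed_downtime_minutes(target, window_days) / error_rate

if __name__ == "__main__":
    print(allowed_downtime_minutes(0.999, 30))               # 43.2 minutes per 30 days
    print(time_to_exhaust_budget_minutes(0.9995, 30, 0.01))  # 2160 minutes = 36 hours at 1% errors
```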

“100% is the wrong reliability target for basically everything.”
Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google
Error budgets
● Product management & SRE define an availability target.
● 100% minus the availability target is a "budget of unreliability" (the error budget).
● Monitoring measures actual uptime.
● Control loop for utilizing the budget!



Benefits of error budgets
● Common incentive for devs and SREs: find the right balance between innovation and reliability.
● Dev team can manage the risk themselves: they decide how to spend their error budget.
● Unrealistic reliability goals become unattractive: such goals dampen the velocity of innovation.
● Dev team becomes self-policing: the error budget is a valuable resource for them.
● Shared responsibility for system uptime: infrastructure failures eat into the devs' error budget.
Glossary of terms
● SLI (service level indicator): a well-defined measure of 'successful enough'. Used to specify an SLO/SLA. Typically Func(metric) < threshold.
● SLO (service level objective): a top-line target for the fraction of successful interactions. Specifies a goal (SLI + goal).
● SLA (service level agreement): adds consequences. SLA = (SLO + margin) + consequences = SLI + goal + consequences.
SLO definition and measurement
Service-level objective (SLO): a target for SLIs aggregated over time.
● Measured using an SLI (service-level indicator).
● Typically, sum(SLI met) / window >= target percentage.
● Try to exceed the SLO target, but not by much.
Choosing an appropriate SLO is complex. Try to keep it simple, avoid absolutes; perfection can wait.
Why?
● Sets priorities and constraints for SRE and dev work.
● Sets user expectations about the level of service.
A small compliance check along these lines is sketched below.
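A minimal sketch of the sum(SLI met) / window >= target check. The data model (a list of per-interaction pass/fail results) is an assumption made for illustration.

```python
# Minimal sketch: checking an SLO over a window of per-interaction SLI outcomes.

def slo_met(sli_results: list[bool], target: float) -> bool:
    """sli_results holds one boolean per interaction in the window:
    True if the interaction was 'successful enough' per the SLI definition."""
    if not sli_results:
        return True  # no traffic in the window; nothing violated the SLO
    achieved = sum(sli_results) / len(sli_results)
    return achieved >= target

if __name__ == "__main__":
    window = [True] * 9990 + [False] * 10   # 99.9% of interactions met the SLI
    print(slo_met(window, target=0.999))    # True
    print(slo_met(window, target=0.9995))   # False
```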
Product lifecycle
Business process: Concept → Business → Development → Operations → Market
SLO & SRE solve this problem.
class SRE implements DevOps
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
DevOps 5 key areas:
1. Reduce organizational silos: Share ownership
2. Accept failure as normal: Error budgets
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything: Measure reliability
The practices of SRE
3
Areas of practice
● Metrics & Monitoring
● Capacity Planning
● Change Management
● Emergency Response
● Culture
Monitoring & Alerting
Monitoring: automate recording of system metrics.
● The primary means of determining and maintaining reliability.
Alerting: triggers a notification when conditions are detected.
● Page: immediate human response is required.
● Ticket: a human needs to take action, but not immediately.
Only involve humans when the SLO is threatened.
● Humans should never watch dashboards, read log files, and so on just to determine whether the system is okay.
A minimal paging-versus-ticketing sketch follows.
Demand forecasting and capacity planning
Plan for organic growth
● Increased product adoption and usage by customers.
Determine inorganic growth
● Sudden jumps in demand due to feature launches, marketing campaigns, etc.
Correlate raw resources to service capacity
● Make sure that you have enough spare capacity to meet your reliability goals.
A simple capacity calculation is sketched below.
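A minimal sketch of correlating raw resources to service capacity under a demand forecast. All figures, names, and the two-instance redundancy choice are illustrative assumptions.

```python
# Minimal sketch: translate a demand forecast into an instance count with spare capacity.

import math

def instances_needed(peak_qps: float, qps_per_instance: float, redundancy: int = 2) -> int:
    """Turn forecast demand into an instance count, keeping `redundancy` extra
    instances so the reliability goal survives instance failures."""
    return math.ceil(peak_qps / qps_per_instance) + redundancy

if __name__ == "__main__":
    organic_growth = 1.3     # +30% adoption expected this quarter
    launch_spike = 1.5       # inorganic jump expected from a feature launch
    forecast_peak = 4_000 * organic_growth * launch_spike   # current peak is 4,000 QPS
    print(instances_needed(forecast_peak, qps_per_instance=500))   # 16 + 2 = 18
```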
Efficiency and performance
Capacity can be expensive → optimize utilization.
● Resource use is a function of demand (load), capacity, and software efficiency.
● SRE demands prediction and provisioning, and can modify the software.
SRE monitors utilization and performance.
● Regressions can be detected and acted upon.
● Immature team: by adjusting the resources or by improving the software efficiency.
● Mature team: rollback.
Change management
Roughly 70% [1] of outages are due to changes in a live system.
Mitigations:
● Implement progressive rollouts
● Quickly and accurately detect problems
● Roll back changes safely when problems arise
Remove humans from the loop with automation to:
● Reduce errors
● Reduce fatigue
● Improve velocity
[1] Analysis of Google internal data, 2011-2018
A sketch of an automated progressive rollout follows.
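A minimal sketch of the three mitigations above: progressive rollout, fast detection, and automated rollback. The stage fractions, the threshold, and the hooks (set_traffic_fraction, get_error_rate, rollback) are hypothetical stand-ins for your own release tooling.

```python
# Minimal sketch: roll a change out in stages and roll back automatically on regression.

import time

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic on the new version
MAX_ERROR_RATE = 0.001                             # abort threshold (assumed)

def progressive_rollout(set_traffic_fraction, get_error_rate, rollback, soak_seconds=600):
    for fraction in ROLLOUT_STAGES:
        set_traffic_fraction(fraction)             # implement gradual changes
        time.sleep(soak_seconds)                   # let the change soak at this stage
        if get_error_rate() > MAX_ERROR_RATE:      # quickly and accurately detect problems
            rollback()                             # roll back safely, no human in the loop
            return False
    return True
```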
Pursuing maximum change velocity
100% is the wrong reliability target for basically everything.
● Determine the desired reliability for your product.
● Don't try to provide better quality than desired.
Spend error budget to increase development velocity.
● The goal is not zero outages, but maximum velocity within the error budget.
● Use the error budget for releases, experiments, etc.
A sketch of an error-budget-based release gate follows.
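A minimal sketch of the error-budget control loop as a release gate: spend budget on releases while budget remains, freeze when it is exhausted. The freeze-at-zero policy and function names are illustrative assumptions.

```python
# Minimal sketch: gate releases on remaining error budget.

def remaining_error_budget(target: float, good: int, total: int) -> float:
    """Fraction of the window's error budget still unspent."""
    allowed_failures = (1.0 - target) * total
    actual_failures = total - good
    return max(0.0, 1.0 - actual_failures / allowed_failures) if allowed_failures else 0.0

def may_release(target: float, good: int, total: int) -> bool:
    return remaining_error_budget(target, good, total) > 0.0

if __name__ == "__main__":
    # 99.9% target, 1,000,000 interactions, 400 failures -> 60% of the budget left.
    print(remaining_error_budget(0.999, 999_600, 1_000_000))  # 0.6
    print(may_release(0.999, 999_600, 1_000_000))             # True
```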
Provisioning
A combination of change management and capacity planning.
● Increase the size of an existing service instance/location.
● Spin up additional instances/locations.
Needs to be done quickly
● Unused capacity can be expensive.
Needs to be done correctly
● Added capacity needs to be tested.
● Often a significant configuration change → risky.
Emergency response
"Things break, that's life."
Few people naturally react well to emergencies, so you need a process:
● First of all, don't panic! You aren't alone and the sky isn't falling.
● Mitigate, troubleshoot, and fix.
● If you feel overwhelmed, pull in more people.
Incident & postmortem thresholds
● User-visible downtime or degradation beyond a certain threshold
● Data loss of any kind
● Significant on-call engineer intervention (release rollback, rerouting of traffic, etc.)
● A resolution time above some threshold
It is important to define incident & postmortem criteria before an incident occurs.

Postmortem philosophy
The primary goals of writing a postmortem are to ensure that:
● The incident is documented
● All contributing root causes are well understood
● Effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence
Postmortems are expected after any significant undesirable event.
● Writing a postmortem is not a punishment.
Blamelessness
● Postmortems must focus on identifying the contributing causes without indicting any individual or team.
● A blamelessly written postmortem assumes that everyone involved in an incident had good intentions.
● "Human" errors are systems problems. You can't "fix" people, but you can fix systems and processes to better support people in making the right choices.
● If a culture of finger pointing prevails, people will not bring issues to light for fear of punishment.
Toil management / operational work
Why? Because:
● Exposure to real failures guides how you design systems
● You can't automate everything
● If you do enough Ops work, you know what to automate
What? Work directly tied to running a service that is:
● Manual (manually running a script)
● Repetitive (done every day or for every new customer)
● Automatable (no human judgement is needed)
● Tactical (interrupt-driven and reactive)
● Without enduring value (no long-term system improvements)
● O(n) with service growth (grows with user count or service size)
A small checklist sketch of these attributes follows.
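A minimal sketch that encodes the checklist above as a scoring helper. The field names and the "three or more attributes" cutoff are assumptions made purely for illustration.

```python
# Minimal sketch: score a task against the toil attributes.

from dataclasses import dataclass

@dataclass
class Task:
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    grows_with_service: bool   # O(n) with user count or service size

def toil_score(task: Task) -> int:
    """Count how many toil attributes the task exhibits."""
    return sum(vars(task).values())

def is_toil(task: Task, cutoff: int = 3) -> bool:
    return toil_score(task) >= cutoff

if __name__ == "__main__":
    ticket_triage = Task(manual=True, repetitive=True, automatable=True,
                         tactical=True, no_enduring_value=True, grows_with_service=True)
    print(is_toil(ticket_triage))   # True: a strong candidate for automation
```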
Team skills
Hire good software engineers (SWE) and good systems engineers (SE), not necessarily all in one person. Try to get a 50:50 mix of SWE and SE skillsets on the team. Everyone should be able to code. SE != "ops work".
For more detail, see "Hiring Site Reliability Engineers," by Chris Jones, Todd Underwood, and Shylaja Nukala, ;login:, June 2015.
Empowering SREs
● SREs must be empowered to enforce the error budget and toil budget.
● SREs are valuable and scarce. Use their time wisely.
● Avoid forcing SREs to take on too much operational burden; load-shed to keep the team healthy.


Recap of SRE practices
● Metrics & Monitoring: SLOs, Dashboards, Analytics
● Capacity Planning: Forecasting, Demand-driven, Performance
● Change Management: Release process, Consulting design, Automation
● Emergency Response: Oncall, Analysis, Postmortems
● Culture: Toil management, Engineering alignment, Blamelessness
class SRE implements DevOps
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
DevOps 5 key areas:
1. Reduce organizational silos: Share ownership
2. Accept failure as normal: Error budgets & blameless postmortems
3. Implement gradual changes: Reduce cost of failure
4. Leverage tooling and automation: Automate common cases
5. Measure everything: Measure toil and reliability
How to get started
4
Do these four things.
1. Start with Service Level Objectives. SRE teams work to an SLO and/or error budget. They defend the SLO.
2. Hire people who write software. They'll quickly become bored by performing tasks by hand and replace manual work.
3. Ensure parity of respect with the rest of the development/engineering organization.
4. Provide a feedback loop for self-regulation. SRE teams choose their work. SREs must be able to shed work or reduce SLOs when overloaded.
You can do this.
● Pick one service to run according to the SRE model.
● Empower the team with strong executive sponsorship and support.
● Culture and psychological safety are critical.
● Measure Service Level Objectives & team health.
● Incremental progress frees time for more progress.
Spread the love.
● Spread the techniques and knowledge once you have a solid case study within your company.
● If you have well-defined SLOs, Google can work with you to reduce friction via shared monitoring and other collaboration.
SRE solves cloud reliability.
● Effortless scale shouldn't meet escalating operational demands.
● Automation and engineering for operability enable scaling systems without scaling organizations.
● Tension between product development and operations doesn't need to exist.
● Error budgets provide measurement and flexibility to deliver both reliability and product velocity.
Learning & Certification
Find Google SRE publications—including the SRE Books, articles, trainings,
and more—for free at sre.google/resources.



Site Reliability Engineering: Measuring and Managing Reliability
https://www.coursera.org/learn/site-reliability-engineering-slos
Google Cloud Certifications
● Foundational: cloud knowledge and working in the cloud.
  Cloud Digital Leader
● Associate: recommended 6+ months hands-on experience with GCP.
  Associate Cloud Engineer
● Professional: recommended 3+ years industry experience & 1 year hands-on experience with GCP.
  Professional Cloud Architect, Professional Cloud Developer, Professional Data Engineer, Professional Cloud DevOps Engineer, Professional Cloud Network Engineer, Professional Cloud Security Engineer, Professional Machine Learning Engineer, Professional Collaboration Engineer
Questions?
Thank you
