
SRE Fundamentals

TAM Webinar

SEP 2022

Let’s go

1 Purpose & Target

2 Agenda

3 Learning & Certification

4 Q&A
Purpose & Target
SRE Fundamentals
The objective of this TAM Webinar is to share Site Reliability Engineering (SRE) knowledge
with Google Cloud Community.

In this webinar you will learn the principles and practices that allow your systems to be
more scalable, reliable, and efficient. These lessons can be directly applied to your
company.
Agenda
SRE Fundamentals Agenda
14:00 ~ 14:05 { Opening }

14:05 ~ 14:50 { SRE introduction }

14:50 ~ 15:00 { Q&A}


Sep XX, 14:00 (BRT)
Pamella Canova, Technical Account Manager
Introduction to SRE

Pamella Canova
Technical Account Manager
Topics
● What is SRE?
● Key principles of SRE
● Practices of SRE
● How to get started
● Ways to get help
What is SRE?
1
Definition & History
"SRE is what happens when you ask a software engineer to design an operations
team" Benjamin Treynor, SRE VP.

Site reliability engineering (SRE) is a set of principles and practices that incorporates
aspects of software engineering and applies them to infrastructure and operations
problems. The main goals are to create scalable and highly reliable software systems.

Site reliability engineering (SRE) was born at Google in 2003, prior to the DevOps
movement, when the first team of software engineers, led by Ben Treynor Sloss, was
tasked with making Google's already large-scale sites more reliable, efficient, and scalable.
The practices they developed responded so well to Google's needs that other big tech
companies also adopted them and brought new practices to the table.
Software's long-term cost
Software engineering as a discipline focuses on designing and building rather than
operating and maintaining, despite estimates that 40% [1] to 90% [2] of the total costs
are incurred after launch.

[1] Glass, R. (2002). Facts and Fallacies of Software Engineering. Addison-Wesley Professional; p. 115.
[2] Dehaghani, S. M. H., & Hajrahimi, N. (2013). Which Factors Affect Software Projects Maintenance Cost More? Acta Informatica Medica, 21(1), 63–66. http://doi.org/10.5455/AIM.2012.21.63-66



Incentives aren't aligned.
Developers want agility; Operators want stability.
Reducing product lifecycle friction
Product lifecycle: Concept → Business → Development → Operations → Market
Agile solves the friction at the Business/Development interface; DevOps solves the friction at the Development/Operations interface.
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
DevOps 5 key areas:
1. Reduce organizational silos
2. Accept failure as normal
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything
The SRE approach to operations
Use data to guide decision-making. Treat operations like a software engineering problem:
● Hire people motivated and capable of writing automation.
● Use software to accomplish tasks normally done by sysadmins.
● Design more reliable and operable service architectures from the start.
What do SRE teams do?
Site Reliability Engineers develop solutions to design, build, and run large-scale systems scalably, reliably, and efficiently. We guide system architecture by operating at the intersection of software development and systems engineering.
SRE is a job function, a mindset, and a set of engineering approaches to running better production systems. We approach our work with a spirit of constructive pessimism: we hope for the best, but plan for the worst.
class SRE implements DevOps
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
DevOps 5 key areas:
1. Reduce organizational silos
2. Accept failure as normal
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything
Error Budgets
The key principle of SRE
2
How to measure reliability
Naive approach: time-based availability.
Availability = good time / total time
= the fraction of time the service is available and working
Relatively easy to measure for a continuous binary metric (e.g. machine uptime) and intuitive for humans.
Much harder for distributed request/response services:
– Is a server that currently gets no requests up or down?
– If 1 of 3 servers is down, is the service up or down?
How to measure reliability
More sophisticated approach: request-based availability.
Availability = good interactions / total interactions
= the fraction of real users for whom the service is available and working
Handles distributed request/response services well, and answers the awkward cases above:
– Is a server that currently gets no requests up or down?
– If 1 of 3 servers is down, is the service up or down?
A minimal sketch of both calculations follows.
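The two formulas can be compared directly in a few lines of code. This is an illustrative sketch only; the function names and example numbers are assumptions, not part of the slides.

```python
# Minimal sketch: computing availability both ways.

def time_based_availability(good_seconds: float, total_seconds: float) -> float:
    """Naive approach: fraction of time the service was up."""
    return good_seconds / total_seconds

def request_based_availability(good_interactions: int, total_interactions: int) -> float:
    """Request-based approach: fraction of user interactions that succeeded."""
    return good_interactions / total_interactions

if __name__ == "__main__":
    # 30-day window with 43.2 minutes of downtime -> 99.9% time-based availability.
    month = 30 * 24 * 3600
    print(time_based_availability(month - 43.2 * 60, month))        # ~0.999

    # 1,000,000 interactions with 500 failures -> 99.95% request-based availability.
    print(request_based_availability(1_000_000 - 500, 1_000_000))   # 0.9995
```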
Allowed unreliability window

Reliability level   per year        per quarter      per 30 days
90%                 36.5 days       9 days           3 days
95%                 18.25 days      4.5 days         1.5 days
99%                 3.65 days       21.6 hours       7.2 hours
99.5%               1.83 days       10.8 hours       3.6 hours
99.9%               8.76 hours      2.16 hours       43.2 minutes
99.95%              4.38 hours      1.08 hours       21.6 minutes
99.99%              52.6 minutes    12.96 minutes    4.32 minutes
99.999%             5.26 minutes    1.30 minutes     25.9 seconds

Source: https://landing.google.com/sre/sre-book/chapters/availability-table/
Allowed unreliability window
The same budget can be read as "how long can we run at a given error rate". For example, with a 21.6-minute error budget (a 99.95% target over 30 days), the budget lasts:

Error rate   Allowed duration
100%         21.6 minutes
10%          3.6 hours
1%           36 hours
0.1%         15 days
<0.05%       all month

Source: https://landing.google.com/sre/sre-book/chapters/availability-table/
A small calculator along these lines is sketched below.
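The arithmetic behind both tables is simple enough to script. This is a hedged sketch; the function names are made up for illustration.

```python
# Minimal sketch: turning a reliability target into an allowed-unreliability window,
# and into a "how long can we burn at a given error rate" figure.

def allowed_downtime_minutes(target: float, window_days: float) -> float:
    """Error budget = (1 - target) of the window, expressed in minutes."""
    return (1.0 - target) * window_days * 24 * 60

def time_to_exhaust_budget_minutes(target: float, window_days: float, error_rate: float) -> float:
    """How long the service can run at `error_rate` before the budget is gone."""
    return allowed_downtime_minutes(target, window_days) / error_rate

if __name__ == "__main__":
    print(allowed_downtime_minutes(0.999, 30))               # 43.2 minutes per 30 days
    print(time_to_exhaust_budget_minutes(0.9995, 30, 0.01))  # 2160 minutes = 36 hours at 1% errors
```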

“100% is the wrong reliability target for basically everything.”
Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google
Error budgets
● Product management & SRE define an availability target.
● 100% minus the availability target is a "budget of unreliability" (the error budget).
● Monitoring measures actual uptime.
● Control loop for utilizing the budget!



Benefits of error budgets
● Common incentive for devs and SREs: find the right balance between innovation and reliability.
● Dev team can manage the risk themselves: they decide how to spend their error budget.
● Unrealistic reliability goals become unattractive: such goals dampen the velocity of innovation.
● Dev team becomes self-policing: the error budget is a valuable resource for them.
● Shared responsibility for system uptime: infrastructure failures eat into the devs' error budget.
Glossary of terms
● SLI (service level indicator): a well-defined measure of 'successful enough'. Used to specify an SLO/SLA. Typically Func(metric) < threshold.
● SLO (service level objective): a top-line target for the fraction of successful interactions. Specifies a goal (SLI + goal).
● SLA (service level agreement): adds consequences. SLA = (SLO + margin) + consequences = SLI + goal + consequences.
SLO definition and measurement
Service-level objective (SLO): a target for SLIs aggregated over time.
● Measured using an SLI (service-level indicator).
● Typically, sum(SLI met) / window >= target percentage.
● Try to exceed the SLO target, but not by much.
Choosing an appropriate SLO is complex. Try to keep it simple, avoid absolutes; perfection can wait.
Why?
● Sets priorities and constraints for SRE and dev work.
● Sets user expectations about the level of service.
A small compliance check along these lines is sketched below.
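A minimal sketch of the sum(SLI met) / window >= target check. The data model (a list of per-interaction pass/fail results) is an assumption made for illustration.

```python
# Minimal sketch: checking an SLO over a window of per-interaction SLI outcomes.

def slo_met(sli_results: list[bool], target: float) -> bool:
    """sli_results holds one boolean per interaction in the window:
    True if the interaction was 'successful enough' per the SLI definition."""
    if not sli_results:
        return True  # no traffic in the window; nothing violated the SLO
    achieved = sum(sli_results) / len(sli_results)
    return achieved >= target

if __name__ == "__main__":
    window = [True] * 9990 + [False] * 10   # 99.9% of interactions met the SLI
    print(slo_met(window, target=0.999))    # True
    print(slo_met(window, target=0.9995))   # False
```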
Product lifecycle
Business process: Concept → Business → Development → Operations → Market
SLO & SRE solve this problem.
class SRE implements DevOps
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
DevOps 5 key areas:
1. Reduce organizational silos: Share ownership
2. Accept failure as normal: Error budgets
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything: Measure reliability
The practices of SRE
3
Areas of practice
● Metrics & Monitoring
● Capacity Planning
● Change Management
● Emergency Response
● Culture
Monitoring & Alerting
Monitoring: automate recording of system metrics.
● The primary means of determining and maintaining reliability.
Alerting: triggers a notification when conditions are detected.
● Page: immediate human response is required.
● Ticket: a human needs to take action, but not immediately.
Only involve humans when the SLO is threatened.
● Humans should never watch dashboards, read log files, and so on just to determine whether the system is okay.
A minimal paging-versus-ticketing sketch follows.
Demand forecasting and capacity planning
Plan for organic growth
● Increased product adoption and usage by customers.
Determine inorganic growth
● Sudden jumps in demand due to feature launches, marketing campaigns, etc.
Correlate raw resources to service capacity
● Make sure that you have enough spare capacity to meet your reliability goals.
A simple capacity calculation is sketched below.
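A minimal sketch of correlating raw resources to service capacity under a demand forecast. All figures, names, and the two-instance redundancy choice are illustrative assumptions.

```python
# Minimal sketch: translate a demand forecast into an instance count with spare capacity.

import math

def instances_needed(peak_qps: float, qps_per_instance: float, redundancy: int = 2) -> int:
    """Turn forecast demand into an instance count, keeping `redundancy` extra
    instances so the reliability goal survives instance failures."""
    return math.ceil(peak_qps / qps_per_instance) + redundancy

if __name__ == "__main__":
    organic_growth = 1.3     # +30% adoption expected this quarter
    launch_spike = 1.5       # inorganic jump expected from a feature launch
    forecast_peak = 4_000 * organic_growth * launch_spike   # current peak is 4,000 QPS
    print(instances_needed(forecast_peak, qps_per_instance=500))   # 16 + 2 = 18
```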
Efficiency and performance
Capacity can be expensive → optimize utilization.
● Resource use is a function of demand (load), capacity, and software efficiency.
● SRE demands prediction and provisioning, and can modify the software.
SRE monitors utilization and performance.
● Regressions can be detected and acted upon.
● Immature team: by adjusting the resources or by improving the software efficiency.
● Mature team: rollback.
Change management
Roughly 70% [1] of outages are due to changes in a live system.
Mitigations:
● Implement progressive rollouts
● Quickly and accurately detect problems
● Roll back changes safely when problems arise
Remove humans from the loop with automation to:
● Reduce errors
● Reduce fatigue
● Improve velocity
[1] Analysis of Google internal data, 2011-2018
A sketch of an automated progressive rollout follows.
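A minimal sketch of the three mitigations above: progressive rollout, fast detection, and automated rollback. The stage fractions, the threshold, and the hooks (set_traffic_fraction, get_error_rate, rollback) are hypothetical stand-ins for your own release tooling.

```python
# Minimal sketch: roll a change out in stages and roll back automatically on regression.

import time

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic on the new version
MAX_ERROR_RATE = 0.001                             # abort threshold (assumed)

def progressive_rollout(set_traffic_fraction, get_error_rate, rollback, soak_seconds=600):
    for fraction in ROLLOUT_STAGES:
        set_traffic_fraction(fraction)             # implement gradual changes
        time.sleep(soak_seconds)                   # let the change soak at this stage
        if get_error_rate() > MAX_ERROR_RATE:      # quickly and accurately detect problems
            rollback()                             # roll back safely, no human in the loop
            return False
    return True
```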
Pursuing maximum change velocity
100% is the wrong reliability target for basically everything.
● Determine the desired reliability for your product.
● Don't try to provide better quality than desired.
Spend error budget to increase development velocity.
● The goal is not zero outages, but maximum velocity within the error budget.
● Use the error budget for releases, experiments, etc.
A sketch of an error-budget-based release gate follows.
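A minimal sketch of the error-budget control loop as a release gate: spend budget on releases while budget remains, freeze when it is exhausted. The freeze-at-zero policy and function names are illustrative assumptions.

```python
# Minimal sketch: gate releases on remaining error budget.

def remaining_error_budget(target: float, good: int, total: int) -> float:
    """Fraction of the window's error budget still unspent."""
    allowed_failures = (1.0 - target) * total
    actual_failures = total - good
    return max(0.0, 1.0 - actual_failures / allowed_failures) if allowed_failures else 0.0

def may_release(target: float, good: int, total: int) -> bool:
    return remaining_error_budget(target, good, total) > 0.0

if __name__ == "__main__":
    # 99.9% target, 1,000,000 interactions, 400 failures -> 60% of the budget left.
    print(remaining_error_budget(0.999, 999_600, 1_000_000))  # 0.6
    print(may_release(0.999, 999_600, 1_000_000))             # True
```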
Provisioning
A combination of change management and capacity planning.
● Increase the size of an existing service instance/location.
● Spin up additional instances/locations.
Needs to be done quickly
● Unused capacity can be expensive.
Needs to be done correctly
● Added capacity needs to be tested.
● Often a significant configuration change → risky.
Emergency response
"Things break, that's life."
Few people naturally react well to emergencies, so you need a process:
● First of all, don't panic! You aren't alone and the sky isn't falling.
● Mitigate, troubleshoot, and fix.
● If you feel overwhelmed, pull in more people.
Incident & postmortem thresholds
● User-visible downtime or degradation beyond a certain threshold
● Data loss of any kind
● Significant on-call engineer intervention (release rollback, rerouting of traffic, etc.)
● A resolution time above some threshold
It is important to define incident & postmortem criteria before an incident occurs.

Postmortem philosophy
The primary goals of writing a postmortem are to ensure that:
● The incident is documented
● All contributing root causes are well understood
● Effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence
Postmortems are expected after any significant undesirable event.
● Writing a postmortem is not a punishment.
Blamelessness
● Postmortems must focus on identifying the contributing causes without indicting any individual or team.
● A blamelessly written postmortem assumes that everyone involved in an incident had good intentions.
● "Human" errors are systems problems. You can't "fix" people, but you can fix systems and processes to better support people in making the right choices.
● If a culture of finger pointing prevails, people will not bring issues to light for fear of punishment.
Toil management / operational work
Why? Because:
● Exposure to real failures guides how you design systems
● You can't automate everything
● If you do enough Ops work, you know what to automate
What? Work directly tied to running a service that is:
● Manual (manually running a script)
● Repetitive (done every day or for every new customer)
● Automatable (no human judgement is needed)
● Tactical (interrupt-driven and reactive)
● Without enduring value (no long-term system improvements)
● O(n) with service growth (grows with user count or service size)
A small checklist sketch of these attributes follows.
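A minimal sketch that encodes the checklist above as a scoring helper. The field names and the "three or more attributes" cutoff are assumptions made purely for illustration.

```python
# Minimal sketch: score a task against the toil attributes.

from dataclasses import dataclass

@dataclass
class Task:
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    grows_with_service: bool   # O(n) with user count or service size

def toil_score(task: Task) -> int:
    """Count how many toil attributes the task exhibits."""
    return sum(vars(task).values())

def is_toil(task: Task, cutoff: int = 3) -> bool:
    return toil_score(task) >= cutoff

if __name__ == "__main__":
    ticket_triage = Task(manual=True, repetitive=True, automatable=True,
                         tactical=True, no_enduring_value=True, grows_with_service=True)
    print(is_toil(ticket_triage))   # True: a strong candidate for automation
```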
Team skills
Hire good software engineers (SWE) and good systems engineers (SE), not necessarily all in one person. Try to get a 50:50 mix of SWE and SE skillsets on the team. Everyone should be able to code. SE != "ops work".
For more detail, see "Hiring Site Reliability Engineers," by Chris Jones, Todd Underwood, and Shylaja Nukala, ;login:, June 2015.
Empowering SREs
● SREs must be empowered to enforce the error budget and toil budget.
● SREs are valuable and scarce. Use their time wisely.
● Avoid forcing SREs to take on too much operational burden; load-shed to keep the team healthy.


Recap of SRE practices
● Metrics & Monitoring: SLOs, Dashboards, Analytics
● Capacity Planning: Forecasting, Demand-driven, Performance
● Change Management: Release process, Consulting design, Automation
● Emergency Response: Oncall, Analysis, Postmortems
● Culture: Toil management, Engineering alignment, Blamelessness
class SRE implements DevOps
DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
DevOps 5 key areas:
1. Reduce organizational silos: Share ownership
2. Accept failure as normal: Error budgets & blameless postmortems
3. Implement gradual changes: Reduce cost of failure
4. Leverage tooling and automation: Automate common cases
5. Measure everything: Measure toil and reliability
How to get started
4
Do these four things.
1. Start with Service Level Objectives. SRE teams work to an SLO and/or error budget. They defend the SLO.
2. Hire people who write software. They'll quickly become bored by performing tasks by hand and replace manual work.
3. Ensure parity of respect with the rest of the development/engineering organization.
4. Provide a feedback loop for self-regulation. SRE teams choose their work. SREs must be able to shed work or reduce SLOs when overloaded.
You can do this.
● Pick one service to run according to the SRE model.
● Empower the team with strong executive sponsorship and support.
● Culture and psychological safety are critical.
● Measure Service Level Objectives & team health.
● Incremental progress frees time for more progress.
Spread the love.
● Spread the techniques and knowledge once you have a solid case study within your company.
● If you have well-defined SLOs, Google can work with you to reduce friction via shared monitoring and other collaboration.
SRE solves cloud reliability.
● Effortless scale shouldn't meet escalating operational demands.
● Automation and engineering for operability enable scaling systems without scaling organizations.
● Tension between product development and operations doesn't need to exist.
● Error budgets provide measurement and flexibility to deliver both reliability and product velocity.
Learning & Certification
Find Google SRE publications—including the SRE Books, articles, trainings,
and more—for free at sre.google/resources.



Site Reliability Engineering: Measuring and Managing Reliability
https://www.coursera.org/learn/site-reliability-engineering-slos
Google Cloud Certifications
● Foundational: cloud knowledge and working in the cloud.
  Cloud Digital Leader
● Associate: recommended 6+ months hands-on experience with GCP.
  Associate Cloud Engineer
● Professional: recommended 3+ years industry experience & 1 year hands-on experience with GCP.
  Professional Cloud Architect, Professional Cloud Developer, Professional Data Engineer, Professional Cloud DevOps Engineer, Professional Cloud Network Engineer, Professional Cloud Security Engineer, Professional Machine Learning Engineer, Professional Collaboration Engineer
Questions?
Thank you
