Failure
Fault Injection and Service Resilience at Netflix
Josh Evans
Director of Operations Engineering, Netflix
Josh Evans
24 years in technology
Tech support, Tools, Test Automation, IT & QA Management
Time at Netflix: ~15 years
Ecommerce, streaming, tools, services, operations
Current Role: Director of Operations Engineering
Netflix Ecosystem
[Diagram: Static Content (Akamai), AWS/Netflix Control Plane, ISP, Netflix CDN, Service Partners]
[Chart: Availability (nines) vs. Rate of Change. Annual downtime by availability level: 99.9999% = 31.5 seconds, 99.999% = 5.26 minutes, 99.99% = 52.56 minutes, 99.9% = 8.76 hours, 99% = 3.65 days, 90% = 36.5 days.]
[Chart: Availability (nines) vs. Rate of Change, annotated with Engineering, Operations Best Practices, and Continuous Improvement.]
Is that enough?
No
How do we know if we've succeeded?
Does the system work as designed?
Is it as resilient as we believe?
How do we prevent drifting into failure?
Auto-Scaling
Virtual instance clusters that scale and shrink with traffic
Reactive and predictive mechanisms
Auto-replacement of bad instances
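As a rough illustration of the reactive side, a scaling rule can periodically compare a cluster utilization metric against scale-up and scale-down thresholds. The sketch below is illustrative only; the ClusterApi interface and the thresholds are assumptions, not Netflix's auto-scaling implementation.

```java
// Illustrative reactive auto-scaling rule (not Netflix's implementation).
// ClusterApi and the thresholds below are assumptions for the sketch.
public class ReactiveScaler {

    interface ClusterApi {
        double averageCpuUtilization();   // e.g. from a metrics pipeline
        int currentInstanceCount();
        void setDesiredInstanceCount(int count);
    }

    private static final double SCALE_UP_THRESHOLD = 0.60;   // 60% CPU
    private static final double SCALE_DOWN_THRESHOLD = 0.30; // 30% CPU
    private static final int MIN_INSTANCES = 2;

    private final ClusterApi cluster;

    public ReactiveScaler(ClusterApi cluster) {
        this.cluster = cluster;
    }

    /** Called periodically (e.g. once a minute) by a scheduler. */
    public void evaluate() {
        double cpu = cluster.averageCpuUtilization();
        int count = cluster.currentInstanceCount();

        if (cpu > SCALE_UP_THRESHOLD) {
            // Scale up aggressively so traffic spikes are absorbed quickly.
            cluster.setDesiredInstanceCount(count + Math.max(1, count / 10));
        } else if (cpu < SCALE_DOWN_THRESHOLD && count > MIN_INSTANCES) {
            // Scale down conservatively, one instance at a time.
            cluster.setDesiredInstanceCount(count - 1);
        }
    }
}
```

Predictive mechanisms layer on top of a rule like this by pre-scaling for expected traffic rather than reacting to it.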
Circuit Breakers
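Circuit breakers wrap calls to remote dependencies so a failing dependency is cut off quickly and a fallback is served instead (Netflix's implementation was open-sourced as Hystrix). Below is a minimal, stand-alone sketch of the idea, not the Hystrix API; thresholds and names are assumptions.

```java
// Minimal circuit-breaker sketch: trip open after repeated failures, serve a
// fallback while open, and probe the dependency again after a cooldown.
import java.util.function.Supplier;

public class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public T call(Supplier<T> primary, Supplier<T> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < cooldownMillis;
        if (open) {
            return fallback.get();           // fail fast while the breaker is open
        }
        try {
            T result = primary.get();        // normal call, or half-open probe
            consecutiveFailures = 0;         // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();           // degrade gracefully
        }
    }
}
```

A caller would wrap each remote dependency, e.g. `breaker.call(() -> recommendationClient.fetch(userId), () -> defaultRow())` (hypothetical names), so a struggling dependency degrades to a fallback instead of tying up request threads.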
An Instance Fails
Monkey loose in your DC
Run during business hours
Instances fail all the time
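The tool behind this slide is Chaos Monkey, which randomly terminates production instances during business hours. A minimal sketch of that core loop follows; the class and method names are assumptions, and the real tool (open-sourced with the Simian Army) adds opt-outs, termination probabilities, and auditing.

```java
// Illustrative core of a "chaos monkey": during business hours, pick one
// random instance in a cluster and terminate it. Names are assumptions.
import java.time.LocalTime;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

public class InstanceKiller {
    private final Random random = new Random();

    public void maybeKillOne(List<String> instanceIds, Consumer<String> terminate) {
        LocalTime now = LocalTime.now();
        // Only run during business hours so engineers are around to respond.
        if (now.isBefore(LocalTime.of(9, 0)) || now.isAfter(LocalTime.of(17, 0))) {
            return;
        }
        if (instanceIds.isEmpty()) {
            return;
        }
        String victim = instanceIds.get(random.nextInt(instanceIds.size()));
        terminate.accept(victim);  // e.g. hand off to an EC2 termination call
    }
}
```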
What we learned
State is problematic
Auto-replacement works
Surviving a single instance failure is not enough
What we encountered
What we learned
Large-scale events are hard to simulate
Hundreds of clusters, thousands of instances
Rapidly shifting traffic is error prone
LBs must expire connections quickly
Lingering connections to caches must be addressed
Not all clusters pre-scaled for additional load
What we learned
Hidden assumptions & configurations
Some apps not configured for cross-zone calls
Mismatched timeouts, fallbacks prevented fail-over
REST client preservation mode prevented fail-over
Cassandra works as expected
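One of the failure modes above, mismatched timeouts preventing fail-over, comes down to arithmetic: if a dependency client is allowed to wait longer than the calling service's own request budget, the caller times out before the fallback path can ever run. A small illustration with assumed numbers:

```java
// Why mismatched timeouts prevent fail-over (illustrative numbers only).
// If the dependency client waits longer than the caller's own budget,
// the caller times out first and the fallback never runs.
public class TimeoutBudgets {
    public static void main(String[] args) {
        long callerBudgetMs = 1_000;       // edge gives this request 1s in total
        long dependencyTimeoutMs = 10_000; // misconfigured client waits up to 10s
        long fallbackCostMs = 50;          // serving the fallback takes ~50ms

        if (dependencyTimeoutMs + fallbackCostMs > callerBudgetMs) {
            System.out.println("Fallback can never run inside the caller's budget;"
                    + " lower the dependency timeout below "
                    + (callerBudgetMs - fallbackCostMs) + "ms.");
        }
    }
}
```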
Regrouping
From zone outage to zone evacuation
Carefully deregistered instances
Staged traffic shifts
Resuming true outage simulations soon
Regions Fail
Chaos Kong
[Diagram: customer devices route through geo-located regional load balancers to two regions, each with availability zones AZ1-AZ3 and their data.]
What we learned
It works!
Disable predictive auto-scaling
Use instance counts from previous day
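Sizing the surviving region from the previous day's instance counts, rather than trusting predictive auto-scaling mid-exercise, can be expressed as a simple rule. The sketch below is illustrative only; the CapacityStore and ScalingControl interfaces and the headroom factor are assumptions, not Netflix tooling.

```java
// Illustrative sizing rule for a region-failover exercise: pin each cluster's
// desired capacity to yesterday's observed instance count (plus headroom)
// instead of letting predictive auto-scaling decide. Names are assumptions.
import java.util.Map;

public class FailoverCapacityPlanner {

    interface CapacityStore {
        /** Instance counts per cluster observed at this time yesterday. */
        Map<String, Integer> instanceCountsForPreviousDay();
    }

    interface ScalingControl {
        void disablePredictiveScaling(String cluster);
        void setDesiredCapacity(String cluster, int instances);
    }

    public void prepareForFailover(CapacityStore store, ScalingControl scaling,
                                   double headroom) {
        for (Map.Entry<String, Integer> e : store.instanceCountsForPreviousDay().entrySet()) {
            scaling.disablePredictiveScaling(e.getKey());
            // Yesterday's count is a known-good baseline for today's traffic shape.
            scaling.setDesiredCapacity(e.getKey(),
                    (int) Math.ceil(e.getValue() * headroom));
        }
    }
}
```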
Service Architecture
[Diagram: Device → Internet → ELB → Zuul (edge) → Service A → Services B and C, all running in AWS.]
Latency Monkey
[Diagram: the same request path (Device → Internet → ELB → Zuul (edge) → Service A → Services B and C) with latency injected into inter-service calls.]
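Latency Monkey exercises these paths by injecting artificial delays into calls between services so that timeouts and fallbacks actually fire. Below is a stand-alone sketch of a delay-injecting client wrapper; the rates, delays, and names are assumptions, not the real tool.

```java
// Illustrative latency injection: wrap a remote call and, for a small
// percentage of requests, add an artificial delay before invoking it.
import java.util.Random;
import java.util.function.Supplier;

public class LatencyInjectingClient<T> {
    private final Random random = new Random();
    private final double injectionRate;   // e.g. 0.05 = 5% of calls
    private final long delayMillis;       // artificial latency to add

    public LatencyInjectingClient(double injectionRate, long delayMillis) {
        this.injectionRate = injectionRate;
        this.delayMillis = delayMillis;
    }

    public T call(Supplier<T> remoteCall) {
        if (random.nextDouble() < injectionRate) {
            try {
                Thread.sleep(delayMillis);   // simulate a slow dependency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return remoteCall.get();
    }
}
```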
What we learned
[Diagram: Device → Internet → ELB → Zuul (edge) → Service A → Services B and C.]
Benefits
Confidence building for Latency Monkey testing
Continuous resilience testing in test and production
Testing of minimum viable service, fallbacks
Device resilience evaluation
Is that it?
Fault injection isn't enough
Bad code/deployments
Configuration mishaps
Byzantine failures
Memory leaks
Performance degradation
A Multi-Faceted Approach
Technical
Culture
You build it, you run it
Each failure is an opportunity to learn
Blameless incident reviews
Commitment to continuous improvement
Josh Evans
Director of Operations Engineering, Netflix
[email protected], @josh_evans_nflx