
Embracing Failure
Fault Injection and Service Resilience at Netflix

Josh Evans
Director of Operations Engineering, Netflix

Josh Evans
24 years in technology
Tech support, tools, test automation, IT & QA management

Time at Netflix: ~15 years
Ecommerce, streaming, tools, services, operations

Current role: Director of Operations Engineering

Netflix Ecosystem

[Diagram: static content served by Akamai, the AWS/Netflix control plane, the Netflix CDN, ISPs, and service partners]

48 million members, 41 countries
> 1 billion hours per month
> 1000 device types
3 AWS Regions, hundreds of services
Hundreds of thousands of requests/second
Partner-provided services (Xbox Live, PSN)
CDN serving petabytes of data at terabits/second

Our Focus is on Quality and Velocity


[Chart: Availability vs. Rate of Change, with availability (in nines) plotted against rate of change and annotated with annual downtime equivalents]

Availability   Downtime per year
99.9999%       31.5 seconds
99.999%        5.26 minutes
99.99%         52.56 minutes
99.9%          8.76 hours
99%            3.65 days
90%            36.5 days
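As a sanity check on these figures: annual downtime is (1 − availability) × 525,600 minutes per year, so 99.99% availability allows 0.0001 × 525,600 ≈ 52.6 minutes of downtime per year.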

We Seek 99.99% Availability for Starts


[Chart repeated: Availability vs. Rate of Change]

Our goal is to shift the curve


[Chart: Availability vs. Rate of Change, showing the curve shifted up and to the right through engineering, operational best practices, and continuous improvement]

Availability means that members can


sign up
activate a device
browse
watch

What keeps us up at night

Failures happen all the time


Disks fail
Power goes out, and your generator fails
Software bugs
Human error

Failure is unavoidable

We design for failure


Exception handling
Auto-scaling clusters
Redundancy
Fault tolerance and isolation
Fall-backs and degraded experiences
Protect the customer from failures

Is that enough?

No
How do we know if we've succeeded?
Does the system work as designed?
Is it as resilient as we believe?
How do we prevent drifting into failure?

We test for failure


Unit testing
Integration testing
Stress/load testing
Simulation matrices

Testing increases confidence, but is that enough?

Testing distributed systems is hard


Massive, changing data sets
Web-scale traffic
Complex interactions and information flows
Asynchronous requests
3rd party services
All while innovating and improving our service

What if we regularly inject failures into our systems under controlled circumstances?

Embracing Failure in Production


Don't wait for random failures
Cause failure to validate resiliency
Test design assumptions by stressing them
Remove uncertainty by forcing failures regularly

Two Key Concepts

Auto-Scaling
Virtual instance clusters that scale and shrink with traffic
Reactive and predictive mechanisms
Auto-replacement of bad instances

Circuit Breakers
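The slides name circuit breakers without spelling out the mechanics. The idea, popularized at Netflix by the Hystrix library, is to wrap each remote dependency call, trip after repeated failures, and serve a fallback so a failing dependency degrades the experience instead of cascading. Below is a minimal, hypothetical sketch of that pattern; the class and parameter names are illustrative and this is not Hystrix's API.

```java
// Minimal circuit-breaker sketch (illustrative only, not Netflix's Hystrix implementation).
// After `failureThreshold` consecutive failures the breaker opens and the fallback is served
// immediately; once `retryAfterMillis` has passed, a trial call is allowed through again.
import java.util.function.Supplier;

public class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public T call(Supplier<T> primary, Supplier<T> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < retryAfterMillis;
        if (open) {
            return fallback.get();           // fail fast: protect callers and the failing service
        }
        try {
            T result = primary.get();        // the real remote call
            consecutiveFailures = 0;         // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();   // (re)open the breaker
            }
            return fallback.get();           // degraded experience instead of an error
        }
    }
}
```

A member-facing service could use something like this to wrap a call to a recommendations dependency and fall back to a precomputed list when the dependency misbehaves.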

An Instance Fails
Monkey loose in your DC
Run during business hours
Instances fail all the time

What we learned

State is problematic
Auto-replacement works
Surviving a single instance failure is not enough
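As a hedged illustration of the "monkey loose in your DC" idea (Netflix's Chaos Monkey), the core loop is simply: during business hours, pick one random instance per cluster and terminate it, so auto-replacement and statelessness are exercised every day. This is not the actual Simian Army code; the helper names below are hypothetical.

```java
// Sketch of the Chaos Monkey idea: pick one random instance in a cluster and terminate it.
// The terminateInstance helper is hypothetical; a real tool would call the cloud provider's
// termination API and respect an opt-out list and a business-hours schedule.
import java.util.List;
import java.util.Random;

public class ChaosMonkeySketch {
    private final Random random = new Random();

    /** Run during business hours so engineers are around to observe and respond. */
    public void unleash(String cluster, List<String> instanceIds) {
        if (instanceIds.isEmpty()) {
            return;                          // nothing to do for an empty cluster
        }
        String victim = instanceIds.get(random.nextInt(instanceIds.size()));
        System.out.printf("Chaos Monkey terminating %s in cluster %s%n", victim, cluster);
        terminateInstance(victim);           // hypothetical call into the cloud API
    }

    private void terminateInstance(String instanceId) {
        // Placeholder: a real implementation issues a termination request for instanceId.
    }
}
```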

A Data Center Fails


Chaos Gorilla

Simulate an availability zone outage
3-zone configuration
Eliminate one zone
Ensure that the others can handle the load and nothing breaks
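A rough capacity sanity check lies behind this kind of exercise: with three zones, losing one pushes its traffic onto the other two, so each survivor must absorb roughly 50% more load than it normally carries. The numbers and names below are hypothetical, not Netflix figures.

```java
// Back-of-the-envelope check for a zone-evacuation exercise (illustrative only):
// after losing one zone, the remaining zones share its traffic evenly.
public class ZoneEvacuationCheck {
    /** True if surviving zones still have headroom after absorbing the lost zone's traffic. */
    public static boolean canSurviveZoneLoss(double perZoneUtilization, int zones) {
        double survivorUtilization = perZoneUtilization * zones / (zones - 1);
        return survivorUtilization <= 1.0;
    }

    public static void main(String[] args) {
        // At 60% utilization per zone, losing 1 of 3 zones pushes the survivors to 90%.
        System.out.println(canSurviveZoneLoss(0.60, 3));   // true, but with little margin left
    }
}
```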

What we encountered

What we learned
Large scale events are hard to simulate
Hundreds of clusters, thousands of instances


Rapidly shifting traffic is error prone
LBs must expire connections quickly
Lingering connections to caches must be addressed
Not all clusters pre-scaled for additional load

What we learned
Hidden assumptions & configurations
Some apps not configured for cross-zone calls
Mismatched timeouts and fallbacks prevented fail-over
REST client preservation mode prevented fail-over

Cassandra works as expected

Regrouping
From zone outage to zone evacuation
Carefully deregistered instances
Staged traffic shifts

Resuming true outage simulations soon

Regions Fail
Chaos Kong

[Diagram: customer devices reach geo-located regional load balancers; Zuul traffic shaping/routing in each region directs requests across availability zones AZ1, AZ2, and AZ3 and their data stores; failing a region shifts its traffic to the other region]
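A hedged sketch of the routing decision a Chaos Kong exercise stresses (names are illustrative, not the actual Zuul or DNS configuration): requests normally go to the member's geo-preferred region, and when that region is marked failed, they are redirected to a surviving region.

```java
// Illustrative regional failover decision (hypothetical names, not Netflix's routing code).
import java.util.Set;

public class RegionalRouter {
    private final Set<String> healthyRegions;

    public RegionalRouter(Set<String> healthyRegions) {
        this.healthyRegions = healthyRegions;
    }

    /** Route to the geo-preferred region unless it has been marked as failed. */
    public String route(String preferredRegion, String failoverRegion) {
        return healthyRegions.contains(preferredRegion) ? preferredRegion : failoverRegion;
    }

    public static void main(String[] args) {
        RegionalRouter router = new RegionalRouter(Set.of("us-west-2"));
        // us-east-1 is "failed" in this exercise, so its traffic shifts to us-west-2.
        System.out.println(router.route("us-east-1", "us-west-2"));
    }
}
```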

What we learned
It works!
Disable predictive auto-scaling
Use instance counts from previous day

Room for Improvement


Not a true regional outage simulation
Staged migration
No split brain

Not everything fails completely


Latency Monkey

Simulate latent service calls


Inject arbitrary latency and errors at the service level
Observe for effects

Service Architecture

[Diagram: device -> Internet -> ELB -> Zuul -> Edge, which fans out to services A, B, and C inside AWS]

Latency Monkey

[Diagram: same request path as above, with Latency Monkey injecting faults at the service level]

Server-side URI filters
All requests
URI pattern match
Percentage of requests

Arbitrary delays or responses
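As a rough sketch of such a server-side filter (class, field, and parameter names below are hypothetical, not the real tool's configuration), a request whose URI matches the configured pattern is delayed for a configured fraction of traffic before the real handler runs.

```java
// Sketch of a server-side latency-injection filter in the spirit of Latency Monkey
// (illustrative only; the real tool's configuration and injection points differ).
import java.util.Random;
import java.util.regex.Pattern;

public class LatencyInjector {
    private final Pattern uriPattern;      // which requests are eligible for injection
    private final double injectionRate;    // fraction of matching requests to affect (0.0 - 1.0)
    private final long delayMillis;        // arbitrary delay to add before handling the request
    private final Random random = new Random();

    public LatencyInjector(String uriRegex, double injectionRate, long delayMillis) {
        this.uriPattern = Pattern.compile(uriRegex);
        this.injectionRate = injectionRate;
        this.delayMillis = delayMillis;
    }

    /** Call before the real handler; sleeps to simulate a slow or degraded dependency. */
    public void maybeInject(String requestUri) throws InterruptedException {
        if (uriPattern.matcher(requestUri).matches() && random.nextDouble() < injectionRate) {
            Thread.sleep(delayMillis);
        }
    }
}
```

The same hook can return an error response instead of sleeping to simulate a failing, rather than merely slow, dependency.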

What we learned

Startup resiliency is an issue


Service owners don't know all dependencies
Fallbacks can fail too
Second-order effects not easily tested
Dependencies change over time
Holistic view is necessary
Some teams opt out

Fault Injection Testing (FIT)


Request-level simulations

[Diagram: same request path as above; at Zuul, a device or account override tags individual requests for fault injection as they flow through the Edge to services A, B, and C]
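A hedged sketch of request-level injection (the class names, scenario format, and test account below are hypothetical, not the actual FIT implementation): a fault scenario is attached only to requests from a designated test account or device, so a failure can be simulated end to end in production without affecting other members.

```java
// Illustrative request-scoped fault injection in the spirit of FIT (hypothetical names only).
import java.util.Map;
import java.util.Optional;

public class FitFilterSketch {
    /** accountId -> name of the dependency whose failure should be simulated for that account. */
    private final Map<String, String> scenariosByAccount;

    public FitFilterSketch(Map<String, String> scenariosByAccount) {
        this.scenariosByAccount = scenariosByAccount;
    }

    /** At the edge (e.g. in a Zuul filter), tag matching requests with a fault-injection context. */
    public Optional<String> faultContextFor(String accountId) {
        return Optional.ofNullable(scenariosByAccount.get(accountId));
    }

    /** In a downstream service, an injection point consults the propagated context. */
    public static boolean shouldFail(Optional<String> faultContext, String dependencyName) {
        return faultContext.map(dependencyName::equals).orElse(false);
    }

    public static void main(String[] args) {
        FitFilterSketch fit = new FitFilterSketch(Map.of("test-account-123", "ServiceB"));
        Optional<String> ctx = fit.faultContextFor("test-account-123");
        System.out.println(shouldFail(ctx, "ServiceB"));   // true: simulate ServiceB failing
        System.out.println(shouldFail(ctx, "ServiceC"));   // false: ServiceC behaves normally
    }
}
```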

Benefits
Confidence building for Latency Monkey testing
Continuous resilience testing in test and production
Testing of minimum viable service, fallbacks
Device resilience evaluation

Device Resiliency Matrix

Is that it?
Fault injection isn't enough
Bad code/deployments
Configuration mishaps
Byzantine failures
Memory leaks
Performance degradation

A Multi-Faceted Approach

Continuous Build, Delivery, Deployment


Test Environments, Infrastructure, Coverage
Automated Canary Analysis
Staged Configuration Changes
Crisis Response Tooling & Operations
Real-time Analytics, Detection, Alerting
Operational Insight - Dashboards, Reports
Performance and Efficiency Engagements

It's also about people and culture

Technical Culture
You build it, you run it
Each failure is an opportunity to learn
Blameless incident reviews
Commitment to continuous improvement

Context and Collaboration


Context engages partners
Data and root causes
Global vs. local
Urgent vs. important
Long term vision

Collaboration yields better solutions and buy-in

The Simian Army is part of the Netflix open source cloud platform

http://netflix.github.com

Josh Evans
Director of Operations Engineering, Netflix
[email protected], @josh_evans_nflx
