
Embracing Failure
Fault Injection and Service Resilience at Netflix

Josh Evans
Director of Operations Engineering, Netflix

Josh Evans
24 years in technology
Tech support, tools, test automation, IT & QA management

Time at Netflix: ~15 years
Ecommerce, streaming, tools, services, operations

Current role: Director of Operations Engineering

Netflix Ecosystem

[Diagram: static content served by Akamai, the AWS/Netflix control plane, the Netflix CDN, ISPs, and service partners]

48 million members, 41 countries
> 1 billion hours per month
> 1000 device types
3 AWS Regions, hundreds of services
Hundreds of thousands of requests/second
Partner-provided services (Xbox Live, PSN)
CDN serving petabytes of data at terabits/second

Our Focus is on Quality and Velocity


[Chart: Availability vs. Rate of Change, with availability (in nines) plotted against rate of change and annotated with annual downtime equivalents]

Availability   Downtime per year
99.9999%       31.5 seconds
99.999%        5.26 minutes
99.99%         52.56 minutes
99.9%          8.76 hours
99%            3.65 days
90%            36.5 days
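As a sanity check on these figures: annual downtime is (1 − availability) × 525,600 minutes per year, so 99.99% availability allows 0.0001 × 525,600 ≈ 52.6 minutes of downtime per year.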

We Seek 99.99% Availability for Starts


[Chart repeated: Availability vs. Rate of Change]

Our goal is to shift the curve


[Chart: Availability vs. Rate of Change, showing the curve shifted up and to the right through engineering, operational best practices, and continuous improvement]

Availability means that members can


sign up
activate a device
browse
watch

What keeps us up at night

Failures happen all the time


Disks fail
Power goes out, and your generator fails
Software bugs
Human error

Failure is unavoidable

We design for failure


Exception handling
Auto-scaling clusters
Redundancy
Fault tolerance and isolation
Fall-backs and degraded experiences
Protect the customer from failures

Is that enough?

No
How do we know if we've succeeded?
Does the system work as designed?
Is it as resilient as we believe?
How do we prevent drifting into failure?

We test for failure


Unit testing
Integration testing
Stress/load testing
Simulation matrices

Testing increases confidence, but is that enough?

Testing distributed systems is hard


Massive, changing data sets
Web-scale traffic
Complex interactions and information flows
Asynchronous requests
3rd party services
All while innovating and improving our service

What if we regularly inject failures into our systems under controlled circumstances?

Embracing Failure in Production


Don't wait for random failures
Cause failure to validate resiliency
Test design assumptions by stressing them
Remove uncertainty by forcing failures regularly

Two Key Concepts

Auto-Scaling
Virtual instance clusters that scale and shrink with traffic
Reactive and predictive mechanisms
Auto-replacement of bad instances

Circuit Breakers
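The slides name circuit breakers without spelling out the mechanics. The idea, popularized at Netflix by the Hystrix library, is to wrap each remote dependency call, trip after repeated failures, and serve a fallback so a failing dependency degrades the experience instead of cascading. Below is a minimal, hypothetical sketch of that pattern; the class and parameter names are illustrative and this is not Hystrix's API.

```java
// Minimal circuit-breaker sketch (illustrative only, not Netflix's Hystrix implementation).
// After `failureThreshold` consecutive failures the breaker opens and the fallback is served
// immediately; once `retryAfterMillis` has passed, a trial call is allowed through again.
import java.util.function.Supplier;

public class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public T call(Supplier<T> primary, Supplier<T> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < retryAfterMillis;
        if (open) {
            return fallback.get();           // fail fast: protect callers and the failing service
        }
        try {
            T result = primary.get();        // the real remote call
            consecutiveFailures = 0;         // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();   // (re)open the breaker
            }
            return fallback.get();           // degraded experience instead of an error
        }
    }
}
```

A member-facing service could use something like this to wrap a call to a recommendations dependency and fall back to a precomputed list when the dependency misbehaves.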

An Instance Fails
Monkey loose in your DC
Run during business hours
Instances fail all the time

What we learned

State is problematic
Auto-replacement works
Surviving a single instance failure is not enough
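As a hedged illustration of the "monkey loose in your DC" idea (Netflix's Chaos Monkey), the core loop is simply: during business hours, pick one random instance per cluster and terminate it, so auto-replacement and statelessness are exercised every day. This is not the actual Simian Army code; the helper names below are hypothetical.

```java
// Sketch of the Chaos Monkey idea: pick one random instance in a cluster and terminate it.
// The terminateInstance helper is hypothetical; a real tool would call the cloud provider's
// termination API and respect an opt-out list and a business-hours schedule.
import java.util.List;
import java.util.Random;

public class ChaosMonkeySketch {
    private final Random random = new Random();

    /** Run during business hours so engineers are around to observe and respond. */
    public void unleash(String cluster, List<String> instanceIds) {
        if (instanceIds.isEmpty()) {
            return;                          // nothing to do for an empty cluster
        }
        String victim = instanceIds.get(random.nextInt(instanceIds.size()));
        System.out.printf("Chaos Monkey terminating %s in cluster %s%n", victim, cluster);
        terminateInstance(victim);           // hypothetical call into the cloud API
    }

    private void terminateInstance(String instanceId) {
        // Placeholder: a real implementation issues a termination request for instanceId.
    }
}
```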

A Data Center Fails


Chaos Gorilla

Simulate an availability zone outage
3-zone configuration
Eliminate one zone
Ensure that the others can handle the load and nothing breaks
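A rough capacity sanity check lies behind this kind of exercise: with three zones, losing one pushes its traffic onto the other two, so each survivor must absorb roughly 50% more load than it normally carries. The numbers and names below are hypothetical, not Netflix figures.

```java
// Back-of-the-envelope check for a zone-evacuation exercise (illustrative only):
// after losing one zone, the remaining zones share its traffic evenly.
public class ZoneEvacuationCheck {
    /** True if surviving zones still have headroom after absorbing the lost zone's traffic. */
    public static boolean canSurviveZoneLoss(double perZoneUtilization, int zones) {
        double survivorUtilization = perZoneUtilization * zones / (zones - 1);
        return survivorUtilization <= 1.0;
    }

    public static void main(String[] args) {
        // At 60% utilization per zone, losing 1 of 3 zones pushes the survivors to 90%.
        System.out.println(canSurviveZoneLoss(0.60, 3));   // true, but with little margin left
    }
}
```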

What we encountered

What we learned
Large scale events are hard to simulate
Hundreds of clusters, thousands of instances


Rapidly shifting traffic is error prone
LBs must expire connections quickly
Lingering connections to caches must be addressed
Not all clusters pre-scaled for additional load

What we learned
Hidden assumptions & configurations
Some apps not configured for cross-zone calls
Mismatched timeouts and fallbacks prevented fail-over
REST client preservation mode prevented fail-over

Cassandra works as expected

Regrouping
From zone outage to zone evacuation
Carefully deregistered instances
Staged traffic shifts

Resuming true outage simulations soon

Regions Fail
Chaos Kong

[Diagram: customer devices reach geo-located regional load balancers; Zuul traffic shaping/routing in each region directs requests across availability zones AZ1, AZ2, and AZ3 and their data stores; failing a region shifts its traffic to the other region]
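A hedged sketch of the routing decision a Chaos Kong exercise stresses (names are illustrative, not the actual Zuul or DNS configuration): requests normally go to the member's geo-preferred region, and when that region is marked failed, they are redirected to a surviving region.

```java
// Illustrative regional failover decision (hypothetical names, not Netflix's routing code).
import java.util.Set;

public class RegionalRouter {
    private final Set<String> healthyRegions;

    public RegionalRouter(Set<String> healthyRegions) {
        this.healthyRegions = healthyRegions;
    }

    /** Route to the geo-preferred region unless it has been marked as failed. */
    public String route(String preferredRegion, String failoverRegion) {
        return healthyRegions.contains(preferredRegion) ? preferredRegion : failoverRegion;
    }

    public static void main(String[] args) {
        RegionalRouter router = new RegionalRouter(Set.of("us-west-2"));
        // us-east-1 is "failed" in this exercise, so its traffic shifts to us-west-2.
        System.out.println(router.route("us-east-1", "us-west-2"));
    }
}
```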

What we learned
It works!
Disable predictive auto-scaling
Use instance counts from previous day

Room for Improvement


Not a true regional outage simulation
Staged migration
No split brain

Not everything fails completely


Latency Monkey

Simulate latent service calls


Inject arbitrary latency and errors at the service level
Observe for effects

Service Architecture

[Diagram: device -> Internet -> ELB -> Zuul -> Edge, which fans out to services A, B, and C inside AWS]

Latency Monkey

[Diagram: same request path as above, with Latency Monkey injecting faults at the service level]

Server-side URI filters
All requests
URI pattern match
Percentage of requests

Arbitrary delays or responses
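As a rough sketch of such a server-side filter (class, field, and parameter names below are hypothetical, not the real tool's configuration), a request whose URI matches the configured pattern is delayed for a configured fraction of traffic before the real handler runs.

```java
// Sketch of a server-side latency-injection filter in the spirit of Latency Monkey
// (illustrative only; the real tool's configuration and injection points differ).
import java.util.Random;
import java.util.regex.Pattern;

public class LatencyInjector {
    private final Pattern uriPattern;      // which requests are eligible for injection
    private final double injectionRate;    // fraction of matching requests to affect (0.0 - 1.0)
    private final long delayMillis;        // arbitrary delay to add before handling the request
    private final Random random = new Random();

    public LatencyInjector(String uriRegex, double injectionRate, long delayMillis) {
        this.uriPattern = Pattern.compile(uriRegex);
        this.injectionRate = injectionRate;
        this.delayMillis = delayMillis;
    }

    /** Call before the real handler; sleeps to simulate a slow or degraded dependency. */
    public void maybeInject(String requestUri) throws InterruptedException {
        if (uriPattern.matcher(requestUri).matches() && random.nextDouble() < injectionRate) {
            Thread.sleep(delayMillis);
        }
    }
}
```

The same hook can return an error response instead of sleeping to simulate a failing, rather than merely slow, dependency.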

What we learned

Startup resiliency is an issue


Service owners don't know all dependencies
Fallbacks can fail too
Second-order effects not easily tested
Dependencies change over time
Holistic view is necessary
Some teams opt out

Fault Injection Testing (FIT)


Request-level simulations

[Diagram: same request path as above; at Zuul, a device or account override tags individual requests for fault injection as they flow through the Edge to services A, B, and C]
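A hedged sketch of request-level injection (the class names, scenario format, and test account below are hypothetical, not the actual FIT implementation): a fault scenario is attached only to requests from a designated test account or device, so a failure can be simulated end to end in production without affecting other members.

```java
// Illustrative request-scoped fault injection in the spirit of FIT (hypothetical names only).
import java.util.Map;
import java.util.Optional;

public class FitFilterSketch {
    /** accountId -> name of the dependency whose failure should be simulated for that account. */
    private final Map<String, String> scenariosByAccount;

    public FitFilterSketch(Map<String, String> scenariosByAccount) {
        this.scenariosByAccount = scenariosByAccount;
    }

    /** At the edge (e.g. in a Zuul filter), tag matching requests with a fault-injection context. */
    public Optional<String> faultContextFor(String accountId) {
        return Optional.ofNullable(scenariosByAccount.get(accountId));
    }

    /** In a downstream service, an injection point consults the propagated context. */
    public static boolean shouldFail(Optional<String> faultContext, String dependencyName) {
        return faultContext.map(dependencyName::equals).orElse(false);
    }

    public static void main(String[] args) {
        FitFilterSketch fit = new FitFilterSketch(Map.of("test-account-123", "ServiceB"));
        Optional<String> ctx = fit.faultContextFor("test-account-123");
        System.out.println(shouldFail(ctx, "ServiceB"));   // true: simulate ServiceB failing
        System.out.println(shouldFail(ctx, "ServiceC"));   // false: ServiceC behaves normally
    }
}
```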

Benefits
Confidence building for Latency Monkey testing
Continuous resilience testing in test and production
Testing of minimum viable service, fallbacks
Device resilience evaluation

Device Resiliency Matrix

Is that it?
Fault injection isn't enough
Bad code/deployments
Configuration mishaps
Byzantine failures
Memory leaks
Performance degradation

A Multi-Faceted Approach

Continuous Build, Delivery, Deployment


Test Environments, Infrastructure, Coverage
Automated Canary Analysis
Staged Configuration Changes
Crisis Response Tooling & Operations
Real-time Analytics, Detection, Alerting
Operational Insight - Dashboards, Reports
Performance and Efficiency Engagements

It's also about people and culture

Technical Culture
You build it, you run it
Each failure is an opportunity to learn
Blameless incident reviews
Commitment to continuous improvement

Context and Collaboration


Context engages partners
Data and root causes
Global vs. local
Urgent vs. important
Long term vision

Collaboration yields better solutions and buy-in

The Simian Army is part of the Netflix open source cloud platform

http://netflix.github.com

Josh Evans
Director of Operations Engineering, Netflix
[email protected], @josh_evans_nflx
