Assignment #1 Paper #5 - Resilience Distributed Systems - A White Paper
RESILIENCE IN DISTRIBUTED SYSTEMS
Abstract
With the rapid increase in the number of internet users and frequent
changes in consumer trends, traditional systems have no choice but
to scale out, distribute, and decentralize. To give you an idea of the
extent of scaling involved, Facebook and YouTube would each have
had 40,000 to 45,000 hits from desktop users alone in the 5 seconds
it took you to read this paragraph [5].
With petabytes and exabytes of data being generated every day and
the growing adoption of IoT, this scaling is only going to accelerate,
and systems will need to scale out and depend on each other more
than ever.
Distributed systems are driven by various Architecturally Significant
Requirements (ASRs) [24], and one such ASR is resilience.
This paper presents the SPEAR (Software Performance Engineering
and ARchitecture services) team’s perspective on application
resilience in distributed systems: what it means in simple terms,
how to study it, the factors affecting it, and which patterns and best
practices can help us improve it.
Who is this document for?

This is presented from the perspective of a lead application designer or an architect, but there are some theories and methods for IT managers, developers, architects, tech leads, and program managers who are looking to understand and improve resilience in a distributed system.

Some basic definitions first.

What is a distributed system?

A distributed system is one where multiple components of a system are physically or logically separated and governed by common and component-specific requirements.

We say common because if you are designing a system to be 99.99% available, you cannot have a crucial cache component in the system which is available only 97% of the time. If that is the case, you need to have a trade-off in place to make sure the service availability is met in spite of only 97% availability of the cache component.

We say component-specific because the design of the cache component is driven by speed and will leverage much of the primary storage, whereas the design of a persistent database will aim to manage the secondary disk effectively.

What is resilience?

Resilience of an application, in simple language, is the capability of the application to spring back to an acceptable operational condition after it faces an event affecting its operating conditions.

[‘capability’: what you have inside your systems to put them back in an acceptable operating condition; ‘event’: a failure of responsibility of some module within the application, a failure of responsibility of some dependent system, or a force majeure situation]

A ‘failure of responsibility’, or simply a failure, could be a breach of an SLA or of some other agreement regarding an application. It could be big, like a failure of Amazon Route 53 services, or small, like a bug in the implementation of the Set interface which behaves unexpectedly under normal operating conditions.

The flipside of resilience is all about understanding and preparing for failures. So resilience can also be defined as the capability of the system to understand and manage failures occurring in the system effectively, with minimal disruption to business operations.

Why is this important?

Because it affects business, trust, and lives.

A downtime of 1 minute at Amazon can result in a potential loss of USD 234,000 through their online channels alone [6][7][8]. A technical glitch caused an outage on the New York Stock Exchange, leading to a drop in share prices and indexes.

Can you imagine an outage in a critical hospital system, an air traffic control system, a core trading platform, or a police or emergency contact system?

Metrics such as RPO (Recovery Point Objective), RTO (Recovery Time Objective), MTTR (Mean Time to Recovery), the number of failures/bugs detected in the system, and SLA variance are some ways to measure the resilience of a system. We will not go into detail in this paper about the measurement and the tools used, but will take a look at the various failures and the patterns which can be used to improve the MTTR, RPO, or RTO of a system.

Let’s begin with a couple of modeling techniques needed for studying resilience: flow modeling and failure modeling.
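The 99.99% versus 97% trade-off described above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not a method taken from this paper. It assumes that component failures are independent, and the availability figures for the rest of the service (99.995%) and for the database used as a fallback (99.9%) are hypothetical values chosen only for illustration. The function names are likewise our own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(*alternatives):
    """Availability when any one of the alternatives is sufficient.

    Assumes the alternatives fail independently of each other.
    """
    all_down = 1.0
    for availability in alternatives:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    target = 0.9999           # 99.99% service target from the text
    cache = 0.97              # 97% cache availability from the text
    database = 0.999          # hypothetical fallback store
    rest_of_service = 0.99995 # hypothetical figure for everything else

    # Cache as a hard dependency: the service can be no better than the cache.
    hard_dependency = serial_availability(rest_of_service, cache)

    # Trade-off: fall back to the (slower) database when the cache is down.
    with_fallback = serial_availability(
        rest_of_service, redundant_availability(cache, database)
    )

    for label, value in [("hard dependency", hard_dependency),
                         ("with fallback", with_fallback)]:
        print(f"{label}: {value:.6f}, meets target: {value >= target}, "
              f"downtime/year: {downtime_minutes_per_year(value):.1f} min")
```

Under these assumptions, the cache as a hard dependency caps the service at roughly the cache's own availability, while the fallback path trades speed for availability. In a real system the independence assumption rarely holds fully, so such numbers should be treated as an upper bound.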
Flow modeling

Traditionally we study the various flows in the system via use case analysis, control/data flows, sequence diagrams, etc., but will this linear and branching flow study really help?

Let’s consider the control flow of this example: an online food ordering website lets the user select the food (A), check out the same (B), check coupons (C) via an external service, recalculate the checkout amount (B) if there are any discounts, select an address (D), and make the payment through an external service (E1), which in turn automatically places an order (E2) via the restaurant API.

While this is more of a happy flow or ideal flow of a business operation, for a resilience study we need to look at the alternate flows. Alternate flows are the control and data flows which are taken by the application if an unexpected behavior occurs. So to understand and design alternate flows, we need to include the list of implicit services and dependencies at each step.
III. Testing methods:

The application landscape has changed, and the world is being rapidly rewritten in code for tomorrow, even as we speak.

To keep up with this change, it is imperative that software testing methods become more robust. One such method is chaos testing [15], which is the process of testing failures in a distributed system by injecting known failures into the system and observing its behavior. For example, inject a JVM memory exception into a remote JVM instance and observe the response time of the system.

Some tools which can be used for chaos testing are Chaos Monkey, Simian Army, Gremlin, etc.

Another way is to employ a shift-left testing strategy, where the testing methods start engaging very early in the cycle and act as a gateway to promote the code. Tool chains in the DevOps CI/CD pipeline should integrate testing tools so that practically all kinds of testing (unit, performance, security, integration, etc.) can be executed and studied in a controlled environment before deciding whether the software or the patch can go live. Tools like Kubernetes, JMeter, Jenkins, Docker, Dynatrace, etc., enable us to model newer testing approaches. Robust monitoring systems and tools today also allow us to detect an error found in production and roll back the deployments made.

The testing tools and methods today are sophisticated and continually evolving. Embracing these new testing methods, tools, and processes is imperative for building a robust system.

IV. Deployment issues:

An intelligent deployment strategy can prevent issues caused by software problems. Let’s look at a couple of examples below:

1. Blue/Green deployment – The idea is to maintain two production environments
References
1. German bank error - http://content.time.com/time/business/article/0,8599,1952305,00.html
2. Year 2010 problem - https://www.dw.com/en/millions-of-german-bank-cards-hit-by-software-bug/a-5088075
3. Reliability - http://www.cse.msu.edu/~stire/HomePage/Papers/wadsChapter05.pdf
4. Idempotent failover - https://www.springer.com/us/book/9783540407270
5. Traffic stats - https://www.similarweb.com/website/facebook.com#overview
6. Downtime and lag time - https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a
7. Downtime of amazon - https://www.forbes.com/sites/kellyclay/2013/08/19/amazon-com-goes-down-loses-66240-per-minute/#5eabc9b6495c
8. Amazon revenue by segment - online sales - https://www.statista.com/statistics/672747/amazons-consolidated-net-revenue-by-segment/
9. 3 Banks DDoS attacks - https://www.cshub.com/attacks/news/incident-of-the-week-ddos-attack-hits-3-banks
10. EBS - AWS - https://aws.amazon.com/ebs/features/#Amazon_EBS_Snapshots
11. Intel’s comments - https://www.theregister.co.uk/2018/10/08/intel_security_commitment/
12. Telstra human error - https://www.news.com.au/technology/gadgets/mobile-phones/telstra-explains-network-outage-as-worker-faces-the-music/news-story/7e3f2214350094c3c2096ad14f7480ae
13. VISA outage - https://www.cbronline.com/news/visa-outage
14. Actor Model - https://doc.akka.io/docs/akka/2.5/guide/actors-intro.html
15. Chaos Engineering - https://principlesofchaos.org/
16. ASUS attack - https://techhq.com/2019/03/asus-breach-highlights-software-supply-chain-risk/
17. Let it Fail approach - http://ward.bay.wiki.org/view/let-it-fail
18. Agile Manifesto - https://agilemanifesto.org/
19. Black Hole - https://www.brianmadden.com/opinion/Dealing-with-the-Black-Hole-Effect-Throttling-Logons-to-New-Servers
20. TOGAF Architecture principles - http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap29.html
21. Flow Modeling references:
a. FMEA - https://wiki.ece.cmu.edu/ddl/index.php/FMEA
b. PGM - http://pgm.stanford.edu/algorithms/
22. Mechanical Sympathy - https://dzone.com/articles/mechanical-sympathy
23. Survival analysis - https://en.wikipedia.org/wiki/Survival_analysis
24. Architecturally Significant Requirements:
a. https://www.ida.liu.se/~TDDD09/openup/core.tech.common.extend_supp/guidances/concepts/arch_significant_requirements_1EE5D757.html
b. https://www.ibm.com/developerworks/rational/library/4706.html
© 2019 Infosys Limited, Bengaluru, India. All Rights Reserved. Infosys believes the information in this document is accurate as of its publication date; such information is subject to change without notice. Infosys
acknowledges the proprietary rights of other companies to the trademarks, product names and such other intellectual property rights mentioned in this document. Except as expressly permitted, neither this
documentation nor any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, printing, photocopying, recording or otherwise, without the
prior permission of Infosys Limited and/ or any named intellectual property rights holders under this document.