SDA Session 8
Topics for today
• Reliability
• Availability
• Single points of failure
Distributed computing – living with failures
• Failures of nodes and links are a common concern in distributed systems
• Essential to build fault tolerance into the design
• Fault tolerance is a measure of
• How well a distributed system functions in the presence of failures of system components
• Tolerance of component faults is measured by these parameters:
• Reliability - An inverse indicator of failure rate
• How long a system runs before it fails
• Availability - The fraction of time a system is available for use
• The system is not available during failure and repair
• Serviceability - How easy it is to service / repair the system
• Systems have to promise strict RAS (Reliability, Availability, Serviceability) guarantees because downtime means lost revenue
Metrics
• MTTF - Mean Time To Failure
• MTTF = 1 / failure rate = Total #hours of operation / Total #units (each unit operated until it fails)
• MTTF is an averaged value. In reality the failure rate changes over time because it depends on the age of the component (the "bathtub curve")
• Failure rate = 1 / MTTF (assuming a constant average rate over time)
• MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs
• MTTD - Mean Time to Diagnose
• MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF
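A small worked example tying these definitions together. The numbers are assumed for illustration (not from the slides); availability here is the uptime fraction MTTF / MTBF:

```python
# Illustrative values (assumptions, not from the slides)
mttf = 100.0   # Mean Time To Failure, hours of normal operation
mttd = 0.5     # Mean Time To Diagnose, hours
mttr = 3.5     # Mean Time To Recovery / Repair, hours

mtbf = mttd + mttr + mttf     # MTBF = MTTD + MTTR + MTTF (slide formula)
failure_rate = 1.0 / mttf     # failures per hour, averaged over time
availability = mttf / mtbf    # fraction of time the system is usable

print(f"MTBF = {mtbf:.1f} h")                         # 104.0 h
print(f"Failure rate = {failure_rate:.4f} per hour")  # 0.0100
print(f"Availability = {availability:.2%}")           # 96.15%
```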
Reliability - serial assembly
user —> app server —> DB server —> storage/disk
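The original slide shows only the chain above. As a sketch of the standard series-reliability argument (independent components assumed; the 0.99 figures are illustrative, not from the slide): every stage must be up for the request path to work, so the stage reliabilities multiply.

```latex
R_{\text{series}} = \prod_{i=1}^{n} R_i
\qquad \text{e.g. } 0.99 \times 0.99 \times 0.99 \approx 0.970
```

A serial chain is therefore never more reliable than its weakest stage.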
Reliability - parallel assembly
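Again the slide is diagram-only. A sketch of the parallel (redundant) case, assuming independent failures and an illustrative per-component reliability of 0.99: the assembly fails only if every component fails, so the failure probabilities multiply.

```latex
R_{\text{parallel}} = 1 - \prod_{i=1}^{n} (1 - R_i)
\qquad \text{e.g. } 1 - (1 - 0.99)^2 = 0.9999
```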
Topics for today
• Reliability
• Availability
• Single points of failure
Availability
• On failure of a node the whole system has to be shut down, the faulty node replaced, and the system restarted. This takes 2 hours.
• Solution
• MTTR = 2 + 2 = 4 hours
• Downtime fraction = MTTR / (MTTF + MTTR) = 4 / (100 + 4) ≈ 3.85% (the 3.85% figure implies MTTF = 100 hours)
• Cost of downtime per year = USD 80,000/hour × 3.85% × 365 × 24 hours ≈ USD 27 million
https://www.brainkart.com/article/Fault-Tolerant-Cluster-Configurations_11320/
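A tiny script reproducing the downtime-cost arithmetic above (USD 80,000 per downtime hour and 3.85% unavailability are the slide's figures):

```python
cost_per_hour = 80_000        # USD lost per hour of downtime
unavailability = 3.85 / 100   # fraction of the year the system is down
hours_per_year = 365 * 24

annual_cost = cost_per_hour * unavailability * hours_per_year
print(f"Cost of downtime per year ≈ USD {annual_cost / 1e6:.1f} million")  # ≈ 27.0
```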
Availability: Serial and Parallel Systems (1)
Availability: Parallel Systems (2)
[Diagram: comp1 and comp2 in parallel]
For 3 components?
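A sketch answering the question on this slide, assuming n identical, independent components each with availability A (A = 0.95 is illustrative): the system is up as long as at least one component is up.

```latex
A_{\text{parallel}}(n) = 1 - (1 - A)^n
\qquad A(2) = 1 - 0.05^2 = 0.9975
\qquad A(3) = 1 - 0.05^3 = 0.999875
```

Each extra redundant component cuts the unavailability by another factor of (1 - A).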
Reliability block diagrams
Fault Tolerant Clusters – Recovery
• Diagnosis
• Detection of a failure and location of the failed component, e.g. using heartbeat messages between nodes
• Backward recovery (checkpointing) - see the sketch after this list
• Periodically take a checkpoint (save a consistent state on stable storage)
• On failure, isolate the failed component, roll back to the last checkpoint, and resume normal operation
• Easy to implement and application-independent, but wastes execution time on rollback, and the checkpointing work is unused when no failure occurs
• Forward recovery
• Real-time or time-critical systems cannot roll back, so the state is reconstructed on the fly from diagnosis data
• Application-specific and may need additional hardware
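A minimal, illustrative sketch of backward recovery (not from the slides): the loop checkpoints its state every 10 steps and, on a simulated fault, rolls back to the last checkpoint and redoes the lost work. A real cluster would write checkpoints to stable storage and coordinate them across nodes.

```python
import copy
import random

random.seed(1)  # deterministic demo

# Application state: the next step to execute, plus an accumulator.
state = {"next_step": 1, "total": 0}
checkpoint = copy.deepcopy(state)   # stands in for stable storage

while state["next_step"] <= 100:
    step = state["next_step"]
    state["total"] += step          # the "work" of this step
    state["next_step"] = step + 1

    if step % 10 == 0:              # periodic checkpoint
        checkpoint = copy.deepcopy(state)

    if random.random() < 0.02:      # simulated component fault
        # Backward recovery: roll back to the last checkpoint; work
        # done since that checkpoint is lost and will be redone.
        state = copy.deepcopy(checkpoint)

print(state["total"])               # 5050 - correct despite faults
```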
Topics for today
• Reliability
• Availability
• Single points of failure
Single Points of Failure in SMP and Clusters
Redundancy techniques
Review topics
• Session 1: Types of analytics, types of data, intro to caching
• Session 2: Locality of reference - cache hit / miss calculations; given a program or scenario, can you tell whether it exhibits spatial or temporal locality?
• Session 3: Solving latency and bandwidth issues with caching, block size, prefetching, multi-threading. Interplay between techniques, e.g. memory bandwidth impacted when trying to reduce latency with prefetching / multi-threading.
• Session 4: Various message-passing options - blocking, buffering, buffering in interface cards … Various common programming features in Open MPI (distributed memory) and OpenMP (shared memory).
• Session 5: Do you know how to design a parallel program using the right decomposition?
• Session 6: Software and system architectures; given a scenario, can you decide which architecture to use? Fallacies of distributed systems.
• Session 7: Cluster design - components, failover options
• Session 8: Reliability and availability calculations