
SDA Session 8

This document discusses reliability and availability in distributed systems. It defines key metrics like MTTF, MTTR, and availability. It explains how reliability is calculated for systems with components arranged in serial and parallel configurations. Availability is defined as the percentage of time a system is operational. Examples are provided to illustrate how availability is calculated based on MTTF and MTTR of individual components. Various techniques for improving reliability like redundancy and fault tolerance through checkpointing and recovery are also covered. The document concludes by reviewing topics covered in previous sessions related to data analytics systems.

Uploaded by

Roma Thakare

DSECLZG517: Systems for Data Analytics

Session 8: Reliability and availability

Dr. Anindya Neogi


Associate Professor
Topics for today

• Reliability
• Availability
• Single points of failure

Distributed computing – living with failures
• Failures of nodes and links are a common concern in distributed systems
• It is essential to build fault tolerance into the design
• Fault tolerance is a measure of
• How a distributed system functions in the presence of failures of system components
• Tolerance of component faults is measured by parameters
• Reliability - An inverse indicator of failure rate
• How soon a system will fail
• Availability - An indicator of fraction of time a system is available for use
• System is not available during failure
• Serviceability: How easy is it to service / fix
• Systems have to promise strict RAS guarantees because downtime means lost revenue

Metrics
• MTTF - Mean Time To Failure
• MTTF = 1 / failure rate = Total #hours of operation / Total #units
• MTTF is an averaged value. In reality, the failure rate changes over time because it may depend on the age of the component.
• Failure rate = 1 / MTTF (assuming average value over time)
• MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs
• MTTD - Mean Time to Diagnose
• MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF
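
The relationships between these metrics can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_metrics` and the sample log totals are hypothetical, and the sketch assumes each failure is followed by exactly one repair, so the number of failures stands in for both "#units" and "#repairs".

```python
def summarize_metrics(operating_hours, repair_hours, failures, mttd=0.0):
    """Derive MTTF, MTTR, failure rate and MTBF from aggregate log totals.

    operating_hours: total hours the units were up
    repair_hours:    total hours spent on maintenance
    failures:        number of failures (assumed: one repair per failure)
    mttd:            mean time to diagnose, taken as 0 unless given
    """
    mttf = operating_hours / failures
    mttr = repair_hours / failures
    return {
        "mttf": mttf,
        "mttr": mttr,
        "failure_rate": 1.0 / mttf,   # failure rate = 1 / MTTF
        "mtbf": mttd + mttr + mttf,   # MTBF = MTTD + MTTR + MTTF
    }


# Hypothetical log: 1000 hours of operation, 40 hours of repair, 10 failures.
metrics = summarize_metrics(operating_hours=1000, repair_hours=40, failures=10)
print(metrics)
```

With these made-up totals, MTTF comes out as 100 hours, MTTR as 4 hours and MTBF as 104 hours.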

user —> app server —> DB server —> storage/disk

Reliability - serial assembly


• MTTF of a system is a function of MTTF of components
• Serial assembly of components
• Failure of any component results in system failure
• Failure rate of C = Failure rate of A + Failure rate of B = 1/ma + 1/mb

MTTF mc = 1 / (1/ma + 1/mb)

[Figure: system C is a serial assembly of component A (MTTF = ma) and component B (MTTF = mb)]

Example: A server fails every 90 days. The disk fails every 45 days. In serial assembly, the system fails every 1 / (1/90 + 1/45) = 30 days.

• MTTF of system = 1 / SUM (1/MTTFi) for all components i


• Failure rate of system = SUM(1/MTTFi) for all components i
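
A minimal sketch of the serial rule, assuming the averaged failure rates simply add up as stated above. The function name `serial_mttf` is hypothetical.

```python
def serial_mttf(mttfs):
    """MTTF of a serial assembly: component failure rates (1/MTTF) add up."""
    system_failure_rate = sum(1.0 / m for m in mttfs)
    return 1.0 / system_failure_rate


# Server (MTTF 90 days) in series with a disk (MTTF 45 days).
print(serial_mttf([90, 45]))  # approximately 30 days
```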

Reliability - parallel assembly

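• In a parallel assembly, e.g. a cluster of nodes, the system C fails only when every component has failed
• MTTF of C = MTTF of A + MTTF of B, because both A and B have to fail for C to fail
• MTTF of system = SUM(MTTFi) for all components i

MTTF mc = ma + mb

[Figure: system C is a parallel assembly of component A (MTTF = ma) and component B (MTTF = mb)]

Example: A server fails every 90 days. The disk fails every 45 days. 2 redundant disks are connected in parallel, so the disk subsystem fails every 45 + 45 = 90 days. The server and the disk subsystem are in series, so the system fails every 1 / (1/90 + 1/90) = 45 days.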
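
The redundant-disk example can be reproduced with a short sketch that combines the additive parallel rule used in this session with the serial rule. The function names are hypothetical, and the additive rule is taken as given from the slide rather than derived.

```python
def parallel_mttf(mttfs):
    """Parallel assembly under the session's additive model: MTTFs add up."""
    return sum(mttfs)


def serial_mttf(mttfs):
    """Serial assembly: failure rates (1/MTTF) add up."""
    return 1.0 / sum(1.0 / m for m in mttfs)


# Two redundant disks (45 days each) form the disk subsystem.
disk_subsystem = parallel_mttf([45, 45])            # 90 days
# The server (90 days) is in series with the disk subsystem.
system = serial_mttf([90, disk_subsystem])
print(disk_subsystem, system)                       # 90 and approximately 45
```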

Availability

• Availability = Time system is UP and accessible / Total time observed


• Availability = MTTF / (MTTD* + MTTR + MTTF)
or
• Availability = MTTF / MTBF
• A system is highly available when
• MTTF is high
• MTTR is low

* Unless specified, one can assume MTTD = 0


Example
• A node in a cluster fails every 100 hours while other parts never fail.

• On failure of the node, the whole system needs to be shut down, the faulty node replaced and the system restarted. This takes 2 hours.

• The application needs to be restarted, which takes 2 hours.

• What is the availability of the cluster ?

• If downtime costs $80k per hour, then what is the yearly cost?

• Solution

• MTTF = 100 hours

• MTTR = 2 + 2 = 4 hours

• Availability = 100/104 = 96.15%

• Cost of downtime per year = 80000 * (3.85 / 100) * 365 * 24 ≈ USD 27 million

https://www.brainkart.com/article/Fault-Tolerant-Cluster-Configurations_11320/
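
The worked example can be checked with a small sketch. The function names and the constant `HOURS_PER_YEAR` are hypothetical; the inputs are the figures from the example, with MTTD assumed to be 0.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mttf, mttr, mttd=0.0):
    """Availability = MTTF / (MTTD + MTTR + MTTF)."""
    return mttf / (mttd + mttr + mttf)


def yearly_downtime_cost(avail, cost_per_hour):
    """Expected downtime hours per year times the hourly cost of downtime."""
    return (1.0 - avail) * HOURS_PER_YEAR * cost_per_hour


a = availability(mttf=100, mttr=2 + 2)      # 2 h replacement + 2 h restart
cost = yearly_downtime_cost(a, 80_000)
print(f"availability = {a:.2%}")             # availability = 96.15%
print(f"yearly cost  = USD {cost / 1e6:.1f} million")
```

Using the unrounded availability gives a cost just under USD 27 million, which agrees with the rounded figure in the solution.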

Availability : Serial and Parallel Systems (1)

Serial system:
A(system) = Product(Ai) for all i
= 0.995 x 0.995
= 0.990025

Parallel system:
A(system) = 1 - Unavailability(system)
= 1 - Product(1 - Ai) for all i
= 1 - (1 - 0.995)(1 - 0.995)
= 1 - 0.005 x 0.005 = 0.999975
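
Both rules translate directly into code. The sketch below assumes component failures are independent; the function names are hypothetical.

```python
from math import prod


def serial_availability(avails):
    """All components must be up: multiply the availabilities."""
    return prod(avails)


def parallel_availability(avails):
    """System is down only if every component is down."""
    return 1.0 - prod(1.0 - a for a in avails)


pair = [0.995, 0.995]
print(round(serial_availability(pair), 6))    # 0.990025
print(round(parallel_availability(pair), 6))  # 0.999975
```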

Availability : Parallel Systems (2)

[Figure: comp1 and comp2 connected in parallel]

A(S) = A(Comp1 U Comp2)
= A(Comp1) + A(Comp2) - A(Comp1) * A(Comp2)
= 0.995 + 0.995 - 0.995 * 0.995
= 0.999975

For 3 components ?

A(S) = A1 + A2 + A3 - A1*A2 - A1*A3 - A2*A3 + A1*A2*A3
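
The union expansion generalises to any number of components by inclusion-exclusion, and it should agree with the complement form 1 - Product(1 - Ai). The sketch below checks this numerically for three components, assuming independent failures; the function names are hypothetical.

```python
from itertools import combinations
from math import prod


def union_availability(avails):
    """Inclusion-exclusion: add odd-sized product terms, subtract even-sized ones."""
    total = 0.0
    for k in range(1, len(avails) + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * sum(prod(c) for c in combinations(avails, k))
    return total


def complement_availability(avails):
    """Equivalent closed form: 1 - Product(1 - Ai)."""
    return 1.0 - prod(1.0 - a for a in avails)


three = [0.995, 0.995, 0.995]
print(union_availability(three), complement_availability(three))
```

Both calls give approximately 0.999999875, i.e. 1 - 0.005^3. The complement form is usually the more convenient one to compute, since the number of inclusion-exclusion terms grows exponentially with the number of components.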

Reliability block diagrams

• Systems are a complex combination of serial and parallel connections
• An RBD model is used to analyse the availability of a complex system by encapsulating serial or parallel connections within blocks
• Sometimes it is non-trivial to create an RBD given the system dependencies, e.g.
• User to application needs both switch 1 and switch 2 available
• Application needs web service 1, which needs either of the 2 switches available to use the DB
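
One way to picture the encapsulation idea is a small recursive evaluator over nested blocks. This is a sketch under stated assumptions, not a model of the system in the slide: the tuple encoding, the function name `rbd_availability` and the availability figures are all hypothetical, and it assumes independent failures with each component appearing in exactly one block. A component that is needed in more than one place, like the switches above, breaks that assumption.

```python
from math import prod


def rbd_availability(block):
    """Evaluate a nested block: a number (leaf) or (kind, [children])."""
    if isinstance(block, (int, float)):
        return float(block)
    kind, children = block
    child_avails = [rbd_availability(c) for c in children]
    if kind == "series":
        return prod(child_avails)
    if kind == "parallel":
        return 1.0 - prod(1.0 - a for a in child_avails)
    raise ValueError(f"unknown block kind: {kind!r}")


# Hypothetical system: app server, then two redundant switches, then a DB.
system = ("series", [0.99, ("parallel", [0.995, 0.995]), 0.999])
print(rbd_availability(system))
```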

Fault Tolerant Clusters – Recovery
• Diagnosis
• Detection of failure and location of the failed component, e.g. using
heartbeat messages between nodes
• Backward recovery
• Periodically take a checkpoint (save a consistent state on stable storage)
• On failure, isolate the failed component, roll back to the last checkpoint and resume normal operation
• Easy to implement and independent of the application, but execution time is wasted on rollback, and checkpointing work goes unused when no failure occurs
• Forward recovery
• Real-time or time-critical systems cannot roll back, so state is reconstructed on the fly from diagnosis data
• Application specific and may need additional hardware
[Figure: execution timeline with checkpoints and rollback on errors]
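
Backward recovery can be illustrated with a toy single-process simulation. Everything here is a simplification for illustration: the function name `run_with_checkpoints` is hypothetical, an in-memory copy stands in for stable storage, and the failure is injected at a chosen step rather than detected by heartbeats.

```python
import copy


def run_with_checkpoints(items, checkpoint_every, fail_at=None):
    """Sum items with periodic checkpoints; inject one failure at index fail_at.

    Returns (total, redone), where redone counts steps repeated after rollback.
    """
    state = {"next_index": 0, "total": 0}
    stable_storage = copy.deepcopy(state)   # stand-in for stable storage
    redone = 0
    failed_once = False

    while state["next_index"] < len(items):
        i = state["next_index"]
        if fail_at is not None and i == fail_at and not failed_once:
            failed_once = True
            redone = i - stable_storage["next_index"]   # work lost to rollback
            state = copy.deepcopy(stable_storage)       # roll back
            continue
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            stable_storage = copy.deepcopy(state)       # take a checkpoint
    return state["total"], redone


total, redone = run_with_checkpoints(list(range(10)), checkpoint_every=4, fail_at=6)
print(total, redone)  # 45 2
```

The result is still correct after the failure, at the cost of redoing the two steps completed since the last checkpoint. A shorter checkpoint interval reduces that rework but adds checkpointing overhead, which is the trade-off noted above.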

Single Points of Failure in SMP and Clusters

• Bus / memory failures?
• Ethernet failures?
• Node failures? Protect against node failures with periodic checkpoints on global storage

Redundancy techniques

• Availability can be increased in 2 ways
» Increase MTTF - almost saturated and expensive to increase further
» Reduce MTTR - have redundancy in the cluster so that another node takes over as one fails (hiding failures)
» Isolated redundancy - redundant components are isolated, e.g. backup node shares nothing with primary node
» N-version programming - N copies of software are independently built and run. Results are compared and a majority vote is taken.
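
The voting step of N-version programming can be sketched as follows. The three "versions" are toy functions invented for illustration, one of them deliberately faulty, and the names `majority_vote` and `run_n_versions` are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of versions, else raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value


def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x * x + 1  # deliberately faulty version


def run_n_versions(x, versions):
    """Run every independently built version and vote on the results."""
    return majority_vote([v(x) for v in versions])


print(run_n_versions(7, [version_a, version_b, version_c]))  # 49
```

The faulty version is outvoted, so its error is hidden from the caller. If no value has a strict majority, the sketch raises instead of guessing.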

Review topics
• Session 1: Types of analytics, types of data, intro to caching
• Session 2: Locality of reference - cache hit / miss calculations. Given a program or scenario, can you tell whether it shows spatial or temporal locality?
• Session 3: Solving latency and bandwidth issues with caching, block size, prefetching, multi-threading. Interplay between techniques, e.g. memory bandwidth impacted in trying to reduce latency with prefetching / multi-threading.
• Session 4: Various types of message options - blocking, buffering, buffering in interface cards, etc. Various common programming features in Open MPI (distributed memory) and OpenMP (shared memory).
• Session 5: Do you know how to design a parallel program using the right decomposition?
• Session 6: Software and system architectures. Given a scenario, can you decide which architecture to use? Fallacies in distributed systems.
• Session 7: Cluster design - components, failover options
• Session 8: Reliability and availability calculations
