SDA Session 8
Topics for today
• Reliability
• Availability
• Single points of failure
Distributed computing – living with failures
• Failures of nodes and links are a common concern in distributed systems
• Essential to build fault tolerance into the design
• Fault tolerance is a measure of
• How well a distributed system functions in the presence of failures of system components
• Tolerance of component faults is measured by these parameters:
• Reliability - An inverse indicator of failure rate
• How long a system runs before it fails
• Availability - The fraction of time a system is available for use
• The system is not available during failure and repair
• Serviceability - How easy it is to service / repair the system
• Systems have to promise strict RAS (Reliability, Availability, Serviceability) guarantees because downtime means lost revenue
Metrics
• MTTF - Mean Time To Failure
• MTTF = 1 / failure rate = Total #hours of operation / Total #units (each unit operated until it fails)
• MTTF is an averaged value. In reality the failure rate changes over time because it depends on the age of the component (the "bathtub curve")
• Failure rate = 1 / MTTF (assuming a constant average rate over time)
• MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs
• MTTD - Mean Time to Diagnose
• MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF
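A small worked example tying these definitions together. The numbers are assumed for illustration (not from the slides); availability here is the uptime fraction MTTF / MTBF:

```python
# Illustrative values (assumptions, not from the slides)
mttf = 100.0   # Mean Time To Failure, hours of normal operation
mttd = 0.5     # Mean Time To Diagnose, hours
mttr = 3.5     # Mean Time To Recovery / Repair, hours

mtbf = mttd + mttr + mttf     # MTBF = MTTD + MTTR + MTTF (slide formula)
failure_rate = 1.0 / mttf     # failures per hour, averaged over time
availability = mttf / mtbf    # fraction of time the system is usable

print(f"MTBF = {mtbf:.1f} h")                         # 104.0 h
print(f"Failure rate = {failure_rate:.4f} per hour")  # 0.0100
print(f"Availability = {availability:.2%}")           # 96.15%
```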
Reliability - serial assembly
user —> app server —> DB server —> storage/disk
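The original slide shows only the chain above. As a sketch of the standard series-reliability argument (independent components assumed; the 0.99 figures are illustrative, not from the slide): every stage must be up for the request path to work, so the stage reliabilities multiply.

```latex
R_{\text{series}} = \prod_{i=1}^{n} R_i
\qquad \text{e.g. } 0.99 \times 0.99 \times 0.99 \approx 0.970
```

A serial chain is therefore never more reliable than its weakest stage.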
Reliability - parallel assembly
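Again the slide is diagram-only. A sketch of the parallel (redundant) case, assuming independent failures and an illustrative per-component reliability of 0.99: the assembly fails only if every component fails, so the failure probabilities multiply.

```latex
R_{\text{parallel}} = 1 - \prod_{i=1}^{n} (1 - R_i)
\qquad \text{e.g. } 1 - (1 - 0.99)^2 = 0.9999
```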
Topics for today
• Reliability
• Availability
• Single points of failure
Availability
• On failure of a node the whole system has to be shut down, the faulty node replaced, and the system restarted. This takes 2 hours.
• Solution
• MTTR = 2 + 2 = 4 hours
• Downtime fraction = MTTR / (MTTF + MTTR) = 4 / (100 + 4) ≈ 3.85% (the 3.85% figure implies MTTF = 100 hours)
• Cost of downtime per year = USD 80,000/hour × 3.85% × 365 × 24 hours ≈ USD 27 million
https://www.brainkart.com/article/Fault-Tolerant-Cluster-Configurations_11320/
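A tiny script reproducing the downtime-cost arithmetic above (USD 80,000 per downtime hour and 3.85% unavailability are the slide's figures):

```python
cost_per_hour = 80_000        # USD lost per hour of downtime
unavailability = 3.85 / 100   # fraction of the year the system is down
hours_per_year = 365 * 24

annual_cost = cost_per_hour * unavailability * hours_per_year
print(f"Cost of downtime per year ≈ USD {annual_cost / 1e6:.1f} million")  # ≈ 27.0
```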
Availability: Serial and Parallel Systems (1)
Availability: Parallel Systems (2)
[Diagram: comp1 and comp2 in parallel]
For 3 components?
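A sketch answering the question on this slide, assuming n identical, independent components each with availability A (A = 0.95 is illustrative): the system is up as long as at least one component is up.

```latex
A_{\text{parallel}}(n) = 1 - (1 - A)^n
\qquad A(2) = 1 - 0.05^2 = 0.9975
\qquad A(3) = 1 - 0.05^3 = 0.999875
```

Each extra redundant component cuts the unavailability by another factor of (1 - A).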
Reliability block diagrams
Fault Tolerant Clusters – Recovery
• Diagnosis
• Detection of a failure and location of the failed component, e.g. using heartbeat messages between nodes
• Backward recovery (checkpointing) - see the sketch after this list
• Periodically take a checkpoint (save a consistent state on stable storage)
• On failure, isolate the failed component, roll back to the last checkpoint, and resume normal operation
• Easy to implement and application-independent, but wastes execution time on rollback, and the checkpointing work is unused when no failure occurs
• Forward recovery
• Real-time or time-critical systems cannot roll back, so the state is reconstructed on the fly from diagnosis data
• Application-specific and may need additional hardware
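A minimal, illustrative sketch of backward recovery (not from the slides): the loop checkpoints its state every 10 steps and, on a simulated fault, rolls back to the last checkpoint and redoes the lost work. A real cluster would write checkpoints to stable storage and coordinate them across nodes.

```python
import copy
import random

random.seed(1)  # deterministic demo

# Application state: the next step to execute, plus an accumulator.
state = {"next_step": 1, "total": 0}
checkpoint = copy.deepcopy(state)   # stands in for stable storage

while state["next_step"] <= 100:
    step = state["next_step"]
    state["total"] += step          # the "work" of this step
    state["next_step"] = step + 1

    if step % 10 == 0:              # periodic checkpoint
        checkpoint = copy.deepcopy(state)

    if random.random() < 0.02:      # simulated component fault
        # Backward recovery: roll back to the last checkpoint; work
        # done since that checkpoint is lost and will be redone.
        state = copy.deepcopy(checkpoint)

print(state["total"])               # 5050 - correct despite faults
```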
Topics for today
• Reliability
• Availability
• Single points of failure
Single Points of Failure in SMP and Clusters
Redundancy techniques
Review topics
• Session 1: Types of analytics, types of data, intro to caching
• Session 2: Locality of reference - cache hit / miss calculations; given a program or scenario, can you tell whether it exhibits spatial or temporal locality?
• Session 3: Solving latency and bandwidth issues with caching, block size, prefetching, multi-threading. Interplay between techniques, e.g. memory bandwidth impacted when trying to reduce latency with prefetching / multi-threading.
• Session 4: Various message-passing options - blocking, buffering, buffering in interface cards … Various common programming features in Open MPI (distributed memory) and OpenMP (shared memory).
• Session 5: Do you know how to design a parallel program using the right decomposition?
• Session 6: Software and system architectures; given a scenario, can you decide which architecture to use? Fallacies of distributed systems.
• Session 7: Cluster design - components, failover options
• Session 8: Reliability and availability calculations