
Presentation - 02 Reliability in Computer Systems

This document discusses reliability in computer systems. It defines reliability and outlines things that can cause systems to fail, such as hardware issues, software bugs, human error, and natural disasters. It also describes critical systems and different types, including safety-critical, mission-critical, business-critical, and security-critical systems. Methods to improve reliability like backups, redundancy, fault tolerance, and defensive programming are explained.

Uploaded by

victorwu.uk

Teach Computer Science

Reliability in computer systems

teachcomputerscience.com

Lesson Objectives

▪ Students will learn about things that can go wrong in a computer system and how to avoid such situations.
▪ What are critical systems?
▪ How to protect systems from failure
▪ What to do if a computer system fails
▪ How to analyse reliability of a system

1. Content

What is reliability?
▪ Reliability is the attribute of a computer-related component that denotes how consistently it performs according to its specification.


Things that can go wrong

▪ Hardware might fail to operate.
▪ Software might contain bugs or errors.
▪ Human errors can also make the system inefficient. Security is very important to systems as there might even be a deliberate attack.
▪ Natural disasters like flooding or an earthquake, as well as power cuts, affect the operation of systems too.

[Diagram: four causes of system failure: natural disasters, hardware failure, software error, human error]


What is a critical system?


▪ Critical systems are computer systems that must be highly reliable, as their failure may have a great impact on human lives.
▪ They are developed using conservative, well-proven techniques rather than new ones.
▪ A new technique is only adopted after its long-term effects have been analysed, even if it seems more efficient.


Types of critical systems

▪ Safety-critical systems: Designed to avoid danger to human lives and the environment. Example: temperature control of nuclear reactors.
▪ Mission-critical systems: Failure of these systems affects the overall performance, as they are responsible for the goals in the system. Example: navigation systems of aircraft.
▪ Business-critical systems: Designed to avoid loss of business, economic loss and loss of reputation. Example: banking systems.
▪ Security-critical systems: Designed to protect sensitive information that can be misused when in the wrong hands. Example: defence systems.

What is backup?
▪ Duplicate data and files stored in a separate server or storage
drive to improve the reliability of a system are called backups.
▪ This protects the data from being lost due to failure.
▪ Backup is also useful when data is accidentally overwritten.
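The idea of keeping a verified duplicate can be sketched in a few lines of Python. This is only an illustration, not a production backup tool: the function names `back_up` and `restore` are hypothetical, and the checksum comparison is an added safeguard that the slides do not mention.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and check the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target


def restore(backup: Path, destination: Path) -> None:
    """Copy the backup over a lost or accidentally overwritten file."""
    shutil.copy2(backup, destination)
```

In practice `backup_dir` would sit on a separate server or storage drive, as the slide describes; a copy on the same disk does not protect against that disk failing.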


Backup procedure

▪ The team responsible for the backup procedure performs the backup according to a well-defined schedule.
▪ Backup disks are to be stored in a secure location.
▪ Disks and tapes secured in a fireproof location away from the main site are called an off-site backup.
▪ The data can also be backed up over the Internet using cloud technology.

[Diagram: back-up, with four labels: safe, scheduled, fire-safe, cloud technology]

Disaster recovery

▪ Disaster recovery is the process of getting back lost data from the backup after a system failure.
▪ Let us consider the example of hardware failure. To recover from this failure, the hardware is repaired or replaced with new hardware. The data is then recovered from the backup and copied to the hardware.
▪ Examples of precautionary measures taken by an organisation to avoid disaster are the use of an uninterruptible power supply (UPS), surge protectors (to protect electronic equipment from power surges), fire prevention and anti-virus software.


Redundancy

▪ Redundancy is the duplication of critical parts of a computer system to improve reliability.
▪ If the primary system fails, the backup or reserve system steps in.
▪ Redundancy is very important in critical systems like aircraft systems. If any hardware or software fails during a flight, the redundant system steps in to avoid failure.

[Diagram: redundancy, with three types: hardware redundancy, software redundancy, data redundancy]
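The "reserve system steps in" behaviour can be illustrated with a small Python sketch. Everything here is hypothetical: `primary_sensor` and `reserve_sensor` stand in for duplicated components, and the primary is written to fail so that the switch-over is visible.

```python
from typing import Callable, Sequence


def call_with_failover(systems: Sequence[Callable[[], str]]) -> str:
    """Try each redundant system in order; return the first successful result."""
    errors = []
    for system in systems:
        try:
            return system()
        except Exception as error:  # a real system would catch narrower errors
            errors.append(error)
    raise RuntimeError(f"all {len(systems)} redundant systems failed: {errors}")


def primary_sensor() -> str:
    raise ConnectionError("primary sensor offline")  # simulated failure


def reserve_sensor() -> str:
    return "altitude=10000"


print(call_with_failover([primary_sensor, reserve_sensor]))  # prints altitude=10000
```

The caller never needs to know which component answered, which is the point of redundancy: the failure of the primary is absorbed rather than passed on.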


Types of redundancy

▪ Hardware redundancy: Computer systems have an extra critical hardware device to avoid failure. Example: A system is provided with two power supplies in a parallel set up so that they can be easily switched if one of them fails. Redundant array of independent disks (RAID): multiple physical disk drives are used to store redundant data.
▪ Software redundancy: Redundant software is used to replace the original program in case it fails.
▪ Data redundancy: Redundant data in the backup can replace the original data in case the original data is lost or overwritten accidentally.
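One way redundant data lets a disk array survive a failed drive is parity. The slides do not say which RAID scheme they have in mind, so the sketch below is only an assumption-laden illustration of XOR parity (the idea behind parity-based levels such as RAID 5), with three tiny made-up "disks".

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of three data disks, plus one computed parity disk.
data_disks = [b"RELI", b"ABLE", b"DATA"]
parity_disk = xor_blocks(data_disks)

# Simulate losing the second disk and rebuilding it from the survivors and parity.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # prints b'ABLE'
```

Because XOR-ing a value twice cancels it out, combining the surviving disks with the parity leaves exactly the missing block.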

What is fault-tolerance?
▪ Fault tolerance is a property that enables a system to operate properly even if one or more of its components fail.
▪ Essential for life-critical systems.
▪ A fault-tolerant design enables a system to continue operating, possibly at a reduced level, rather than failing completely when some parts of the system fail.
▪ Data is protected from damage, intrusion or disclosure.


What is a fail-soft system?

▪ A system that fails gracefully, that is, continues to operate at a reduced level after some components fail, is called a fail-soft system.
▪ For example, a building may operate with reduced lighting and elevators when the power fails.
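The building example can be turned into a short Python sketch of graceful degradation. The service names and power figures are invented for illustration: services are switched on in priority order until the available power runs out, so a reduced supply yields reduced service rather than a total shutdown.

```python
from typing import List


def plan_services(available_kw: float) -> List[str]:
    """Pick services in priority order until the power budget is used up."""
    # (name, demand in kW), most essential first; all figures are hypothetical.
    services = [
        ("emergency lighting", 5),
        ("one elevator", 20),
        ("full lighting", 40),
        ("remaining elevators", 60),
    ]
    running = []
    for name, demand in services:
        if demand <= available_kw:
            available_kw -= demand
            running.append(name)
    return running


print(plan_services(125))  # mains power: all four services run
print(plan_services(30))   # generator only: emergency lighting and one elevator
```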


Defensive programming

▪ Software can be made more reliable by adding extra checks.
▪ These checks warn the user when the program is not working in the desired manner. This is called defensive programming.
▪ This enables the user to take action.
▪ Without these extra checks, the program would crash without any warning.
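A minimal Python sketch of such extra checks is shown below. The `withdraw` function is a hypothetical example, not part of the original lesson; it validates its inputs before acting on them and reports a clear message instead of producing a wrong balance or crashing later.

```python
def withdraw(balance: float, amount: float) -> float:
    """Return the new balance, checking the inputs before acting on them."""
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise TypeError("amount must be a number")
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


try:
    withdraw(100.0, 250.0)
except ValueError as problem:
    print(f"Warning: {problem}")  # prints Warning: insufficient funds
```

The caller can now take action on the warning, for example by asking the user for a smaller amount.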


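Measuring reliability

Reliability of a system is measured using various statistical parameters that are used to predict how reliable the system is.

[Timeline diagram: system failure, then resumes normal operation, then the next system failure. Time to repair runs from a failure to the resumption of normal operation, time to failure runs from the resumption to the next failure, and time between failures runs from one failure to the next.]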


Reliability factors

▪ Percentage of time:
The percentage of a given period, such as a particular month, for which the service was available and operational.
▪ Number of hours:
The amount of time the system has operated without reporting any problems.
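As a worked illustration of the percentage-of-time factor, the Python sketch below computes availability for a month. The 30-day month and the 3.6 hours of downtime are made-up figures, and `availability_percent` is a hypothetical helper name.

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of the period for which the service was available."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must lie between 0 and total_hours")
    return 100.0 * (total_hours - downtime_hours) / total_hours


hours_in_month = 30 * 24  # 720 hours in a 30-day month
print(round(availability_percent(hours_in_month, 3.6), 2))  # prints 99.5
```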


Reliability factors

▪ Downtime:
The period during which a system is broken down or otherwise out of action. Zero downtime refers to a system that is available all the time.
▪ Mean time between failures (MTBF):
Mean time between failures is calculated by taking the average of the time between failures of a system.
▪ Mean time to failure (MTTF):
Mean time to failure is the average time for which the system is expected to continue its operation before system failure.
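These averages can be computed from a simple incident log, as in the Python sketch below. The log entries are invented, and the intervals follow the timeline used in this lesson (time between failures measured from one failure to the next, time to failure measured from the resumption of normal operation to the next failure); other texts sometimes define these terms differently.

```python
from statistics import mean

# Hypothetical log: (hour the system failed, hour it resumed normal operation).
incidents = [(100, 104), (300, 302), (650, 656)]

# Time to repair: from each failure to the resumption of normal operation.
repair_times = [resumed - failed for failed, resumed in incidents]

# Pair each incident with the one that follows it.
consecutive = list(zip(incidents, incidents[1:]))

# Time to failure: from a resumption to the next failure.
times_to_failure = [nxt[0] - cur[1] for cur, nxt in consecutive]

# Time between failures: from one failure to the next failure.
times_between_failures = [nxt[0] - cur[0] for cur, nxt in consecutive]

mean_time_to_repair = mean(repair_times)
mttf = mean(times_to_failure)
mtbf = mean(times_between_failures)
print(mean_time_to_repair, mttf, mtbf)
```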


Let’s review some concepts

▪ Reliability: Reliability of any computer-related component is an attribute that denotes its consistent performance according to the specifications.
▪ Critical systems: Critical systems are computer systems that must be highly reliable as their failure may have a great impact on human lives.
▪ Backup: Duplicate data and files stored in a separate server or storage drive to improve the reliability of a system are called backup.
▪ Redundancy: Redundancy is the duplication of critical parts of a computer system to improve reliability. (Hardware, software and data)
▪ Fault-tolerance: Fault tolerance is a property that enables a system to operate properly even if the system undergoes one or more failures.
▪ Statistical parameters to measure reliability: percentage of time, number of hours, downtime, mean time between failures and mean time to failure.
2. Activity

Activity-1
Duration: 15 minutes

You are a programmer developing a banking system.


A. What are the important parts of this system? In what ways
could these parts fail?
B. How can you protect the system from possible failures?

3. End of topic questions

End of topic questions


1. Where are backups stored?
2. What are the different types of redundancy? How are they
useful in improving the reliability of systems?
3. What is a fault-tolerant system?
4. What is a fail-soft system?
5. How can the reliability of a system be measured? Write down
the different parameters with a line of explanation.
