Presentation - 02 Reliability in Computer Systems
Reliability in computer
systems
teachcomputerscience.com
Lesson Objectives
1. Content
What is reliability?
▪ Reliability of any computer-related component is an attribute
that denotes its consistent performance according to the
specifications.
Critical system
What is backup?
▪ A backup is a duplicate copy of data and files, stored on a separate
server or storage drive to improve the reliability of a system.
▪ This protects the data from being lost due to failure.
▪ Backup is also useful when data is accidentally overwritten.
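As a rough illustration of the points above, the following Python sketch copies a file to a separate backup folder, checks that the copy matches the original, and restores it after an accidental overwrite. The function names (`back_up`, `restore`, `sha256_of`) and the folder layout are assumptions made for this example, not part of any particular backup product.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify that the copy matches."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / source.name
    shutil.copy2(source, copy)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(copy):
        raise IOError(f"backup of {source} does not match the original")
    return copy


def restore(copy: Path, target: Path) -> None:
    """Overwrite `target` with the backed-up copy."""
    shutil.copy2(copy, target)
```

In a real system the backup folder would sit on a different drive or server, so that one hardware failure cannot destroy both the original and the copy.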
Backup procedure
Disaster recovery
Disaster recovery is the process of restoring lost data from the backup after a
system failure.
Consider the example of a hardware failure. To recover from this failure, the
hardware is repaired or replaced with new hardware. The data is then recovered
from the backup and copied onto the repaired or replacement hardware.
Examples of precautionary measures an organisation can take to avoid disaster
include an uninterruptible power supply (UPS), surge protectors (to limit the
effect of power surges on electronic equipment), fire prevention and anti-virus
software.
Redundancy
Types of redundancy
Hardware redundancy:
Computer systems have an extra critical hardware device to avoid failure.
Example: A system is provided with two power supplies in a parallel setup so that
they can be easily switched if one of them fails.
Redundant array of independent disks (RAID): multiple physical disk drives are
used to store redundant data.
Software redundancy:
Redundant software is used to replace the original program in case it fails.
Data redundancy:
Redundant data in the backup can replace the original data in case the original
data is lost or overwritten accidentally.
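The power-supply example can be sketched in a few lines of Python. This is a simplified model only: the `PowerSupply` class and the `active_supply` function are hypothetical names invented for this illustration, and a real system would switch supplies in hardware.

```python
class PowerSupply:
    """A toy model of one power supply (hypothetical, for illustration)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy


def active_supply(supplies: list) -> PowerSupply:
    """Return the first healthy supply, in order of preference."""
    for supply in supplies:
        if supply.healthy:
            return supply
    raise RuntimeError("all redundant supplies have failed")


primary = PowerSupply("primary")
standby = PowerSupply("standby")

print(active_supply([primary, standby]).name)  # prints "primary"
primary.healthy = False                        # simulate a failure
print(active_supply([primary, standby]).name)  # prints "standby"
```

The system only stops when every redundant component has failed, which is the point of keeping the extra device.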
What is fault-tolerance?
▪ Fault tolerance is a property that enables a system to operate
properly even if the system undergoes one or more failures.
▪ Essential for life-critical systems.
▪ This design enables a system to continue its operation, possibly
at a reduced level, rather than failing completely, even when
some parts of the system fail.
▪ Data is protected from damage, intrusion or disclosure.
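One common way to keep operating at a reduced level is to fall back to a simpler source of data when the main one fails. The sketch below assumes two hypothetical functions, `read_live_sensor` and `read_last_known_value`, which stand in for any primary and backup operation.

```python
def read_with_fallback(primary, fallback):
    """Try the primary operation; if it fails, use the fallback instead.

    Returns the result together with the mode the system is running in.
    """
    try:
        return primary(), "normal"
    except Exception:
        # The primary part has failed, so continue at a reduced level
        # rather than stopping completely.
        return fallback(), "degraded"


def read_live_sensor():
    # Hypothetical primary source that is currently failing.
    raise TimeoutError("sensor offline")


def read_last_known_value():
    # Hypothetical backup source: the last value that was recorded.
    return 21.5


value, mode = read_with_fallback(read_live_sensor, read_last_known_value)
print(value, mode)  # prints "21.5 degraded"
```

Reporting the mode alongside the value lets the rest of the system know it is running on reduced information.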
Defensive programming
Measuring reliability
Time between failures
Reliability factors
Percentage of time:
The percentage of a given period, such as a particular month, for which the
service was available and operational.
Number of hours:
The amount of time for which the system has operated without reporting any
problems.
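The percentage of time can be worked out as (total time - downtime) / total time x 100. A minimal Python sketch follows; the function name `availability_percent` and the example figures (a 30-day month of 720 hours with 3.6 hours out of action) are assumptions chosen for illustration.

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of the period for which the service was available."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return 100.0 * (total_hours - downtime_hours) / total_hours


# A 30-day month has 720 hours; assume 3.6 of them were spent out of action.
print(round(availability_percent(720, 3.6), 2))  # prints 99.5
```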
Downtime:
The period during which a system has broken down or is otherwise out of action.
Zero downtime refers to a system that is available all the time.
Mean time between failures (MTBF):
Mean time between failures is calculated by taking the average of the time
between successive failures of a system.
Mean time to failure (MTTF):
Mean time to failure is the length of time for which the system is expected to
continue operating before it fails.
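Both averages can be sketched in Python. The example assumes failures are logged as hours since the system was switched on and that repair time is small enough to ignore; the function names `mtbf` and `mttf` and all figures are made up for illustration.

```python
from statistics import mean


def mtbf(failure_times: list) -> float:
    """Average gap between consecutive failures (same unit as the input)."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [later - earlier
            for earlier, later in zip(failure_times, failure_times[1:])]
    return mean(gaps)


def mttf(lifetimes: list) -> float:
    """Average running time before failure, over several identical units."""
    return mean(lifetimes)


# Failures logged at hours 100, 350 and 500: gaps of 250 and 150 hours.
print(mtbf([100, 350, 500]))    # prints 200
# Three units that ran for 900, 1100 and 1000 hours before failing.
print(mttf([900, 1100, 1000]))  # prints 1000
```

MTBF is normally quoted for systems that are repaired and put back into service, while MTTF is used for components that are replaced once they fail.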
Activity-1
Duration: 15 minutes
3. End of topic questions