Software Engineering
Lecture 17
Reliability Engineering (Part-I)
Lecturer
Dr. Faris S. Alghareb
PhD in Computer Engineering @ UCF
email: [email protected]
Copyright © 2020 Faris S. Alghareb. All rights reserved.
Ch11 – Reliability Engineering
Topics covered in this chapter:
v Availability and reliability
v Reliability requirements
v Fault-tolerant architectures
v Reliability measurement
Objectives:
v We expect that software will operate without crashes and failures and will preserve our data and
personal information. We need to be able to trust the software that we use
→ this means that the software must be reliable
v The use of software engineering techniques, better programming languages, and effective quality
management has led to significant improvements in software reliability over the past 20 years.
v Nevertheless, system failures still occur that affect the system’s availability or lead to incorrect
results being produced.
v In situations where software has a particularly critical role—perhaps in an aircraft or as part of the
national critical infrastructure—special reliability engineering techniques may be used to achieve
the high levels of reliability and availability that are required.
Fault–error–failure model
Brian Randell, a pioneer researcher in software reliability, defined a fault–error–failure model based on the notion that human errors cause faults, faults lead to errors, and errors lead to system failures.
System faults do not always lead to system failures, for the following reasons:
1) The code containing the fault may never be executed, so no erroneous state ever arises.
2) The erroneous system state resulting from the fault may be transient and ‘corrected’ before a failure arises.
3) The system may include fault detection and protection mechanisms. These ensure that the erroneous behavior is discovered and corrected before the system services are affected.
4) Users adapt their behavior to avoid using inputs that they know can cause program failures.
Fault tolerance
There are three complementary approaches that are used to improve the reliability of a system. One of them is fault tolerance: the system is designed so that faults or unexpected system behavior during execution are detected at runtime and are managed in such a way that system failure does not occur (built-in runtime checking).
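As a loose illustration (a hypothetical read_sensor function, not the lecture's example), the sketch below shows built-in runtime checking: an erroneous value is detected at runtime and handled so that it does not become a failure.

```python
# Minimal sketch of built-in runtime checking (hypothetical example, not
# from the lecture): the result of a computation is checked at runtime, and
# an erroneous state is handled so that it does not turn into a failure.

def read_sensor(raw_value: float, last_good_value: float) -> float:
    """Return a validated sensor reading, falling back on the last good value."""
    # Runtime check: detect an erroneous state (out-of-range reading).
    if not (0.0 <= raw_value <= 100.0):
        # Fault detected and managed: recover instead of propagating the error.
        return last_good_value
    return raw_value

if __name__ == "__main__":
    print(read_sensor(42.0, 40.0))    # valid reading -> 42.0
    print(read_sensor(-999.0, 40.0))  # erroneous reading -> falls back to 40.0
```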
v At some stage, even for critical systems, the costs of additional fault-removal effort become unjustifiable. As a result, software companies accept that their software will always contain some residual faults.
v The level of faults depends on the type of system. Software products have a relatively high level of faults,
whereas critical systems usually have a much lower fault density.
v The rationale for accepting faults is that, if and when the system fails, it is cheaper to pay for the
consequences of failure than it would be to discover and remove the faults before system delivery.
Reliability
The probability of failure-free operation over a
specified time, in a given environment, for a
specific purpose.
Availability
The probability that a system, at a point in time,
will be operational and able to deliver the
requested services.
Ø Both of these attributes can be expressed quantitatively, e.g., if the availability is 0.999, this means that, over some time period, the system is available for 99.9% of that time.
Ø If, on average, 2 inputs in every 1000 result in failures, then the reliability, expressed as a rate of
occurrence of failure, is 0.002.
v Perceptions of a system’s reliability may differ depending on the environment in which the software system is used.
Ø Usage of a system in an office environment is likely to be quite different from usage of the same system
in a university environment.
v A technical definition of failure is behavior that does not conform to the system’s specification.
v Reliability can only be defined formally with respect to a system specification, i.e., a failure is a deviation from a specification.
Ø Software specifications are often incomplete or incorrect, and it is left to software engineers to interpret
how the system should behave.
Ø Failure is therefore not something that can be objectively defined. Rather, it is a judgment made by users
of a system. This is one reason why users do not all have the same impression of a system’s reliability.
v Faults that affect the reliability of the system for one user may never show up under someone else’s
mode of working.
[Figure: A program maps an input set to an output set; the subset of inputs Ie causes the erroneous outputs Oe.]
Ø For example, if inputs in the set Ie are executed by frequently used parts of the system, then failures will be frequent.
Ø However, if the inputs in Ie are executed by code that is rarely used, then users will hardly ever see failures.
v Furthermore, the disruption caused by unavailable systems is not reflected in the simple availability
metric that specifies the percentage of time that the system is available. The time when the system
fails is also important.
v Reliability and availability are closely related, but sometimes one is more important than the other.
Ø If users expect continuous service from a system, then the system has a high-availability
requirement.
Ø If a system can recover quickly from failures without loss of user data, then these failures may not significantly affect system users.
Availability  Explanation
0.9           The system is available for 90% of the time. This means that, in a 24-hour period (1440 minutes), the system will be unavailable for 144 minutes.
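The arithmetic behind the table can be sketched as follows (illustrative code, not part of the lecture; downtime_minutes is an assumed helper name):

```python
# Illustrative sketch (not from the lecture): downtime implied by an
# availability figure over a 24-hour (1440-minute) period.

def downtime_minutes(availability: float, period_minutes: int = 1440) -> float:
    """Minutes of unavailability implied by an availability fraction."""
    return (1.0 - availability) * period_minutes

if __name__ == "__main__":
    for avail in (0.9, 0.99, 0.999, 0.9999):
        print(f"Availability {avail}: ~{downtime_minutes(avail):.1f} minutes "
              f"of downtime per 24 hours")
    # Availability 0.9 -> 144.0 minutes, matching the table above.
```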
v The overall reliability of a system depends on the hardware reliability, the software reliability, and
the reliability of the system operators.
Ø Hardware reliability focuses on the probability that a hardware component fails.
Ø Software reliability focuses on the probability that a software component will produce an incorrect output. Software does not wear out, and it can continue to operate after producing an incorrect result.
Ø Operator reliability focuses on the probability that a system user makes an error.
v The system software has to take all requirements into account. As well as including requirements
that compensate for software failure, there may also be related reliability requirements to help detect
and recover from hardware failures and operator errors.
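As a simplified illustration only (the independence assumption below is mine, not stated in the lecture), overall reliability is sometimes approximated as the product of the hardware, software, and operator reliabilities:

```python
# Simplified illustration (independence of failures is an assumption, not a
# claim from the lecture): combining component reliabilities by multiplication.

def system_reliability(r_hardware: float, r_software: float, r_operator: float) -> float:
    """Combined probability of failure-free operation under independence."""
    return r_hardware * r_software * r_operator

if __name__ == "__main__":
    # Hypothetical figures for one operational period.
    print(round(system_reliability(0.999, 0.995, 0.99), 3))  # ~0.984
```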
1) Probability of Failure on Demand (POFOD): The likelihood that a service request will result in a system failure (failures/requests over a period).
Ø POFOD = 0.001 means that 1 out of 1000 service requests results in a failure.
Ø POFOD should be used in situations where a failure on demand can lead to a serious system failure.
Ø It is relevant for many safety-critical systems, such as an emergency shutdown system in a chemical plant.
2) Rate of Occurrence of Failures (ROCOF): The frequency with which failures are likely to occur, expressed as the number of failures per unit of operational time or per number of transactions.
Ø ROCOF = 0.002 means that 2 failures are likely in every 1000 operational time units or transactions.
Ø It is relevant for systems that process a large number of similar, relatively frequent requests, such as transaction processing systems.
3) Availability (AVAIL): The probability that a system will be operational when a demand is made for service.
Ø Availability of 0.9999 means the system is available 99.99% of the time.
Ø Appropriate for systems offering a continuous service, where customers expect it to be there all
the time e.g., VisaNet.
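A rough sketch of how these three metrics might be estimated from operational data (illustrative code with assumed function names; not from the lecture):

```python
# Illustrative sketch (assumed names, not from the lecture): estimating the
# three reliability metrics from simple operational counts.

def pofod(failed_requests: int, total_requests: int) -> float:
    """Probability of Failure on Demand: failures per service request."""
    return failed_requests / total_requests

def rocof(failures: int, operational_time_units: float) -> float:
    """Rate of Occurrence of Failures: failures per unit of operational time."""
    return failures / operational_time_units

def avail(uptime: float, downtime: float) -> float:
    """Availability: fraction of time the system can deliver service."""
    return uptime / (uptime + downtime)

if __name__ == "__main__":
    print(pofod(1, 1000))              # 0.001 -> 1 failure in 1000 requests
    print(rocof(2, 1000))              # 0.002 -> 2 failures per 1000 time units
    print(round(avail(9990, 10), 4))   # 0.999 -> available 99.9% of the time
```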
v Non-functional reliability requirements are specifications of the required reliability and availability
of a system using one of the reliability metrics (POFOD, ROCOF or AVAIL).
v You may have different requirements for different parts of the system if some parts are more critical
than others.
v You should follow these three guidelines when specifying reliability requirements:
1) Specify the availability and reliability requirements for different types of failure.
2) Specify the availability and reliability requirements for different types of system service. Critical system
services should have the highest reliability but you may be willing to tolerate more failures in less critical
services.
3) Think about whether a high level of reliability is really required. Other mechanisms can be used to
provide reliable system service.
Software Engineering
Lecture 18
Reliability Engineering (Part-II) – Fault Tolerance
Lecturer
Dr. Faris S. Alghareb
PhD in Computer Engineering @ UCF
email: [email protected]
Copyright © 2020 Faris S. Alghareb. All rights reserved.
Why is Fault Tolerance required?
v Fault tolerance is a runtime approach used to achieve dependability in which systems
include mechanisms to continue in operation, even after a software or hardware fault has
occurred and the system state is erroneous.
v Fault tolerance mechanisms detect and correct this erroneous state so that the
occurrence of a fault does not lead to a system failure.
v Fault tolerance is required in systems that are safety or security critical and where the
system cannot move to a safe state when an error is detected.
v To provide fault tolerance, the system architecture has to be designed to include redundant
and diverse hardware and software.
v Computations are carried out on separate channels, and the outputs of these computations are
compared.
Ø If the outputs are identical and are available at the same time, then the system is judged to be operating
correctly.
Ø If the outputs are different, then a failure is assumed. When this occurs, the system raises a failure
exception on the status output line. This signals that control should be transferred to some other system.
[Figure: Self-monitoring architecture — an input value passes through a splitter to Channel 1 and Channel 2; a comparator checks the two channel outputs and produces the output value and a status signal.]
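A minimal sketch of the comparator logic in this architecture (hypothetical channel functions; in a real system the channels would run on diverse hardware):

```python
# Illustrative sketch of the self-monitoring comparator (hypothetical channel
# functions, not the lecture's code): the same input is processed by two
# channels, and the comparator raises a failure exception if they disagree.

class ChannelDisagreement(Exception):
    """Raised on the status line when the two channel outputs differ."""

def channel_1(x: int) -> int:
    return x * x          # e.g., computed on one processor type

def channel_2(x: int) -> int:
    return x ** 2         # diverse implementation of the same computation

def self_monitoring_compute(x: int) -> int:
    out1, out2 = channel_1(x), channel_2(x)
    if out1 != out2:
        # Signal that control should be transferred to some other system.
        raise ChannelDisagreement(f"channel outputs differ: {out1} != {out2}")
    return out1

if __name__ == "__main__":
    print(self_monitoring_compute(7))  # 49 when both channels agree
```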
Self-monitoring system
1) The hardware used in each channel is diverse. In practice, this might mean that each channel uses a different processor type to carry out the required computations, or the chipset making up the system may be sourced from different manufacturers.
2) The software used in each channel is diverse. Otherwise, the same software error could arise at the same time on each channel.
v For many medical treatment and diagnostic systems, reliability is more important than
availability because an incorrect system response could lead to the patient receiving incorrect
treatment.
Ø If the system shuts down in the event of an error, this is an inconvenience, but the patient will not usually be harmed.
The designers of the Airbus system have tried to achieve diversity in a number of
different ways:
[Figure: five numbered diversity mechanisms used in the Airbus flight control system]
Under these configurations, in more than 15 years of operation, there have been no reported failures of the flight control system that led to loss of control of the aircraft.
v The server hardware is usually identical, and the servers run the same version of the
software. Therefore,
Ø They can cope with hardware failures and software failures that are localized to a single machine.
Ø They cannot cope with software design problems that cause all versions of the software to
fail at the same time.
Ø To handle software design failures, a system has to use diverse software and hardware.
[Figure: A general diagram of Triple Modular Redundancy (TMR) with three identical modules]
Self-monitoring system – N-version Programming
v The implementation of N-version programming implies:
Ø Multiple versions of a software system carry out computations at the same time.
Ø The results are compared using a voting system and the majority result is taken to be the correct result.
Ø A fault manager may try to repair the faulty unit automatically, but if this is impossible, the system is
automatically reconfigured to take the unit out of service. The system then continues to function with
(𝑁 − 1) working units.
[Figure: N software versions (Version 1 … Version N) executing in parallel]
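A minimal sketch of the voting step in N-version programming (assumed version functions, not the lecture's code):

```python
# Illustrative sketch of N-version voting (assumed version functions, not
# the lecture's code): N independently written versions compute the same
# result, and the majority value is taken as the system output.

from collections import Counter
from typing import Callable, Sequence

def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority result."""
    outputs = [version(x) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: cannot mask the faulty version(s)")
    return value

if __name__ == "__main__":
    # Three versions of the same computation; one contains a fault.
    v1 = lambda x: x + 1
    v2 = lambda x: x + 1
    v3 = lambda x: x + 2          # faulty version, outvoted by v1 and v2
    print(majority_vote([v1, v2, v3], 10))  # 11
```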
v Using hardware units that have a common specification but that are designed and built by different
manufacturers reduces the chances of such a Common Mode Failure (CMF).
Ø It is assumed that the probability of different teams making the same design or manufacturing error
is small.
v Using a common specification, the same software system is implemented by a number of teams.
Ø These versions are executed on separate computers.
Ø Inconsistent outputs, or outputs that are not produced in time, are rejected.
v At least three versions of the system should be available so that at least two versions will be consistent in the event of a single failure.
Ø they should not include common errors and so will not fail in the same way, at the same time.
Ø the software should therefore be written by different teams who should not communicate during the
development process.
v However, it still requires several different teams to develop different versions of the software.
→ leads to very high software development costs (paying for many programmers/developers
working in different teams).
v Therefore, the N-version programming approach is only used in systems where it is impractical
to provide a protection system that can guard against safety-critical failures.
To assess the reliability of a system, you have to collect data about its operation. The data required may include:
1) The number of system failures, given a number of requests for system services. This is used to measure POFOD.
2) The time or the number of transactions between system failures, plus the total elapsed time or total number of transactions. This is used to measure ROCOF and MTTF.
3) The repair or restart time after a system failure that leads to loss of service. This is used in the measurement of availability. Availability does not just depend on the time between failures but also on the time required to get the system back into operation.
v For a transaction-based system, the ROCOF would then be the number of failed transactions per N thousand transactions processed.
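A rough sketch (assumed data format, not from the lecture) of how ROCOF, MTTF, MTTR, and availability can be derived from such collected data:

```python
# Illustrative sketch (assumed data format, not from the lecture): deriving
# ROCOF, MTTF, MTTR, and availability from collected operational data.

def reliability_summary(uptimes_hours, repair_times_hours):
    """uptimes_hours: time between successive failures;
    repair_times_hours: restart/repair time after each failure."""
    failures = len(uptimes_hours)
    total_uptime = sum(uptimes_hours)
    total_downtime = sum(repair_times_hours)
    return {
        "ROCOF": failures / total_uptime,        # failures per operating hour
        "MTTF": total_uptime / failures,         # mean time to failure
        "MTTR": total_downtime / failures,       # mean time to repair
        "AVAIL": total_uptime / (total_uptime + total_downtime),
    }

if __name__ == "__main__":
    # Hypothetical log: 4 failures over ~2000 hours of operation.
    print(reliability_summary([480, 520, 510, 490], [2, 1, 3, 2]))
```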
v Reliability metrics such as POFOD, the probability of failure on demand, and ROCOF, the rate of
occurrence of failure, may be used to quantitatively specify the required software reliability.
Ø This cannot normally be included as part of a normal defect testing process because the data used for defect testing is (usually) atypical of actual usage data.
v The process of measuring the reliability of a system is sometimes called statistical testing.
v The statistical testing process is explicitly geared to reliability measurement rather than fault
finding.
v Statistical testing relies on an operational profile: a specification of classes of input and the probability of their occurrence. When a new software system replaces an existing automated system, it is reasonably easy to assess the probable pattern of usage of the new software.
Ø It should correspond to the existing usage, with some allowance made for the new functionality that is
(presumably) included in the new software.
v Typically, the operational profile is such that the inputs that have the highest probability of being
generated fall into a small number of classes.
[Figure: A typical operational profile — number of inputs per input class]
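A minimal sketch (hypothetical input classes, not from the lecture) of how an operational profile can drive statistical test-input generation, so that frequently occurring input classes are exercised in proportion to their use:

```python
# Illustrative sketch (hypothetical input classes, not from the lecture):
# using an operational profile to generate statistical test inputs.

import random

# Operational profile: input class -> probability of occurrence (hypothetical).
OPERATIONAL_PROFILE = {
    "normal_query": 0.70,
    "update": 0.20,
    "admin_command": 0.07,
    "malformed_input": 0.03,
}

def generate_test_inputs(n: int, profile=OPERATIONAL_PROFILE, seed: int = 0):
    """Draw n test-input classes according to the operational profile."""
    rng = random.Random(seed)
    classes = list(profile.keys())
    weights = list(profile.values())
    return rng.choices(classes, weights=weights, k=n)

if __name__ == "__main__":
    sample = generate_test_inputs(1000)
    print({c: sample.count(c) for c in OPERATIONAL_PROFILE})
```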
Operational Profiles
v When a software system is new and innovative, however, it is difficult to anticipate how it will
be used.
v Consequently, it is practically impossible to create an accurate operational profile.
Developing an accurate operational profile may be difficult or impossible for the following
reasons: