
Ninevah University

College of Electronics Engineering


Computer and Informatics Engineering Department

Software Engineering
Lecture 17
Reliability Engineering (Part-I)

Lecturer
Dr. Faris S. Alghareb
PhD in Computer Engineering @ UCF
email: [email protected]
Copyright © 2020 Faris S. Alghareb. All rights reserved.
Ch11 – Reliability Engineering
Topics covered in this chapter:
v Availability and reliability
v Reliability requirements
v Fault-tolerant architectures
v Reliability measurement

Objectives:

Ø understand the distinction between software reliability and software availability;


Ø have been introduced to metrics for reliability specification and how these are used to specify
measurable reliability requirements;
Ø understand how different architectural styles may be used to implement reliable, fault-tolerant
systems architectures; and
Ø understand how the reliability of a software system may be measured using statistical testing.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 1


Software Reliability
v Our dependence on software systems for almost all aspects of our business and personal lives
means that we expect that software to be available when we need it.

v We expect that software will operate without crashes and failures and will preserve our data and
personal information. We need to be able to trust the software that we use
→ this means that the software must be reliable

v The use of software engineering techniques, better programming languages, and effective quality
management has led to significant improvements in software reliability over the past 20 years.

v Nevertheless, system failures still occur that affect the system’s availability or lead to incorrect
results being produced.

v In situations where software has a particularly critical role—perhaps in an aircraft or as part of the
national critical infrastructure—special reliability engineering techniques may be used to achieve
the high levels of reliability and availability that are required.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 2


Fault–Error–Failure Model
v Failure in Time (FIT): the failure rate is usually denoted by the symbol λ and has the dimension of reciprocal time. One FIT corresponds to a failure rate of 10^-9 per hour; in other words, 1 FIT means 1 failure in 1,000,000,000 hours (10^9 hours) of operation.

Human error: Human behavior that results in the introduction of faults into a system.
System fault: A characteristic of a software system that can lead to a system error.
System error: An erroneous system state during execution that can lead to system behavior that is unexpected by system users.
System failure: An event that occurs at some point in time when the system does not deliver a service as expected by its users.

Fault–error–failure model (human error → system fault → system error → system failure):
Brian Randell, a pioneer researcher in software reliability, defined a fault–error–failure model based on the notion that human errors cause faults, faults lead to errors, and errors lead to system failures.
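Ø A minimal sketch of my own (not from the slides) showing how a FIT rating translates into a failure rate λ and a mean time to failure; the 500 FIT component is a hypothetical example.

# Hedged sketch: converting a FIT rating into a failure rate (lambda) and MTTF.
FIT_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 hours of operation

def failure_rate_per_hour(fit: float) -> float:
    """Failure rate lambda (failures/hour) for a component rated at `fit` FIT."""
    return fit / FIT_HOURS

def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours (the reciprocal of the failure rate)."""
    return 1.0 / failure_rate_per_hour(fit)

# Hypothetical component rated at 500 FIT:
print(mttf_hours(500))  # 2,000,000 hours between failures, on average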

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 3


Faults and Failures
v System faults do not necessarily result in system errors, and system errors do not necessarily result
in system failures:

1) The faulty code may never be executed.

2) The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises.

3) The system may include fault detection and protection mechanisms. These ensure that the erroneous
behavior is discovered and corrected before the system services are affected.

4) Users adapt their behavior to avoid using inputs that they know can cause program failures.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 4


Fault Management

There are three complementary approaches that are used to improve the reliability of a system:

Fault avoidance
The software design and implementation process should use approaches to software development that help avoid design and programming errors and so minimize the number of faults introduced into the system. For example, allow extensive compiler checking and minimize the use of error-prone programming language constructs, such as pointers.

Fault detection and correction
Verification and validation processes are designed to discover and remove faults in a program before it is deployed for operational use.

Fault tolerance
The system is designed so that faults or unexpected system behavior during execution are detected at runtime and are managed in such a way that system failure does not occur (built-in runtime checking).

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 5


The increasing costs of residual fault removal
v As the software becomes more reliable, you need to spend more and more time and effort to find fewer and
fewer faults.

v At some stage, even for critical systems, the costs of this additional effort become unjustifiable. As a result,
software companies accept that their software will always contain some residual faults.

v The level of faults depends on the type of system. Software products have a relatively high level of faults,
whereas critical systems usually have a much lower fault density.

v The rationale for accepting faults is that, if and when the system fails, it is cheaper to pay for the
consequences of failure than it would be to discover and remove the faults before system delivery.

Ø As can be seen, the cost of finding and removing the remaining faults in a software system rises exponentially as program faults are discovered and removed.

[Figure: cost per error detected versus number of residual errors.]


EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 6
Availability and Reliability

Reliability
The probability of failure-free operation over a
specified time, in a given environment, for a
specific purpose.

Availability
The probability that a system, at a point in time,
will be operational and able to deliver the
requested services.

Ø Both of these attributes can be expressed quantitatively, e.g., if the availability is 0.999, this means that, over some time period, the system is available for 99.9% of that time.

Ø If, on average, 2 inputs in every 1000 result in failures, then the reliability, expressed as a rate of
occurrence of failure, is 0.002.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 7


Reliability and specifications
v System reliability is not an absolute value—it depends on where and how that system is used.

v Perceptions of a system’s reliability may differ depending on the environment in which the software is used.
Ø Usage of a system in an office environment is likely to be quite different from usage of the same system
in a university environment.
v A technical definition of failure is behavior that does not conform to the system’s specification.

v Reliability can only be defined formally with respect to a system specification i.e., a failure is a
deviation from a specification.
Ø Software specifications are often incomplete or incorrect, and it is left to software engineers to interpret
how the system should behave.

Ø No one except system developers reads software specification documents.

Ø Failure is therefore not something that can be objectively defined. Rather, it is a judgment made by users
of a system. This is one reason why users do not all have the same impression of a system’s reliability.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 8


Why does reliability depend on the environment?
v The program’s reliability depends on the number of system inputs that are members of the set of
inputs that lead to an erroneous output—in other words, the set of inputs that cause faulty code to be
executed and system errors to occur.

v Faults that affect the reliability of the system for one user may never show up under someone else’s
mode of working.
[Figure: a program maps an input set, containing a subset Ie of inputs that cause erroneous outputs, to an output set containing the corresponding erroneous outputs Oe.]

Ø For example, if inputs in the set Ie are executed by frequently used parts of the system, then failures will be frequent.

Ø However, if the inputs in Ie are executed by code that is rarely used, then users will hardly ever see failures.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 9


Availability Perception
v The availability of a system does not just depend on the number of system failures, but also on the
time needed to repair the faults that have caused the failure.

v Furthermore, the disruption caused by unavailable systems is not reflected in the simple availability
metric that specifies the percentage of time that the system is available. The time when the system
fails is also important.

v Reliability and availability are closely related, but sometimes one is more important than the other.
Ø If users expect continuous service from a system, then the system has a high-availability
requirement.
Ø If a system can recover quickly from failures without loss of user data, then these failures may not significantly affect system users.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 10


Availability Specifications

Availability   Explanation
0.9            The system is available for 90% of the time. This means that, in a 24-hour period (1,440 minutes), the system will be unavailable for 144 minutes.
0.99           In a 24-hour period, the system is unavailable for 14.4 minutes.
0.999          The system is unavailable for 1.44 minutes in a 24-hour period.
               Elaboration: 1 − 0.999 = 0.001; one day has 24 × 60 = 1,440 minutes, so the system is unavailable for 1,440 × 0.001 = 1.44 minutes.
0.9999         The system is unavailable for 8.64 seconds in a 24-hour period (roughly one minute per week).
               Elaboration: 8.64 seconds × 7 days = 60.48 seconds ≈ 1 minute/week.
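Ø The elaborations above can be reproduced with a short calculation; this sketch (my own, not part of the slides) converts an availability figure into downtime per 24-hour period.

# Downtime implied by an availability figure over a 24-hour period.
def daily_downtime_minutes(availability: float) -> float:
    """Minutes of unavailability in a 24-hour (1,440-minute) period."""
    return (1.0 - availability) * 24 * 60

for a in (0.9, 0.99, 0.999, 0.9999):
    print(f"availability {a}: unavailable {daily_downtime_minutes(a):.2f} minutes/day")
# 0.9 -> 144.00, 0.99 -> 14.40, 0.999 -> 1.44, 0.9999 -> 0.14 (about 8.64 seconds)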

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 11


System Reliability Requirements
v Reliability requirements can be specified by:
a) Functional requirements to define error checking and recovery facilities and protection against system
failures.
b) Non-functional requirements defining the required reliability and availability of the system.

v The overall reliability of a system depends on the hardware reliability, the software reliability, and
the reliability of the system operators.
Ø Hardware reliability focuses on the probability that a hardware component fails.
Ø Software reliability focuses on the probability that a software component will produce an incorrect output. Unlike hardware, software does not wear out, and it can continue to operate after producing a bad result.
Ø Operator reliability focuses on the probability that a system user makes an error.

v The system reliability requirements have to take all of these factors into account. As well as including requirements that compensate for software failure, there may also be related reliability requirements to help detect and recover from hardware failures and operator errors.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 12


Reliability Metrics
v Reliability can be defined and measured.
v Reliability metrics are units of measurement of system reliability.
v Three metrics may be used to specify reliability and availability:

1) Probability of Failure on Demand (POFOD): The likelihood that a service request will result in a system failure (failures/requests over a period).
Ø POFOD = 0.001 means that 1 out of 1,000 service requests results in a failure.
Ø POFOD should be used in situations where a failure on demand can lead to a serious system failure.
Ø It is relevant for many safety-critical systems, such as an emergency shutdown system in a chemical plant.
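Ø As a small illustration (a sketch of mine, not from the lecture), POFOD can be estimated directly from logged demands and failures.

# Estimating POFOD from observed operation: failures divided by service requests.
def pofod(failed_demands: int, total_demands: int) -> float:
    """Observed probability of failure on demand."""
    return failed_demands / total_demands

print(pofod(1, 1000))  # 0.001: 1 out of 1,000 service requests results in a failure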

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 13


Reliability Metrics (Cont…)
2) Rate of occurrence of failures (ROCOF): This metric sets out the number of system failures that are likely to be observed relative to a certain time period (e.g., an hour), a number of system executions, an amount of processing time, or a number of transactions.
Ø ROCOF of 0.02 means that 2 failures are likely per 100 time units.
Ø ROCOF should be used when demands on systems are made regularly rather than
intermittently (banking systems, airline booking systems).
Ø The reciprocal of ROCOF is the mean time to failure (MTTF), which is sometimes used as a
reliability metric. MTTF is the average number of time units between observed system failures.
A ROCOF of two failures per hour implies that the mean time to failure is 30 minutes.

3) Availability (AVAIL): The probability that a system will be operational when a demand is made for service.
Ø Availability of 0.9999 means the system is available 99.99% of the time.
Ø Appropriate for systems offering a continuous service, where customers expect it to be there all
the time e.g., VisaNet.
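Ø A small sketch of my own (not from the slides) that pulls together the ROCOF, MTTF, and AVAIL definitions above, using hypothetical operational data.

# ROCOF and MTTF from hypothetical operational data (reciprocals of each other),
# plus AVAIL from uptime versus total elapsed time.
observed_failures = 2      # failures observed during the measurement period
elapsed_hours = 1.0        # total elapsed (calendar) time of the measurement
downtime_hours = 0.05      # time the system was unavailable for repair/restart

rocof = observed_failures / elapsed_hours                    # 2 failures per hour
mttf_minutes = 60.0 / rocof                                  # 30 minutes between failures
avail = (elapsed_hours - downtime_hours) / elapsed_hours     # 0.95

print(rocof, mttf_minutes, avail)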

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 14


Non-functional reliability requirements
v Reliability can be measured so non-functional reliability requirements may be specified
quantitatively.

v Non-functional reliability requirements are specifications of the required reliability and availability
of a system using one of the reliability metrics (POFOD, ROCOF or AVAIL).

v Quantitative reliability specification is useful in a number of ways:


1) The process of deciding the required level of reliability helps to clarify what stakeholders really need.
2) It provides a basis for assessing when to stop testing a system.
3) It is a means of assessing different design strategies intended to improve the reliability of a system.
4) If a regulator has to approve a system before it goes into service (e.g., all systems that are critical
to flight safety on an aircraft are regulated), then evidence that a required reliability target has been
met is important for system certification.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 15


Specifying Reliability Requirements
v To avoid incurring excessive and unnecessary costs, it is important that you specify the reliability that
you really need rather than simply choose a very high level of reliability for the whole system.

v You may have different requirements for different parts of the system if some parts are more critical
than others.

v You should follow these three guidelines when specifying reliability requirements:
1) Specify the availability and reliability requirements for different types of failure.
2) Specify the availability and reliability requirements for different types of system service. Critical system
services should have the highest reliability but you may be willing to tolerate more failures in less critical
services.
3) Think about whether a high level of reliability is really required. Other mechanisms can be used to
provide reliable system service.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 16


Functional Reliability Specification
v To achieve a high level of reliability and availability in a software-intensive system, you use a
combination of fault-avoidance, fault-detection, and fault-tolerance techniques.
v This means that functional reliability requirements have to be generated which specify how the
system should provide fault avoidance, detection, and tolerance.
v These functional reliability requirements should specify the faults to be detected and the actions to
be taken to ensure that these faults do not lead to system failures.
v There are four types of functional reliability requirements:
1) Checking requirements These requirements identify checks on inputs to the system to ensure that incorrect or out-of-range inputs are detected before they are processed by the system (a small sketch follows this list).
2) Recovery requirements These requirements are geared to helping the system recover after a failure has
occurred.
3) Redundancy requirements These specify redundant features of the system that ensure that a single
component failure does not lead to a complete loss of service.
4) Process requirements These are fault-avoidance requirements, which ensure that good practice is used in
the development process.
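Ø For example, a checking requirement might be realized by a simple input guard along the following lines (an illustrative sketch; the dose range and the names used are hypothetical, not taken from the lecture).

# Illustrative input check: reject out-of-range inputs before they are processed.
# The valid range (0-10 units) and the exception name are hypothetical.
class InvalidInputError(Exception):
    """Raised when an input fails its checking requirement."""

def check_dose(dose_units: float) -> float:
    """Return the dose if it is within the permitted range, otherwise raise."""
    if not (0.0 <= dose_units <= 10.0):
        raise InvalidInputError(f"dose {dose_units} outside permitted range 0-10")
    return dose_units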

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 17


Fault Tolerance
v In critical situations, software systems must be fault tolerant.
v Fault tolerance is a runtime approach to dependability in which systems include mechanisms to
continue in operation, even after a software or hardware fault has occurred and the system state is
erroneous.
v Fault-tolerance mechanisms detect and correct this erroneous state so that the occurrence of a
fault does not lead to a system failure.
v Fault tolerance is required in systems that are safety or security critical and where the system cannot
move to a safe state when an error is detected.
v To provide fault tolerance, the system architecture has to be designed to include redundant and
diverse hardware and software.
v Examples of systems that may need fault-tolerant architectures are aircraft systems that must be available throughout the duration of the flight.
v In general, there are three architectural patterns that have been used in fault-tolerant systems:
1) Protection systems
2) Self-monitoring architectures
3) N-version programming

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 18


End of Lecture 17
Ninevah University
College of Electronics Engineering
Computer and Informatics Engineering Department

Software Engineering
Lecture 18
Reliability Engineering (Part-II) – Fault Tolerance

Lecturer
Dr. Faris S. Alghareb
PhD in Computer Engineering @ UCF
email: [email protected]
Copyright © 2020 Faris S. Alghareb. All rights reserved.
Why is Fault Tolerance required?
v Fault tolerance is a runtime approach used to achieve dependability in which systems
include mechanisms to continue in operation, even after a software or hardware fault has
occurred and the system state is erroneous.

v Fault tolerance mechanisms detect and correct this erroneous state so that the
occurrence of a fault does not lead to a system failure.

v Fault tolerance is required in systems that are safety or security critical and where the
system cannot move to a safe state when an error is detected.

v To provide fault tolerance, the system architecture has to be designed to include redundant
and diverse hardware and software.

v Examples of systems that require fault-tolerant architectures include:


◆ aircraft systems that must be available throughout the duration of the flight,
◆ telecommunication systems, and
◆ critical command and control systems.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 19


Protection Systems
v A protection system is a specialized system that is associated with some other control system, which can take
emergency action if a failure occurs.
Ø System to stop a train if it passes a red light
v Protection systems independently monitor their environment.
Ø If sensors indicate a problem, then the protection system is activated to shut down the process or equipment.

v The advantage of this architectural style is that the protection system software can be much simpler than the software that is controlling the protected process.

v The only function of the protection system is to monitor operation and to ensure that the system is brought to a safe state in the event of an emergency.

[Figure: protection system architecture. A control system and a separate protection system both read sensors in the system environment; the control system drives the actuators of the controlled equipment, while the protection system can independently shut the equipment down.]
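Ø A minimal sketch of the idea (mine, with made-up sensor and shutdown interfaces): the protection system only reads its own sensors and forces a safe state when a limit is exceeded; it never has to replicate the control logic.

# Minimal protection-system sketch: monitor a sensor reading and force a safe
# state if it leaves the permitted range. read_sensor() and shut_down() stand
# in for real hardware interfaces and are assumptions of this sketch.
SAFE_MAX_TEMPERATURE = 120.0   # hypothetical safety limit

def protection_loop(read_sensor, shut_down) -> None:
    """Independently monitor the environment and trip the shutdown if needed."""
    while True:
        temperature = read_sensor()
        if temperature > SAFE_MAX_TEMPERATURE:
            shut_down()   # bring the controlled equipment to a safe state
            return        # the protection system's job is done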


EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 20
Self-monitoring architectures
v A self-monitoring architecture is a system architecture in which the system is designed to monitor
its own operation and to take some action if a problem is detected.

v Computations are carried out on separate channels, and the outputs of these computations are
compared.
Ø If the outputs are identical and are available at the same time, then the system is judged to be operating
correctly.
Ø If the outputs are different, then a failure is assumed. When this occurs, the system raises a failure
exception on the status output line. This signals that control should be transferred to some other system.

[Figure: self-monitoring architecture. The input value is passed through a splitter to Channel 1 and Channel 2; a comparator checks the two channel outputs and produces the output value together with a status signal.]
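Ø The comparator logic can be sketched as follows (my own illustration; the channel functions are placeholders for the two diverse implementations).

# Two-channel self-monitoring sketch: run the computation on both channels
# and only release the result if the outputs agree.
class ChannelMismatch(Exception):
    """Failure exception raised on the status line when the channels disagree."""

def self_monitoring_compute(channel_1, channel_2, input_value):
    """Return the agreed output, or raise so control transfers elsewhere."""
    out_1 = channel_1(input_value)
    out_2 = channel_2(input_value)
    if out_1 != out_2:
        raise ChannelMismatch("channel outputs differ: failure assumed")
    return out_1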
EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 21
Self-monitoring system

To be effective in detecting both hardware and software faults, self-monitoring systems


have to be designed so that:

1) The hardware used in each channel is diverse. In practice, this might mean that each channel uses a different processor type to carry out the required computations, or the chipset making up the system may be sourced from different manufacturers.

2) The software used in each channel is diverse. Otherwise, the same software error could arise at the same time on each channel.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 22


Self-monitoring system – Examples
v This architecture may be used in situations where it is important for computations to be correct,
but where availability is not essential.
Ø if the answers from each channel differ, the system shuts down.

v For many medical treatment and diagnostic systems, reliability is more important than
availability because an incorrect system response could lead to the patient receiving incorrect
treatment.
Ø if the system shuts down in the event of an error, this is an inconvenience but the patient will
not usually be harmed.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 23


Self-monitoring system – Examples
v If high availability is required, you may use several self-checking systems working in parallel.
Ø You need a switching unit that detects faults and selects a result from one of the systems where both channels are producing a consistent response.

v This is the approach used in the Airbus family of aircraft for their flight control systems.
Ø In the event of server failure, which can be detected by a lack of response, the faulty server is switched out of the system.
Ø Unprocessed requests are resubmitted to other servers for processing.

[Figure: The Airbus flight control system architecture]

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 24


Airbus architecture Example

The designers of the Airbus system have tried to achieve diversity in a number of different ways, to reduce the probability of common failures in different channels:

1) The primary flight control computers use a different processor from the secondary flight control systems.
2) The chipset that is used in each channel in the primary and secondary systems is supplied by a different manufacturer.
3) Software in the secondary flight control systems is less complex than that of the primary systems; it provides only critical functionality.
4) Software in each channel is developed in different programming languages by different teams.
5) Different programming languages are used in the secondary and primary systems.

Under these configurations, in more than 15 years of operation, there have been no reports of a flight control system failure leading to loss of control of the aircraft.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 25


Redundancy and Diversity
v Replicated servers provide redundancy but not usually diversity.

v The server hardware is usually identical, and the servers run the same version of the
software. Therefore,

Ø they can cope with hardware failures and software failures that are localized to a single
machine.

Ø They cannot cope with software design problems that cause all versions of the software to
fail at the same time.

Ø To handle software design failures, a system has to use diverse software and hardware.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 26


Self-monitoring system – TMR
v N-version programming is one approach to ensuring high reliability and fault tolerance of software on the basis of program redundancy and diversity.
v The notion of N-version programming has been derived from the notion of Triple Modular
Redundancy (TMR), as used in hardware systems.
v In a TMR system, the hardware unit is replicated three (or sometimes more) times.
Ø The output from each unit is passed to an output comparator that is usually implemented as a voting
system.
Ø This system compares all of its inputs, and, if two or more are the same, then that value is output. If one
of the units fails and does not produce the same output as the other units, its output is ignored.
Ø TMR has been used for many years to build systems that are tolerant of hardware failures.

[Figure: a general diagram of Triple Modular Redundancy (TMR). The input is fed to Module 1, Module 2, and Module 3; their outputs go to an output selector (voter) that produces the resilient output.]
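Ø The output comparator can be realized as a simple majority voter, sketched below (my own illustration, not taken from the slides).

# Majority voter for Triple Modular Redundancy: if at least two module outputs
# agree, that value is the resilient output; a dissenting module is ignored.
from collections import Counter

def tmr_vote(out_1, out_2, out_3):
    """Return the majority value of the three module outputs."""
    value, count = Counter([out_1, out_2, out_3]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two modules agree: no resilient output available")
    return value

print(tmr_vote(42, 42, 17))   # 42; the failed third module is out-voted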
EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 27
Self-monitoring system – N-version Programming
v The implementation of N-version programming implies:

Ø Multiple versions of a software system carry out computations at the same time.

Ø The results are compared using a voting system and the majority result is taken to be the correct result.

Ø A fault manager may try to repair the faulty unit automatically, but if this is impossible, the system is
automatically reconfigured to take the unit out of service. The system then continues to function with
(𝑁 − 1) working units.

[Figure: N software versions. The input is passed to Version 1, Version 2, ..., Version N; their results go to an output selector (voter) that produces the agreed result, with a fault manager handling versions whose outputs disagree.]
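Ø A sketch of the voting and reconfiguration idea (my own, simplified): versions whose outputs disagree with the majority are reported to a fault manager and can be taken out of service, leaving N − 1 working units.

# N-version voting sketch: take the majority result and report dissenting
# versions to a fault manager so they can be repaired or taken out of service.
from collections import Counter

def n_version_vote(outputs: dict, fault_manager):
    """outputs maps a version name to its result; returns the agreed result."""
    agreed, _ = Counter(outputs.values()).most_common(1)[0]
    for version, result in outputs.items():
        if result != agreed:
            fault_manager(version)   # e.g. reconfigure the system without this unit
    return agreed

faulty = []
print(n_version_vote({"v1": 7, "v2": 7, "v3": 9}, faulty.append))  # 7
print(faulty)   # ['v3'] is flagged for the fault manager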

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 28


Self-monitoring system – NMR
v Components could all have a common design fault, and thus, all produce the same (wrong) answer.

v Using hardware units that have a common specification but that are designed and built by different
manufacturers reduces the chances of such a Common Mode Failure (CMF).

Ø It is assumed that the probability of different teams making the same design or manufacturing error
is small.

v Using a common specification, the same software system is implemented by a number of teams.
Ø these versions are executed on separate computers.

Ø their outputs are compared using a voting system, and

Ø inconsistent outputs or outputs that are not produced in time are rejected.

v At least three versions of the system should be available so that two versions should be consistent
in the event of a single failure.
Ø they should not include common errors and so will not fail in the same way, at the same time.
Ø the software should therefore be written by different teams who should not communicate during the
development process.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 29


N-version programming VS. Self-monitoring system
v In systems where a high level of availability is required, N-version programming may be less expensive (in terms of hardware resource usage) than self-checking architectures.

v However, it still requires several different teams to develop different versions of the software.
→ This leads to very high software development costs (paying for many programmers/developers working in different teams).

v Therefore, the N-version programming approach is only used in systems where it is impractical
to provide a protection system that can guard against safety-critical failures.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 30


Reliability Measurement
To assess the reliability of a system, you have to collect data about its operation. The data required may include:

1) The number of system failures given a number of requests for system services. This is used to measure the POFOD and applies irrespective of the time over which the demands are made.

2) The time or the number of transactions between system failures plus the total elapsed time or total number of transactions. This is used to measure ROCOF and MTTF.

3) The repair or restart time after a system failure that leads to loss of service. This is used in the measurement of availability. Availability does not just depend on the time between failures but also on the time required to get the system back into operation.
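Ø A common way of combining these measurements is sketched below; the steady-state relationship A = MTTF / (MTTF + MTTR) is a standard assumption of mine here, not a formula given in the slides.

# Availability from mean time to failure (MTTF) and mean time to repair (MTTR),
# under the standard steady-state assumption A = MTTF / (MTTF + MTTR).
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

print(availability(mttf_hours=999.0, mttr_hours=1.0))  # 0.999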

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 31


Reliability Measurement
v The time units that may be used in these metrics are calendar time or a discrete unit such
as number of transactions.
Ø Use calendar time for systems that are in continuous operation.
Ø For example, monitoring systems, such as process control systems, fall into this category.
Ø Therefore, the ROCOF might be the number of failures per day.
Ø Systems that process transactions, such as bank ATMs or airline reservation systems, have variable loads placed on them depending on the time of day.

v Thus, the ROCOF would be the number of failed transactions per N thousand transactions.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 32


Reliability Testing
v Reliability testing is a statistical testing process that aims to measure the reliability of a system.

v Reliability metrics such as POFOD, the probability of failure on demand, and ROCOF, the rate of
occurrence of failure, may be used to quantitatively specify the required software reliability.
Ø This cannot normally be included as part of a normal defect testing process because data for defect testing is (usually) atypical of actual usage data.

v The process of measuring the reliability of a system is sometimes called statistical testing.

v The statistical testing process is explicitly geared to reliability measurement rather than fault
finding.

Identify operational profiles → Prepare test dataset → Apply tests to system → Compute observed reliability

Statistical testing for reliability measurement
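Ø The four stages above can be mimicked in a few lines (an illustrative sketch of mine, with a made-up operational profile and a stubbed system under test).

# Statistical-testing sketch: sample test inputs according to an operational
# profile, run the system, and compute the observed POFOD.
import random

def statistical_test(system, operational_profile: dict, n_tests: int) -> float:
    """operational_profile maps an input-class generator to its usage probability."""
    generators = list(operational_profile.keys())
    weights = list(operational_profile.values())
    failures = 0
    for _ in range(n_tests):
        generator = random.choices(generators, weights=weights, k=1)[0]
        test_input = generator()
        try:
            system(test_input)   # a failure-free run
        except Exception:
            failures += 1        # any raised failure counts against POFOD
    return failures / n_tests    # observed probability of failure on demand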

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 33


Reliability Measurement Problems
This conceptually attractive approach to reliability measurement is not easy to apply in practice.
The principal difficulties that arise are due to:

1) Operational profile uncertainty: The operational profiles based on experience with other systems may not be an accurate reflection of the real use of the system.

2) High costs of test data generation: It can be very expensive to generate the large volume of data required in an operational profile unless the process can be totally automated.

3) Statistical uncertainty: When high reliability is specified, you have to generate a statistically significant number of failures to allow accurate reliability measurements.

4) Recognizing failure: It is not always obvious when a failure has occurred, as there may be conflicting interpretations of a specification.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 34


Operational Profiles
v The operational profile of a software system reflects how it will be used in practice.

v It consists of a specification of classes of input and the probability of their occurrence. When a new
software system replaces an existing automated system, it is reasonably easy to assess the
probable pattern of usage of the new software.
Ø It should correspond to the existing usage, with some allowance made for the new functionality that is
(presumably) included in the new software.

v Typically, the operational profile is such that the inputs that have the highest probability of being
generated fall into a small number of classes.
[Figure: a typical operational profile, plotting the number of inputs against input classes.]
EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 35
Operational Profiles
v When a software system is new and innovative, however, it is difficult to anticipate how it will
be used.
v Consequently, it is practically impossible to create an accurate operational profile.

Developing an accurate operational profile may be difficult or impossible for the following
reasons:

1) A system may have many different users who each have their own ways of using the system.

2) Users change the ways that they use a system over time.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 36


Reliability Engineering – Recap
v It is often impossible to develop a trustworthy operational profile.
Ø If you use an out-of-date or incorrect operational profile, you cannot be confident about the accuracy of
any reliability measurements that you make.

v Software reliability can be achieved by avoiding the introduction of faults, by


Ø detecting and removing faults before system deployment, and
Ø by including fault-tolerance facilities that allow the system to remain operational after a fault has caused
a system failure.

v Reliability metrics include:


Ø probability of failure on demand (POFOD),
Ø rate of occurrence of failure (ROCOF), and
Ø availability (AVAIL).

v Dependable system architectures are system architectures including


Ø protection systems,
Ø self-monitoring architectures, and
Ø N-version programming.

EECIE20-S4305 Intelligence Systems & Software Engineering: Reliability Engineering Slide 37
