Basic Concepts of Reliability
A system without faults is considered to be highly reliable. Constructing a correct system is a difficult
task. Even an incorrect system may be considered to be reliable if the frequency of failure is
“acceptable.” The key concepts in discussing reliability are:
– Fault
– Failure
– Error
– Time
Failure
A failure is said to occur if the observable outcome of a program execution is different from the
expected outcome. It is something dynamic. The program has to be executing for a failure to occur.
The term failure relates to the behavior of the program; it includes such things as deficiencies in
performance attributes and excessive response time.
Fault
The adjudged cause of failure is called a fault. A fault is the defect in the program that, when
executed under particular conditions, causes a failure.
There can be different sets of conditions that cause failures, or the conditions can be repeated.
Hence a fault can be the source of more than one failure. A fault is a property of the program rather than a
property of its execution or behavior. It is what we are generally referring to when we use the
term "bug". A fault is created when a programmer makes an error.
One fault can cause more than one failure, depending upon how the system executes the faulty
code. Depending on whether a fault will manifest itself as a failure, faults can be classified into three types.
Software reliability focuses on faults that have the potential to cause failures: such faults are detected and
removed, and fault tolerance techniques are implemented to prevent faults from producing failures or to
mitigate the effects of the resulting failures.
Error
An error can be defined as an incorrect or missing human action that results in a system or component
containing a fault (i.e., an incorrect system). Examples include omission or misinterpretation of user
requirements in a product specification, incorrect translation, or omission of a requirement in the design
specification.
Figure: An error (made during development) results in a fault; during execution, the fault causes a failure. Faults concern correctness, while failures concern reliability.
Time
Time is a key concept in the formulation of reliability. If the time gap between two successive
failures is short, we say that the system is less reliable. Two forms of time are considered.
a) Execution time (τ): The execution time of a software system is the actual time spent by a
processor in executing the instructions of the software system. A processor may execute code
from several different programs, so the execution time of each program must be tracked
separately.
b) Calendar time (t): Calendar time is the time people normally experience, in terms of
years, months, weeks, days, etc. Expressing reliability against calendar time is useful
because it lets managers and developers see the date by which the system is expected to
attain its reliability objectives.
In other words, if you start a stopwatch when you start a program and check it when the
program is done, you are measuring calendar time; this is equivalent to "real" time.
If many things are running on the machine at the same time, the execution time can be
much less than the calendar time.
Similarly, if the program sits suspended while it waits for I/O, the execution time may be
low while the calendar time is high.
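To make the distinction concrete, the short Python sketch below (illustrative only; the function name busy_work and the timings are assumptions, not part of the original text) measures the same run with both clocks: time.process_time() reports execution (CPU) time, while time.time() reports calendar (wall-clock) time, so the sleep that stands in for I/O waiting inflates only the calendar figure.

```python
import time

def busy_work(n):
    # CPU-bound loop: consumes execution (processor) time
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.time()          # calendar ("real") time
cpu_start = time.process_time()   # execution time of this process only

busy_work(2_000_000)
time.sleep(1.0)                   # waiting (stand-in for I/O) adds calendar time but no CPU time

print(f"Calendar time : {time.time() - wall_start:.2f} s")
print(f"Execution time: {time.process_time() - cpu_start:.2f} s")
```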
Software Reliability
It is the probability that a system will operate without failure under given environmental conditions
for a specified period of time.
Reliability is measured over execution time so that it more accurately reflects system usage.
Reliability must be quantified so that software systems can be compared; a small numeric sketch follows the list below.
A user's perception of the reliability of software depends upon two categories of information:
– The faults present in the software
– The operational environment (the way users operate the system)
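As a minimal numeric sketch of this definition, assuming a constant failure rate (an exponential failure model, which the text does not prescribe), reliability over a period t can be computed as R(t) = exp(-λt); the rate value below is purely illustrative.

```python
import math

def reliability(failure_rate, t):
    """R(t) = exp(-failure_rate * t): probability of failure-free operation over [0, t],
    assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * t)

lam = 1 / 500.0                 # illustrative: on average one failure per 500 hours
print(reliability(lam, 8))      # ~0.984 -> likely to survive an 8-hour session
print(reliability(lam, 100))    # ~0.819 -> less likely over 100 hours
```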
Hardware Reliability
It is the ability of hardware to perform its functions for some period of time. It is usually expressed
as MTBF (mean time between failures).
Some of the important differences between software and hardware reliability can be listed in the
following table:
| Software Reliability | Hardware Reliability |
| --- | --- |
| Failures are primarily due to design faults. Repairs are made by modifying the design to make it robust against the conditions that can trigger a failure. | Failures are caused by deficiencies in design, production, and maintenance. |
| There is no wear-out phenomenon. Software errors occur without warning. "Old" code can exhibit an increasing failure rate as a function of errors induced while making upgrades. | Failures are due to wear-out or other energy-related phenomena. Sometimes a warning is available before a failure occurs. |
| There is no equivalent to preventive maintenance for software. | Repairs can be made that make the equipment more reliable through maintenance. |
| Reliability is not time dependent. Failures occur when the logic path that contains an error is executed. Reliability growth is observed as errors are detected and corrected. | Reliability is time related. Failure rates can be decreasing, constant, or increasing with respect to operating time. |
| Software design does not use standard components. | Hardware design uses standard components. |
System reliability
It is the probability that a system, including all hardware, firmware, and software, will satisfactorily
perform the task for which it was designed or intended, for a specified time and in a specified
environment.
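A hedged sketch of how hardware, firmware, and software reliabilities might be combined: assuming independent failures and a series structure (every part must work for the task to succeed), system reliability is the product of the component reliabilities. The figures below are invented for illustration.

```python
def series_system_reliability(*component_reliabilities):
    """Reliability of a series system: all components must work,
    and their failures are assumed independent."""
    result = 1.0
    for r in component_reliabilities:
        result *= r
    return result

# Hypothetical hardware, firmware and software reliabilities over the same mission time
print(series_system_reliability(0.999, 0.995, 0.98))  # ~0.974
```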
Operational Profile
The notion of operational profiles, or usage profiles, was developed at AT&T Bell Laboratories and
IBM independently.
The operational profile acts as a guideline for testing: it ensures that when testing is stopped,
the critical operations have been rigorously tested and reliability has attained the desired goal. Using an
operational profile allows us to quickly find the faults that have the greatest impact on system reliability. The
notion of operational profiles was originally created to guide test engineers in selecting test cases, in
deciding how much to test, and in deciding which portions of a software system should
receive more attention.
The operational profile of a system can be used throughout the life cycle of a software system
as a guiding document: in designing the user interface, by giving more importance to frequently used
operations, and in developing a version of the system for early release that contains the most frequently
used operations.
Actual usage of the system is quantified by developing an operational profile and is therefore
essential in any software reliability engineering process. For accurate measurement of the reliability
of a system, we should test the system in the same way as it will be used by actual users. Ideally, we
should strive to achieve 100% coverage by testing each operation at least once. Since software
reliability is very much tied with the concept of failure, software with better reliability can be
produced within a given amount of time by testing the more frequently used operations first. Use of
the operational profile as a guide for system testing ensures that if testing is terminated, and the
software is shipped because of imperative resource constraints, the most-used (or most critical)
operations will have received the most testing and the reliability will be maximized for the given
conditions/operations. It facilitates finding those faults early that have the biggest impact on
reliability. Quality objectives and operational profile are employed to manage resources and to guide
design, implementation, and testing of software.
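The sketch below illustrates the idea under stated assumptions: a hypothetical operational profile (the operation names and probabilities are invented, not taken from the text) is used to draw test cases so that frequently used operations receive proportionally more testing.

```python
import random

# Hypothetical operational profile: probability of occurrence of each operation
operational_profile = {
    "login":           0.30,
    "search_catalog":  0.40,
    "place_order":     0.20,
    "generate_report": 0.08,
    "admin_config":    0.02,
}

def draw_test_cases(profile, n, rng=random.Random(42)):
    """Select n operations to test, in proportion to their usage probabilities."""
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)

test_plan = draw_test_cases(operational_profile, 1000)
for op in operational_profile:
    # Frequently used operations dominate the test plan
    print(op, test_plan.count(op))
```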
Reliability Metrics
In software reliability engineering, reliability metrics are used to quantitatively express the reliability
of a software product. Besides being used for specifying and assessing software reliability, they are
also used by many reliability models as a main parameter. Identifying, choosing and applying
software reliability metrics is one of the crucial steps in measurement. Reliability metrics offer an
excellent means of evaluating the performance of operational software and controlling changes to it.
The following reliability metrics are used to quantify the reliability of a software product:
Mean Time To Failure (MTTF)
MTTF is the mean time for which a component is expected to be operational; it is the average time
between two successive failures, observed over a large number of failures. Only run time is considered
in the time measurements: the time for which the system is down to fix the error, the boot time, etc.
are not taken into account. An MTTF of 500 means that one failure can be expected every 500 time units.
The time units depend entirely on the system; they can even be specified as a number of transactions,
as is the case for database query systems. MTTF is relevant for systems with long transactions, i.e.,
where system processing takes a long time, and MTTF should be longer than the average transaction length.
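A minimal sketch of the computation, using invented inter-failure times (in execution hours): MTTF is simply the average of the observed times between successive failures.

```python
def mttf(interfailure_times):
    """Mean Time To Failure: average of the observed times between successive failures."""
    return sum(interfailure_times) / len(interfailure_times)

# Hypothetical execution times (hours) between successive failures observed during test
times = [120, 250, 320, 410, 400]
print(mttf(times))  # 300.0 -> on average one failure per 300 hours of execution
```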
Mean Time To Repair (MTTR)
MTTR expresses the mean active corrective maintenance time required to restore an item to an
expected performance level. This includes activities such as troubleshooting, dismantling,
replacement, restoration, and functional testing, but it does not include waiting times for resources.
In software, MTTR measures the average time it takes to track down the errors causing a failure and
then to fix them. Informally, it also measures the down time of a particular system.
Mean Time Between Failures (MTBF)
MTBF combines the two previous metrics (MTBF = MTTF + MTTR). In this case, time measurements are real
time, not execution time as in MTTF. Thus, an MTBF of 300 hours indicates that once a failure occurs,
the next failure is expected after 300 hours.
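A one-line sketch of this relationship, with invented figures: MTBF is the mean time to the next failure plus the mean time needed to repair the previous one.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean Time Between Failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours

print(mtbf(290, 10))  # 300 -> a new failure is expected roughly every 300 hours of real time
```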
Availability
– Measures the fraction of time the system is really available for use
– Takes repair and restart times into account
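Under the same assumptions as above, availability can be sketched as the ratio of up time to total time, i.e. MTTF / (MTTF + MTTR); the numbers are illustrative.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is usable."""
    return mttf_hours / (mttf_hours + mttr_hours)

print(availability(290, 10))  # ~0.967 -> the system is available about 96.7% of the time
```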
Probability of Failure on Demand (POFOD)
POFOD measures the likelihood of the system failing when a service request is made. Unlike the
other metrics discussed, this metric does not explicitly involve time measurements. A POFOD of
0.005 means that five out of a thousand service requests may result in failure.
POFOD is an important measure and should be kept as low as possible. It is appropriate for
systems that demand services at unpredictable or relatively long time intervals; for such systems,
reliability means the likelihood that the system will fail when a service request is made.
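A minimal sketch, assuming failures are simply counted over a batch of service requests: POFOD is the fraction of requests that fail.

```python
def pofod(failed_requests, total_requests):
    """Probability Of Failure On Demand: fraction of service requests that fail."""
    return failed_requests / total_requests

print(pofod(5, 1000))  # 0.005 -> five out of a thousand requests may result in failure
```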
Markovian Models
With the development of computer and information technology, the main functions of most equipment
and systems depend more and more on software, but the low reliability of software places a serious
constraint on its wide application and can even lead to disastrous results. A Markov model is used to
represent the architecture of the software and provides a means for analyzing its reliability. "A Markov
model is a mathematical system that undergoes transitions from one state to another, between a finite
or countable number of possible states. It is a random process usually characterized as memoryless:
the next state depends only on the current state and not on the sequence of events that preceded it."
Using a Markov model, a statistical model of the software is drawn wherein each possible use of the
software has an associated probability of occurrence. Test cases are drawn from the sample population
of possible uses according to the sample distribution and run against the software under test. Various
statistics of interest, such as the estimated failure rate and the mean time to failure of the software,
are computed.
Figure: A Simple Three-State Markov Model; transitions among States 1, 2, and 3 are labelled with probabilities 0.8, 0.2, 1, 0.9, and 0.1.
A usage model for a software system consists of states, i.e., externally visible modes of operation
that must be maintained in order to predict the application of all system inputs, and state transitions
that are labelled with system inputs and transition probabilities. To determine the state set, one
must consider each input and the information necessary to apply that input. It may be that certain
software modes cause an input to become more or less probable (or even illegal). Such a mode
represents a state or set of states in the usage chain. Once the states are identified, we establish a
start state and a terminate state, and draw a state transition diagram by considering the effect of each
input from each of the identified states. The Markov chain is completely defined when transition
probabilities are established that represent the best estimate of real usage.
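The sketch below shows one way such a usage chain could be exercised; the state names and transition probabilities are loosely modelled on the three-state figure above but are assumptions, not taken from the original diagram. Each step depends only on the current state (the memoryless property), and the generated sequences can serve as statistically representative test cases.

```python
import random

# Hypothetical three-state usage chain (transition probabilities are illustrative)
transitions = {
    "State 1": [("State 2", 0.8), ("State 3", 0.2)],
    "State 2": [("State 3", 1.0)],
    "State 3": [("State 1", 0.9), ("State 3", 0.1)],
}

def generate_usage_sequence(start, length, rng=random.Random(0)):
    """Random walk over the usage chain: the next state depends only on the current one."""
    sequence = [start]
    while len(sequence) < length:
        next_states, weights = zip(*transitions[sequence[-1]])
        sequence.append(rng.choices(next_states, weights=weights, k=1)[0])
    return sequence

print(generate_usage_sequence("State 1", 10))
```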
Markov models offer several advantages:
Coverage: Covered and uncovered failures of components are mutually exclusive events.
Complex Systems: Many simplifying techniques exist which allow the modelling of complex systems.
Sequenced Events: They allow computing the probability of an event resulting from a sequence of sub-events.
Disadvantage of Markov Models: The major drawback of the Markov model is the explosion in the number of
states as the size of the system increases. The resulting models are large and complicated.
Markov Chain: The simplest Markov model is the Markov chain. It models the state of a
system with a random variable that changes through time. In this context, the Markov
property suggests that the distribution for this variable depends only on the distribution of
the previous state.
Hidden Markov Model: A hidden Markov model is a Markov chain for which the state is only
partially observable. In other words, observations are related to the state of the system, but
they are typically insufficient to precisely determine the state.
Markov Decision Process: A Markov decision process is a Markov chain in which state
transitions depend on the current state and an action vector that is applied to the system.
Typically, a Markov decision process is used to compute a policy of actions that will
maximize some utility with respect to expected rewards.
Partially Observable Markov Decision Process: A partially observable Markov decision
process (POMDP) is a Markov decision process in which the state of the system is only
partially observed. Solving POMDPs exactly is computationally intractable in general, but
approximation techniques have made them useful for a variety of applications, such as
controlling simple agents or robots.
Binomial Model
The failure process of software for a binomial-type model is characterized by the behavior of
the rate of occurrence of failures during the execution of the software. The number of failures
follows a binomial distribution of probabilities, this being the reason for its characterization as a
binomial type model, according to Musa’s classification.
Here, the number of test data replaces time as the control variable, with the assumption that the
execution of one test datum is equivalent to one unit of execution time of the software; that is, the
measurement unit of software testing is the test datum. This is necessary because reliability is a
characteristic defined as a function of time. Thus, the failure rate of software is characterized by the
number of failures per test datum or per test data set.
A number of test data executed by the software results in a measured coverage percentage of the
software code. Moreover, the coverage percentage depends on the testing criterion used for
evaluation of the test data. Hence, the failure rate of the software is also related to the coverage
of the testing criterion achieved by execution of the test data.
In the initial stages of testing the failure rate is high and the test coverage is low as few test data
have been executed; in the final stages of the testing, the failure rate is low and the test coverage
is usually high.
Hence, the basic assumption of the binomial model is that the failure rate of the software is
directly proportional to the complement of the measured coverage achieved by execution of the
test data (a small numeric sketch follows the assumptions below). The model rests on the following assumptions:
I. The software is tested under conditions similar to those of its operational profile.
II. The probability of detection of any fault is determined by the same distribution.
III. Faults detected in each interval of test data are detected independently of each other.
IV. There are N faults in the software at the beginning of the test.
V. The test data are executed and the coverage of the elements required by the selection
criterion used in test data evaluation is calculated for each failure occurrence.
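A minimal sketch of the basic assumption stated above, with an invented initial failure rate: the failure rate decreases linearly as the measured coverage of the chosen criterion approaches 100%.

```python
def failure_rate(initial_rate, coverage):
    """Coverage-based assumption: failure rate is proportional to (1 - coverage)."""
    return initial_rate * (1.0 - coverage)

# Illustrative initial rate: 0.05 failures per test datum at 0% coverage
for c in (0.0, 0.25, 0.50, 0.75, 0.95):
    print(f"coverage={c:.2f} -> failure rate={failure_rate(0.05, c):.4f}")
```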