UNIT 3
Principal properties:
Availability: The probability that the system will be up and running and able to
deliver useful services to users.
Reliability: The probability that the system will correctly deliver services as
expected by users.
Safety: A judgment of how likely it is that the system will cause damage to people
or its environment.
Security: A judgment of how likely it is that the system can resist accidental or
deliberate intrusions.
Resilience: A judgment of how well a system can maintain the continuity of its
critical services in the presence of disruptive events such as equipment failure and
cyberattacks.
Repairability reflects the extent to which the system can be repaired in the event
of a failure;
Maintainability reflects the extent to which the system can be adapted to new
requirements;
Survivability reflects the extent to which the system can deliver services whilst
under hostile attack;
Error tolerance reflects the extent to which user input errors can be avoided and
tolerated.
Many dependability attributes depend on one another. Safe system operation depends on the
system being available and operating reliably. A system may be unreliable because its data has
been corrupted by an external attack. Denial of service attacks on a system are intended to
make it unavailable. If a system is infected with a virus, you cannot be confident in its reliability
or safety.
How to achieve dependability?
There are interactions and dependencies between the layers in a system and changes at one
level ripple through the other levels. For dependability, a systems perspective is essential.
Redundancy: Keep more than a single version of critical components so that if one fails then
a backup is available.
Diversity: Provide the same functionality in different ways in different components so that
they will not fail in the same way.
Redundant and diverse components should be independent so that they will not suffer from
'common-mode' failures.
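To make the idea of redundancy and diversity concrete, the sketch below (illustrative only; the three square-root versions and the voting function are not from the text) runs diversely implemented versions of the same computation and accepts the majority result, so a fault in one version need not become a system failure:

    # Illustrative only: three diverse implementations of the same function are run
    # and the majority result is accepted, so a fault in one version does not
    # become a system failure.
    import math
    from collections import Counter

    def sqrt_newton(x):                # version 1: Newton-Raphson iteration
        guess = x if x > 1 else 1.0
        for _ in range(50):
            guess = 0.5 * (guess + x / guess)
        return round(guess, 6)

    def sqrt_binary(x):                # version 2: binary search
        lo, hi = 0.0, max(x, 1.0)
        for _ in range(100):
            mid = (lo + hi) / 2
            if mid * mid < x:
                lo = mid
            else:
                hi = mid
        return round(lo, 6)

    def sqrt_library(x):               # version 3: standard library
        return round(math.sqrt(x), 6)

    def voted_sqrt(x):
        # Run the diverse versions and return the majority result.
        results = [sqrt_newton(x), sqrt_binary(x), sqrt_library(x)]
        value, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority - the diverse versions disagree")
        return value

    print(voted_sqrt(2.0))             # all three versions agree on 1.414214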
Process activities, such as validation, should not depend on a single approach, such as testing,
to validate the system. Redundant and diverse process activities are important especially for
verification and validation. Multiple, different process activities the complement each other
and allow for cross-checking help to avoid process errors, which may lead to errors in the
software.
Dependable software often requires certification so both process and product documentation
has to be produced. Up-front requirements analysis is also essential to discover requirements
and requirements conflicts that may compromise the safety and security of the system.
These conflict with the general approach in agile development of co-development of the
requirements and the system and minimizing documentation. An agile process may be defined
that incorporates techniques such as iterative development, test-first development and user
involvement in the development team. So long as the team follows that process and documents
their actions, agile methods can be used. However, additional documentation and planning are
essential so 'pure agile' is impractical for dependable systems engineering.
There are two related issues in achieving highly dependable software: developing highly
dependable software and demonstrating that it is highly dependable.
There are no development techniques which can guarantee that the resulting software system
will be highly dependable. In particular, there are no architectures which enable a highly
dependable system to be built out of components which are not themselves highly dependable.
The best that can be done to achieve a high degree of dependability is to limit complexity to
the minimum possible, to follow fault avoidance and fault detection practices rigorously, and
to design for fault tolerance where this does not conflict with the other principles.
There are no experimental techniques which can allow you to demonstrate that a software
system is highly dependable, without expending unrealistic resources. Post-hoc evaluation of
a product will always be inadequate for systems with high dependability requirements so, for
such systems, it will always be necessary to certify and assess the development process as well
as the end product.
If we are to avoid faults we need to be able to understand the full consequences of every design
decision. If we are to detect errors we introduce, our methods and tools must provide strong
support for verification. If we are to assess and certify the development process, it must allow
each line of code to be linked unequivocally to a clause in the specification, and vice versa.
Only mathematics can provide the rigour and lack of ambiguity which this requires, which is
why classical engineers carry out mathematical modelling of their systems.
Formal methods are the mathematics of computer systems engineering. Their role is to assist
the engineer in understanding the total possible behaviour of the system he or she is designing.
Reliability is the ability to operate failure-free. More specifically, it is defined as the probability
of a product to operate free of failures for a specified period of time under specified operating
conditions. This definition applies to products, services, systems, hardware, and also software.
Reliability is a quality attribute that is highly valued by customers and users of products and
services.
We need a consistent definition of reliability between software, hardware, and systems. This is
so they may be combined to determine the reliability of a product or service as a whole
delivered to a user. However, there are differences between hardware and software that affect
how we analyze their respective reliabilities. These differences are described.
Software reliability engineering (SRE) assesses how well software-based products and services
meet users' operational needs. SRE uses quantitative methods based on reliability measures to
do this assessment. The primary goal of SRE is to maximize customer satisfaction. SRE uses
such quantitative methods as statistical estimation and prediction, measurement, and modeling.
As the reliability of a product or service is highly dependent on operating conditions and the
reliability of software is related to how the software is used, the quantitative characterization
of the use of software is an integral part of SRE. This characterization of software use is
captured in an Operational Profile.
SRE has been used on a wide variety of products and services. SRE is not limited by what
software development methodology is used. There are no known theoretical or practical limits
on its application except possibly on very small projects. Software under 1,000 source lines
and possibly up to as high as 5,000 source lines may experience few operational failures.
Analysis using few failure events results in large error bounds when estimating reliability.
However, reliability estimation is only a part of SRE. The other parts of SRE such as
determining the operational profile and using it to guide development and test are still useful
for very small projects.
Reliability is the probability of failure-free system operation over a specified time in a given
environment for a given purpose. Availability is the probability that a system, at a point in
time, will be operational and able to deliver the requested services. Both of these attributes can
be expressed quantitatively e.g. availability of 0.999 means that the system is up and running
for 99.9% of the time.
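As a hedged illustration, availability is often approximated from mean time to failure (MTTF) and mean time to repair (MTTR); the figures below are invented:

    # One common approximation (figures invented): availability from mean time to
    # failure (MTTF) and mean time to repair (MTTR).
    mttf_hours = 999.0   # average operating time between failures
    mttr_hours = 1.0     # average time to restore service after a failure

    availability = mttf_hours / (mttf_hours + mttr_hours)
    print(f"Availability = {availability:.3f}")   # 0.999, i.e. up and running 99.9% of the time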
The formal definition of reliability does not always reflect the user's perception of a system's
reliability. Reliability can only be defined formally with respect to a system specification i.e.
a failure is a deviation from a specification. Users don't read specifications and don't know
how the system is supposed to behave; therefore, perceived reliability is more important in
practice.
Availability is usually expressed as a percentage of the time that the system is available to
deliver services e.g. 99.95%. However, this does not take into account two factors:
The number of users affected by the service outage. Loss of service in the middle
of the night is less important for many systems than loss of service during peak
usage periods.
The length of the outage. The longer the outage, the more the disruption. Several
short outages are less likely to be disruptive than 1 long outage. Long repair times
are a particular problem.
Removing X% of the faults in a system will not necessarily improve the reliability by X%.
Program defects may be in rarely executed sections of the code so may never be encountered
by users. Removing these does not affect the perceived reliability. Users adapt their behavior
to avoid system features that may fail for them. A program with known faults may therefore
still be perceived as reliable by its users.
Reliability is a measurable system attribute so non-functional reliability requirements may
be specified quantitatively. These define the number of failures that are acceptable during
normal use of the system or the time in which the system must be available. Functional
reliability requirements define system and software functions that avoid, detect or tolerate
faults in the software and so ensure that these faults do not lead to system failure. Software
reliability requirements may also be included to cope with hardware failure or operator error.
Reliability metrics are units of measurement of system reliability. System reliability is
measured by counting the number of operational failures and, where appropriate, relating these
to the demands made on the system and the time that the system has been operational. Metrics
include:
Probability of failure on demand (POFOD). The probability that the system will
fail when a service request is made. Useful when demands for service are
intermittent and relatively infrequent.
Rate of occurrence of failures (ROCOF). Reflects the rate of occurrence of failure
in the system. Relevant for systems where the system has to process a large number
of similar requests in a short time. Mean time to failure (MTTF) is the reciprocal of
ROCOF.
Availability (AVAIL). Measure of the fraction of the time that the system is
available for use. Takes repair and restart time into account. Relevant for non-stop,
continuously running systems.
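As a hedged illustration of how these metrics might be computed from operational records, the sketch below uses invented figures and variable names:

    # Invented observation data, used only to show how the metrics above are derived.
    demands = 10_000            # service requests observed
    failed_demands = 2          # requests on which the system failed
    operating_hours = 500.0     # total time the system was running
    failures_in_time = 4        # failures observed during that time
    downtime_hours = 0.5        # total repair/restart time in the period

    pofod = failed_demands / demands               # probability of failure on demand
    rocof = failures_in_time / operating_hours     # failures per operating hour
    mttf = 1 / rocof                               # mean time to failure (reciprocal of ROCOF)
    avail = (operating_hours - downtime_hours) / operating_hours

    print(f"POFOD = {pofod:.4f}, ROCOF = {rocof:.3f}/hour, "
          f"MTTF = {mttf:.1f} hours, AVAIL = {avail:.4f}")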
Functional reliability requirements include:
Checking requirements that identify checks to ensure that incorrect data is detected
before it leads to a failure.
Recovery requirements that are geared to help the system recover after a failure has
occurred.
Redundancy requirements that specify redundant features of the system to be
included.
Process requirements for reliability which specify the development process to be
used may also be included.
In critical situations, software systems must be fault tolerant. Fault tolerance is required where
there are high availability requirements or where system failure costs are very high. Fault
tolerance means that the system can continue in operation in spite of software failure. Even if
the system has been proved to conform to its specification, it must also be fault tolerant as there
may be specification errors or the validation may be incorrect.
Fault-tolerant systems architectures are used in situations where fault tolerance is essential.
These architectures are generally all based on redundancy and diversity. Examples of situations
where dependable architectures are used:
Flight control systems, where system failure could threaten the safety of passengers;
Reactor systems where failure of a control system could lead to a chemical or
nuclear emergency;
Telecommunication systems, where there is a need for 24/7 availability.
A protection system is a specialized system that is associated with some other control system,
which can take emergency action if a failure occurs, e.g. a system to stop a train if it passes a
red light, or a system to shut down a reactor if temperature/pressure are too high. Protection
systems independently monitor the controlled system and the environment. If a problem is
detected, it issues commands to take emergency action to shut down the system and avoid a
catastrophe. Protection systems are redundant because they include monitoring and control
capabilities that replicate those in the control software. Protection systems should be diverse
and use different technology from the control software. They are simpler than the control
system so more effort can be expended in validation and dependability assurance. The aim is to
ensure that there is a low probability of failure on demand for the protection system.
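A minimal sketch of this idea is shown below; the sensor reading, threshold, and function names are hypothetical and stand in for the independent monitoring and emergency-action channels described above:

    # Hypothetical names and threshold: a protection system that monitors the
    # controlled process independently of the control software and commands an
    # emergency shutdown when a safe limit is exceeded.
    MAX_SAFE_TEMPERATURE = 350.0          # illustrative limit, in degrees

    def read_temperature_sensor():
        # Stand-in for an independent sensor reading.
        return 352.4

    def shut_down_reactor():
        # Stand-in for the emergency action issued by the protection system.
        print("EMERGENCY SHUTDOWN commanded by protection system")

    def protection_cycle():
        temperature = read_temperature_sensor()
        if temperature > MAX_SAFE_TEMPERATURE:
            # The protection system acts regardless of what the control software does.
            shut_down_reactor()
            return True
        return False

    protection_cycle()                    # with the reading above, a shutdown is issued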
Good programming practices can be adopted that help reduce the incidence of program faults.
These programming practices support fault avoidance, detection, and tolerance.
Limit the visibility of information in a program
Program components should only be allowed access to data that they need for their
implementation. This means that accidental corruption of parts of the program state by
these components is impossible. You can control visibility by using abstract data types
where the data representation is private and you only allow access to the data through
predefined operations such as get() and put().
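A minimal sketch of this practice, assuming a simple queue component (the class and its capacity limit are invented for illustration):

    # Invented example: the queue's internal representation stays private and
    # callers reach it only through the predefined put() and get() operations.
    class BoundedQueue:
        def __init__(self, capacity):
            self._capacity = capacity     # leading underscore: internal state, not for callers
            self._items = []

        def put(self, item):
            if len(self._items) >= self._capacity:
                raise OverflowError("queue is full")
            self._items.append(item)

        def get(self):
            if not self._items:
                raise IndexError("queue is empty")
            return self._items.pop(0)

    q = BoundedQueue(capacity=2)
    q.put("reading-1")
    print(q.get())                        # "reading-1"; callers never touch q._items directly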
Check all inputs for validity
All programs take inputs from their environment and make assumptions about these
inputs. However, program specifications rarely define what to do if an input is not
consistent with these assumptions. Consequently, many programs behave unpredictably
when presented with unusual inputs and, sometimes, these are threats to the security of
the system. Consequently, you should always check inputs before processing against
the assumptions made about these inputs.
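A small hedged example of this practice; the setpoint name and its valid range are invented assumptions:

    # Hedged example: the setpoint name and its valid range 0-100 are invented.
    def read_setpoint(raw_value):
        # Accept only numeric setpoints within the assumed operating range.
        try:
            value = float(raw_value)
        except ValueError:
            raise ValueError(f"setpoint {raw_value!r} is not a number")
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"setpoint {value} is outside the valid range 0-100")
        return value

    print(read_setpoint("42.5"))          # accepted
    # read_setpoint("abc") or read_setpoint("1e9") would be rejected before processing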
Provide a handler for all exceptions
A program exception is an error or some unexpected event such as a power failure.
Exception handling constructs allow for such events to be handled without the need for
continual status checking to detect exceptions. Using normal control constructs to
detect exceptions needs many additional statements to be added to the program. This
adds a significant overhead and is potentially error-prone.
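A minimal sketch of this practice; the configuration file name and default value are hypothetical:

    # Hypothetical file name and default: the exception handler deals with the
    # unexpected event in one place instead of scattering status checks.
    def load_configuration(path="config.txt"):
        try:
            with open(path) as f:
                return f.read()
        except FileNotFoundError:
            # Recover with a safe default rather than failing unpredictably.
            return "DEFAULT-CONFIGURATION"
        except OSError as err:
            # Any other I/O problem is reported from a single handler.
            raise RuntimeError(f"could not load configuration: {err}")

    print(load_configuration())           # prints the default if config.txt is absent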
Minimize the use of error-prone constructs
Program faults are usually a consequence of human error because programmers lose
track of the relationships between the different parts of the system. This is exacerbated
by error-prone constructs in programming languages that are inherently complex or that
don't check for mistakes when they could do so. Therefore, when programming, you
should try to avoid or at least minimize the use of these error-prone constructs.
Error-prone constructs:
The current methods of software reliability measurement can be divided into four
categories:
1. Product Metrics
Product metrics are those measured from the artifacts that are built, i.e., requirement specification
documents, system design documents, etc. These metrics help in assessing whether the product
is good enough through records of attributes such as usability, reliability, maintainability and
portability. These measurements are taken from the actual body of the source code.
2. Project Metrics
Project metrics define project characteristics and execution. If there is proper management of
the project by the programmer, then this helps us to achieve better products. A relationship
exists between the development process and the ability to complete projects on time and within
the desired quality objectives. Costs increase when developers use inadequate methods. Higher
reliability can be achieved by using a better development process, risk management process,
and configuration management process.
3. Process Metrics
Process metrics quantify useful attributes of the software development process & its
environment. They tell if the process is functioning optimally as they report on characteristics
like cycle time & rework time. The goal of process metrics is to do the right job the first time
through the process. The quality of the product is a direct function of the process. So process
metrics can be used to estimate, monitor, and improve the reliability and quality of software.
Process metrics describe the effectiveness and quality of the processes that produce the
software product.
Examples are:
4. Fault and Failure Metrics
A fault is a defect in a program which appears when the programmer makes an error and causes
a failure when executed under particular conditions. These metrics are used to measure the
failure-free execution of software.
To achieve this objective, a number of faults found during testing and the failures or other
problems which are reported by the user after delivery are collected, summarized, and
analyzed. Failure metrics are based upon customer information regarding faults found after
release of the software. The failure data collected is therefore used to calculate failure
density, Mean Time between Failures (MTBF), or other parameters to measure or predict
software reliability.
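As a hedged illustration (all figures invented), failure density and MTBF might be derived from such post-release failure data as follows:

    # Invented figures showing how failure density and MTBF are derived from
    # post-release failure reports.
    failures_reported = 12            # failures reported by customers after release
    size_kloc = 40.0                  # delivered size in thousands of source lines
    field_operating_hours = 6_000.0   # accumulated operating time across installations

    failure_density = failures_reported / size_kloc           # failures per KLOC
    mtbf_hours = field_operating_hours / failures_reported    # mean time between failures

    print(f"Failure density = {failure_density:.2f} failures/KLOC, MTBF = {mtbf_hours:.0f} hours")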
Safety engineering processes are based on reliability engineering processes. Regulators may
require evidence that safety engineering processes have been used in system development.
Agile methods are not usually used for safety-critical systems engineering. Extensive process
and product documentation is needed for system regulation, which contradicts the focus in agile
methods on the software itself. A detailed safety analysis of a complete system specification is
important, which contradicts the interleaved development of a system specification and
program. However, some agile techniques such as test-driven development may be used.
Process assurance involves defining a dependable process and ensuring that this process is
followed during the system development. Process assurance focuses on:
Do we have the right processes? Are the processes appropriate for the level of
dependability required? They should include requirements management, change
management, reviews and inspections, etc.
Are we doing the processes right? Have these processes been followed by the
development team?
Process assurance is important for safety-critical systems development: accidents are rare
events so testing may not find all problems; safety requirements are sometimes 'shall not'
requirements so cannot be demonstrated through testing. Safety assurance activities may be
included in the software process that record the analyses that have been carried out and the
people responsible for these.
Safety-related process activities:
Formal methods can be used when a mathematical specification of the system is produced.
They are the ultimate static verification technique that may be used at different stages in the
development process. A formal specification may be developed and mathematically analyzed
for consistency. This helps discover specification errors and omissions. Formal arguments that
a program conforms to its mathematical specification may be developed. This is effective in
discovering programming and design errors.
Advantages of formal methods
Producing a mathematical specification requires a detailed analysis of the requirements
and this is likely to uncover errors. Concurrent systems can be analyzed to discover
race conditions that might lead to deadlock. Testing for such problems is very difficult.
They can detect implementation errors before testing when the program is analyzed
alongside the specification.
Disadvantages of formal methods
Require specialized notations that cannot be understood by domain experts. Very
expensive to develop a specification and even more expensive to show that a program
meets that specification. Proofs may contain errors. It may be possible to reach the same
level of confidence in a program more cheaply using other V & V techniques.
Model checking involves creating an extended finite state model of a system and, using a
specialized system (a model checker), checking that model for errors. The model
checker explores all possible paths through the model and checks that a user-specified
property is valid for each path. Model checking is particularly valuable for verifying concurrent
systems, which are hard to test. Although model checking is computationally very expensive,
it is now practical to use it in the verification of small to medium sized critical systems.
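The sketch below is a toy illustration of the underlying idea rather than a real model checker: it exhaustively explores the reachable states of a small invented model and checks a user-specified property in each one. In this deliberately naive two-traffic-light model the check fails, showing how a violating state is reported:

    # Toy model checker: exhaustively explore every reachable state of a small
    # invented model and check a user-specified property in each one.
    from collections import deque

    INITIAL = ("red", "green")            # state = (light_A, light_B) at a junction

    def successors(state):
        a, b = state
        step = {"green": "yellow", "yellow": "red", "red": "green"}
        # Either light may change at each step (interleaving, as in concurrent systems).
        return [(step[a], b), (a, step[b])]

    def safety_property(state):
        # Property to verify: both lights must never be green at the same time.
        return state != ("green", "green")

    def model_check(initial):
        seen, frontier = {initial}, deque([initial])
        while frontier:
            state = frontier.popleft()
            if not safety_property(state):
                return f"property violated in state {state}"
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return f"property holds in all {len(seen)} reachable states"

    # This naive model lets either light turn green independently, so the checker
    # reports the unsafe state ('green', 'green').
    print(model_check(INITIAL))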
Static program analysis uses software tools for source text processing. They parse the
program text and try to discover potentially erroneous conditions and bring these to the
attention of the V & V team. They are very effective as an aid to inspections - they are a
supplement to but not a replacement for inspections.
Three levels of static analysis:
Characteristic error checking
The static analyzer can check for patterns in the code that are characteristic of errors
made by programmers using a particular language.
User-defined error checking
Users of a programming language define error patterns, thus extending the types of
error that can be detected. This allows specific rules that apply to a program to be
checked.
Assertion checking
Developers include formal assertions in their program that describe relationships that must hold.
The static analyzer symbolically executes the code and highlights potential problems.
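A small hedged example of assertion checking; the withdrawal function and its invariants are invented, and here the assertions are simply executed, whereas an assertion-checking tool would analyze them without running the program:

    # Invented account example: the developer states relationships that must hold.
    def withdraw(balance, amount):
        assert amount > 0, "withdrawal amount must be positive"
        assert amount <= balance, "cannot withdraw more than the balance"
        new_balance = balance - amount
        assert new_balance >= 0, "balance must never become negative"
        return new_balance

    print(withdraw(100.0, 30.0))          # 70.0; withdraw(100.0, 150.0) would be flagged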
Static analysis is particularly valuable when a language such as C is used which has weak
typing and hence many errors are undetected by the compiler. Particularly valuable for security
checking - the static analyzer can discover areas of vulnerability such as buffer overflows or
unchecked inputs. Static analysis is now routinely used in the development of many safety and
security critical systems.
A safety case is a structured argument, supported by evidence, intended to justify that a system
is acceptably safe and that any risk of danger or damage has been made as low as reasonably
practicable (ALARP). In industries like transportation and medicine, safety cases are mandatory
and legally binding. Safety cases tend to be presented as a document of textual information and
requirements accompanied by a graphical notation. The most popular graphical notation is the
Goal Structuring Notation (GSN). Even though it is a requirement in the automotive standard
ISO 26262, the GSN notation is not especially complex. It basically sets out the goals, the
strategies justifying the claims and evidence, and the solutions that show each goal is met.
The elements of the Goal Structuring Notation have a symbol plus a count and are drawn inside a
shape. They are as follows (N represents a number that increases by one for each new element of
that kind):
A goal G(N), shown as a rectangle, sets out an objective or sub-objective of the safety
case.
A strategy S(N), represented as a parallelogram, describes the process or inference
between a goal and its supporting goal(s) and solutions.
A solution Sn(N), shown inside a circle, presents a reference or item of evidence.
A context C(N), shown as a rectangle with rounded edges, defines the limits that
apply to the outlined structure.
A justification J(N), rendered as an oval, presents a rationale or logical statement.
An assumption A(N), also rendered as an oval, presents an intentionally
unsubstantiated statement.
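As a hedged illustration only (this is not the ASTHA-GSN tool), the sketch below models GSN elements as a small data structure with the automatic sequential numbering described above; all names are invented:

    # Invented sketch, not the ASTHA-GSN tool: GSN elements with automatic
    # sequential numbering (G1, G2, ..., S1, ..., Sn1, ...).
    from dataclasses import dataclass, field
    from itertools import count

    _counters = {}                         # one running counter per element kind

    def next_id(prefix):
        _counters.setdefault(prefix, count(1))
        return f"{prefix}{next(_counters[prefix])}"

    @dataclass
    class GsnElement:
        kind: str                          # "Goal", "Strategy", "Solution", ...
        text: str
        identifier: str = ""
        children: list = field(default_factory=list)

        def __post_init__(self):
            prefixes = {"Goal": "G", "Strategy": "S", "Solution": "Sn",
                        "Context": "C", "Justification": "J", "Assumption": "A"}
            self.identifier = next_id(prefixes[self.kind])

    g1 = GsnElement("Goal", "The braking system is acceptably safe")
    s1 = GsnElement("Strategy", "Argue over each identified hazard")
    sn1 = GsnElement("Solution", "Hazard analysis report and test results")
    g1.children.append(s1)
    s1.children.append(sn1)
    print(g1.identifier, s1.identifier, sn1.identifier)   # G1 S1 Sn1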
So, considering how a safety case in the GSN notation is structured, any program that can make
diagrams, such as Microsoft Visio or a mind-mapping tool, could work. But there is a tool built
specifically for this, called ASTHA-GSN. It has a student license, and the tool has some major pluses:
A simple easy to use user interface
It will track the number of structures you have placed and sequentially number them.
Your first goal will automatically be G1, and then G2 and so on
It follows the structure, and knowing you are building a GSN safety case it will tell
you if what you are connecting is incorrect.
It lets you apply a color scheme:
o All goals as blue
o Strategies as green
o Solutions as red
o Contexts as yellow
o Justifications and arguments as white and grey
In the pictures you can see an example of each structure and a practical example.