Experience with Fault Injection Experiments for FMEA
L. Grunske et al.
Software: Practice and Experience 2011; 41:1233–1258
SUMMARY
Failure Modes and Effects Analysis (FMEA) is a widely used system and software safety analysis
technique that systematically identifies failure modes of system components and explores whether these
failure modes might lead to potential hazards. In practice, FMEA is typically a labor-intensive team-based
exercise, with little tool support. This article presents our experience with automating parts of the FMEA
process, using a model checker to automate the search for system-level consequences of component
failures. The idea is to inject runtime faults into a model based on the system specification and check
if the resulting model violates safety requirements, specified as temporal logical formulas. This enables
the safety engineer to identify if a component failure, or combination of multiple failures, can lead to a
specified hazard condition. If so, the model checker produces an example of the events leading up to the
hazard occurrence which the analyst can use to identify the relevant failure propagation pathways and
co-effectors. The process is applied to three medium-sized case studies modeled with Behavior Trees.
Performance metrics for SAL model checking are presented. Copyright © 2011 John Wiley & Sons, Ltd.
KEY WORDS: behavior trees; failure modes and effects analysis; fault injection experiments; model
checking
1. INTRODUCTION
The development of safety critical systems demands assurance that the system does not pose harm
for people or the environment, even if one or more of the system’s components fail [1, 2]. The
related assurance process is known as hazard analysis. The goal of this process is to give evidence
that a system design avoids hazardous states in the presence of component failures. This involves
identifying cause–consequence relationships between component failure modes and hazardous
conditions at the system level. The process is typically conducted as an alternating forward and
backward search, between causes and effects [3]. Backward search tries to identify all the relevant
causes for each hazard; a commonly used technique is Fault Tree Analysis (FTA) [4, 5]. Forward
search identifies the consequences of already identified failure modes and is most commonly
performed by a systematic technique known as Failure Modes and Effects Analysis (FMEA) [6],
which is the main topic of this paper.
Correspondence to: Lars Grunske, Swinburne University of Technology, Faculty of ICT, Hawthorn, VIC 3122, Australia.
Saad Zafar was at Griffith University when working on this project.
The remainder of this article is organized as follows. Section 2 introduces the preliminaries, namely the Behavior Tree notation and model checking with the SAL tool suite. In Section 3, we introduce our overall approach for tool-supported FMEA. Section 4 documents in a step-by-step fashion how this approach is applied to a system, namely the industrial metal press. We report on the results for two further systems, the mine pump and the ambulatory infusion pump (AIP), in Section 5. Section 6 discusses the practicability and usefulness of using this approach, and issues of scalability. Finally, we conclude and discuss future work in Section 7.
2. PRELIMINARIES
Figure 1. Different node types of the BT syntax: (a) state realization; (b) selection; (c) guard; (d) internal input event; (e) internal output event; (f) external input event; and (g) external output event.
(d–e) an internal event modeling communication and data flow between components within the
system, where B specifies an event; the control flow can pass the internal input event
node when the event occurs (the message is sent), otherwise it is blocked until it occurs;
(f–g) an external event modeling communication and data flow between the system and its
environment, where B specifies an event; the control flow can pass the external input event
node when the event occurs (the message is sent), otherwise it is blocked until it occurs.
A node can also be labeled with one or more flags, used to indicate control flow. In Figure 1
Flag is used as a placeholder for a set of flags. Flags often refer to a matching node: that is, a
node of the same type and with the same component name and behavior. A flag can be either:
1. a reversion ˆ in the case where the node is a leaf node, indicating that the control flow loops
back to the matching node;
2. a reference node =>, indicating that the flow continues from the matching node;
3. a synchronization point =, where the control flow waits until all other threads with a matching
synchronization point have reached the synchronization point; or
4. killing of a thread −−, which kills any thread that starts with the matching node.
Nodes also have tags, which enable tracing back to requirements identifiers in the original
system functional requirements specification. Since tags have no consequences for the semantics
of the formal model they will largely be omitted from the models given below.
Control flow in the system is modeled by either a normal or a branching edge. Figure 2 shows
the different types of normal edges. As an example, we use state realization nodes in the figure,
however, there are no restrictions on the node types. A normal arrowed edge models sequential flow
between two steps (Figure 2(a)). If two nodes are drawn together (i.e. with no edge in between, see Figure 2(c)), the two steps occur together atomically, i.e. simultaneously to all intents and purposes, at this level of system description.
Figure 3 shows the two types of branching edges using two branches only. In general, however,
a branching can involve more than two branches. Branching behavior is either concurrent or
alternative. Concurrent branching (Figure 3(a)) models threads running in parallel (Figure 3(a)
only depicts the first node of each thread). As an example the threads in the figure start with a
guard node. The branches, however, can start with any node type.
In alternative branching (Figure 3(b)), the control flow follows only one of the branches.
Alternative branches can either comprise selections only (for example, as shown in Figure 3(b)) or
no selections at all. Alternative branching over selections operates as a non-deterministic choice
over the branches with a satisfied selection condition Bi. If none of the selections is satisfied, the thread terminates.
Figure 3. Branching behavior in the BT syntax: (a) concurrent flow and (b) alternative flow.
When translating BT models into SAL code, each state realization node and each internal/external output node is translated into the update of a SAL action; the actions are automatically labeled A1, A2, etc. Each selection and event node is translated into the guard of an action. Additionally, a
‘program-counter’ variable pc is introduced into (the guard and update of) each action in order to
control the sequence in which transitions are to be fired. Internal events are modeled by Boolean
variables which are set to true when the input is sent (by an internal output node) and are set
to false either if the input is consumed (by the matching internal input node) or if another step
has been taken and the event has not been consumed. This reflects that events are transient and
might be missed. External input events become SAL input variables which are not controlled
by the model and behave arbitrarily. To render the transition relation of the SAL model total (a
prerequisite for model checking) we add an ELSE action which is enabled if no other action is.
That is, all paths in the model are infinite.
Our SAL translator includes an option for giving priority to internal actions over external actions.
An external action is one involving an external input or output event in the BT model; all other
actions are internal actions. An example of generated SAL code (in the context of FMEA) is
shown in Figure 8. More detail about the translation is given in [16] and a formal definition of the
translational semantics can be found in [23].
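As an illustration of this translation scheme, the following is a minimal, hand-written SAL sketch of a single sensor thread. The identifiers are ours, and the code actually generated by the translator (such as that shown in Figure 8) differs in detail:

  press_sketch: CONTEXT =
  BEGIN
    SensorState: TYPE = {high, low};

    top_sensor: MODULE =
    BEGIN
      INPUT atTop : BOOLEAN            % external input event, not controlled by the model
      LOCAL pc1 : [0..1]               % 'program counter' controlling the order of the nodes
      LOCAL sensor : SensorState
      LOCAL sensorHighMsg : BOOLEAN    % internal event: message to the controller
      INITIALIZATION
        pc1 = 0; sensor = low; sensorHighMsg = FALSE
      TRANSITION
      [
        A1: pc1 = 0 AND atTop -->      % event node: can only fire when the event occurs
              sensor' = high;          % state realization node
              pc1' = 1
        []
        A2: pc1 = 1 -->
              sensorHighMsg' = TRUE;   % internal output node: send message to the controller
              pc1' = 0                 % reversion back to the start of the thread
        []
        ELSE -->                       % keeps the transition relation total
      ]
    END;
  END

Here the pc1 variable enforces the order of the BT nodes within the thread, and the ELSE action ensures that every state has a successor, as required by the model checker.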
From the SAL tool suite we use a symbolic model checker for LTL. This tool checks if a system,
modeled in the SAL language, satisfies a given property, specified in LTL. In the case where the
property is violated the tool outputs a counter-example, i.e. a sequence of states that leads to the
violation. Symbolic model checkers, such as the SAL tool, use a graph structure, called Binary
Decision Diagrams (BDDs) [31], to represent the model internally. This representation is symbolic,
which means that the tool does not store the information about states and state transitions explicitly.
Therefore, the generation of a counter-example in the case of a violation is a separate step that is
triggered once the model checking algorithm reports a violation of the checked property. Moreover,
specific to symbolic model checking is that the runtime of the checker is not always proportional to the size of the state space of the model. Instead, the size of the BDD representation gives an indication of the size of the model that is processed. Some models can be represented more succinctly than others, depending on factors such as the interdependencies between the variables. These factors
will influence how we can interpret the statistical results from our experiments.
Reachability of hazardous states can easily be modeled in LTL using operators for next state X,
always G, until U, and eventually F. For example, a simple traffic-light system needs to ensure
the following requirement: if the button is pressed then eventually the light will turn green. This
requirement can be formalized in LTL as follows:

G((button = pressed) ⇒ F(light = green))
The temporal operator G ensures that the desired behavior is always the case (in every possible run of the system), whereas F expresses that at some state in the future the desired change will happen. The symbol ⇒ denotes implication. In some cases, however, it might not be sufficient
if the desired state is reached eventually since this can be at any time in the future. If we want
to know that it only takes a fixed number of steps to reach the goal the formula has to be more
specific. The following formula gives an example:

G((button = pressed) ⇒ X(X(light = green)))
This formula models that after the button has been pressed it takes two steps to set the light to green. Since the next operator X models that something occurs in the next state, the combination of two next operators (i.e. X(X(. . .))) models that something occurs in the state after the next state, i.e. in two steps.
Another example of a typical LTL formula is given by the following requirement: The car does not go before the light turns green. This can be formalized using the until operator U:

G((car = stop) U (light = green))
This formula states that it is always the case (G) that the car is in state stop until (U) the light
is in state green. Implicitly, this also requires that the light will be green eventually. That is, a
behavior in which the light will never be green violates this formula. Note that until is useful for
formalizing properties where something occurs after a number of internal steps, where the exact
number of steps may be indeterminate (e.g. because it includes steps of components that have no
bearing on the property in question).
Apart from temporal operators, LTL supports propositional logical operators, such as ∧ (and),
∨ (or), ¬ (negation), and ⇒ (implication).
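In SAL, such properties are attached to a module as named theorems and checked with the symbolic model checker (sal-smc). A minimal, self-contained sketch with an illustrative traffic module (the module and variable names are ours, not taken from the case-study models):

  traffic_sketch: CONTEXT =
  BEGIN
    Colour: TYPE = {red, green};

    traffic: MODULE =
    BEGIN
      INPUT buttonPressed : BOOLEAN
      LOCAL light : Colour
      INITIALIZATION light = red
      TRANSITION
      [
        buttonPressed AND light = red --> light' = green
        []
        ELSE -->
      ]
    END;

    % the first example formula from the text, stated as a SAL theorem
    lightLiveness: THEOREM traffic |- G(buttonPressed => F(light = green));
  END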
[Figure (process overview): the BT model M and its failure views Fv1, Fv2, Fv3 are translated to SAL code, the safety conditions are formalized as LTL formulae, and cause–consequence analysis is performed by model checking (MC).]
a set of LTL formulae. After (manually) identifying the components’ failure modes (4) the user
can create the (BT) failure views of the model (5) using the support provided by the BTE tool.
This completes the inputs required for the checking process and the model checker automatically
checks the safety requirements (6) (a plain box indicates full automation of the step). The output
of the model checker is analyzed manually by the user and failure consequences are identified (7).
Countermeasures can be chosen (8) and integrated into the BT model (9) using the BTE tool. The
process then repeats the analysis.
As noted above, our approach goes beyond the traditional FMEA in supporting effects analysis
of multiple failures. Failure views corresponding to multiple failures can be constructed with the
aid of the BTE tool. However, it is left to the analyst to decide how best to handle the potential
combinatorial explosion in the number of failure views to be considered. One approach would be to
do a breadth-first sweep, pruning the experiment model at cut sets (i.e. combinations of component failures that give rise to hazards). Another approach to reduce the number of failure views for
multiple failure modes is described in [7]. In this approach each failure mode is associated with a
likelihood and multiple failure modes are only selected where the combined likelihood is greater
than a predefined threshold. Obviously, one assumption for this approach is that all failure modes
are statistically independent.
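Under this independence assumption, the combined likelihood of a candidate combination of failure modes f1, . . . , fk is simply the product of the individual likelihoods, so the selection criterion of [7] can be written as follows (notation ours):

p(f1) × p(f2) × · · · × p(fk) ≥ p_threshold

where p(fi) denotes the likelihood associated with failure mode fi and p_threshold the predefined threshold.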
[Figure 6: BT threads of the sensors, with internal input events > atTop < and > atBottom <, state realizations [high] and [low], and internal output events < sensor_high > and < sensor_low >.]
For example, Figure 6 shows how an omission failure can be injected into the BT model of
a sensor—in this case, where a sensor has failed stuck high: i.e. it is not able to realize its state
low thereafter and does not send the low message. The corresponding state realization node in
the BT model has to be eliminated, and also any corresponding output nodes with which other
components are notified about the sensor’s change to state low. Figure 6 shows the BT thread with
two shaded nodes, the state realization and the internal output, which are to be deleted to generate
the BT of the failure view.
The BTE tool (see Section 3.1) provides support for generating a failure view in the following
way: Each node in the BT model can be annotated by a set of view identifiers to specify the views
the node is part of (including the fault-free view). To generate a new failure view V the user simply
specifies which of the existing nodes will be in the view and annotates them with the identifier
for V, then adds new nodes into the BT model as needed, with appropriate transitions. Menu options in the tool facilitate this process, for example by annotating all nodes with V in a single step, so that the user simply needs to remove the annotation from the nodes corresponding to removed behavior when creating the failure view for an omission failure.
As noted above, the BT approach to system modeling makes fault injection relatively straightfor-
ward compared to the more traditional models such as Statecharts [13]. This is even more evident
when it comes to injection of multiple independent component failure modes, since independent
functions map to different parts of the tree, meaning that interference does not need to be explicitly
considered. Of course, failure of one component may mean that another component is misled into
behaving incorrectly—a so-called command failure—but such consequences ‘emerge’ out of the
resulting BT model, rather than having to be added explicitly to the model by the analyst, as
happens in most other approaches.
[Figure 7: four failure views (1–4) and the failure-view transition matrix with transitions 1 → 2, 1 → 3, 2 → 4 and 3 → 4.]
Figure 7 shows an example of how the user specifies the failure-view transition matrix
(as supported in the BTE tool). In the example the user intends to analyze a double-point of
failure of the system. For this purpose four different failure views have been generated by the user
(1–4): 1 represents the fault-free view; 2 and 3 represent single-failure views, where two different
failure modes are present (say, f1 of component C1 and f2 of component C2, respectively); and 4 where both f1 and f2 are present. The failure-view transition matrix now specifies the different
orders in which the failure views may occur: Transition 1 → 2 corresponds to the failure of
component C1 occurring first, for example, and transition 2 → 4 corresponds to the subsequent
failure of component C2. The user also intends to investigate the reverse order: first failure of
component C2 occurring (transition 1 → 3) followed by a failure of component C1 (transition
3 → 4). Because the model checker considers all possible paths through the corresponding BTs,
the analysis covers the occurrence of the two failures of the components at any possible time
during system operation.
This approach gives the analyst a lot of control over which fault combinations are considered.
For example, it is possible to investigate the effects of stepwise degradation of a component’s
behavior in this way, by using separate failure views for the different levels of (faulty) behavior
and transitions between them. It is also possible to examine what would happen if one component
fault was corrected before another one, to check if hazards occur during maintenance for example.
The translation of the extended model, consisting of the individual BT models and the failure-
view transition matrix, into SAL proceeds in essentially the same manner as for individual BT
models, as explained in Section 2.2 above. The main change is that the guard in each action
is extended with details of the views in which the action is applicable. Thus for example,
actions A1 and A2 are enabled in both views, since they represent behavior that was present
in both the fault-free BT and the failure view. Actions A3 and A4 are enabled only during
fault-free behavior: the sensor correctly changes state (sensor’=low) and transmits a message
(sensorLowMsg’=true). The program counter variable is updated (pc1’=1) to implement
the reversion in action A4. Action A5 is enabled only when the fault is present; the sensor does not
change state nor sends a message. Instead, the behavior progresses back to the beginning (action
A1 is re-enabled). The last action, A6, represents the view transition, from the normal view to the
failure view; this may occur at any time during the operation of the system.
This SAL code is generated automatically according to the rules defined in [23]. This functionality, including customization of the translation, is available to the user via a menu option in the BTE tool (see Section 3.1).
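The structure described above can be illustrated with a small hand-written sketch. The action labels loosely follow the description of Figure 8, but the view name, variable names and exact bodies are ours; the generated code differs in detail:

  press_views: CONTEXT =
  BEGIN
    View: TYPE = {normal, stuckHigh};
    SensorState: TYPE = {high, low};

    sensor_with_views: MODULE =
    BEGIN
      INPUT notAtTop : BOOLEAN            % external event: plunger has left the top
      LOCAL view : View                   % currently active (failure) view
      LOCAL pc1 : [1..3]
      LOCAL sensor : SensorState
      LOCAL sensorLowMsg : BOOLEAN
      INITIALIZATION
        view = normal; pc1 = 1; sensor = high; sensorLowMsg = FALSE
      TRANSITION
      [
        A1: pc1 = 1 AND notAtTop -->          % enabled in both views
              pc1' = 2
        []
        A3: view = normal AND pc1 = 2 -->     % fault-free view only: realize state low
              sensor' = low; pc1' = 3
        []
        A4: view = normal AND pc1 = 3 -->     % fault-free view only: notify the controller
              sensorLowMsg' = TRUE; pc1' = 1  % reversion back to the matching node
        []
        A5: view = stuckHigh AND pc1 = 2 -->  % omission failure: no state change, no message
              pc1' = 1
        []
        A6: view = normal -->                 % view transition: the fault may occur at any time
              view' = stuckHigh
        []
        ELSE -->
      ]
    END;
  END

A transition matrix with several views, as in Figure 7, simply adds one such view-transition action per allowed transition.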
for example, the changes in the plunger state typically take much longer than internal transitions
in the control system components (see Section 4).
The prioritization feature is built into the translation step as an option and can be seen as
an alternative interpretation of the model. But since it eliminates some behaviors, it might lead
to unsound results. Therefore, we use this facility only in those cases where we have found a
violation of the property without the prioritization but are not happy with the counter-example.
The prioritization of internal communication then filters out undesired counter-examples and can
lead to a more interesting counter-example.
In the following we apply our approach for tool-supported FMEA to a case study, namely the
industrial metal press§. The industrial metal press is a system for compressing sheets of metal
into body parts for vehicles. When the press is turned on, a plunger begins rising, with the aid
of an electric motor. The operator may then load the metal into the press. When the plunger
reaches the top of the press, the operator can then push and hold a button, which turns off the
motor, causing the plunger to fall. For the safety of the operator, the button is located a safe
distance away from the falling plunger. When the plunger reaches the bottom, thereby compressing
the metal, the motor is automatically turned on. The plunger begins rising again, repeating the
cycle.
The operator may abort the fall of the plunger by releasing the button, but only if the plunger has not yet reached the ‘point of no return’, also referred to as the PONR. It is dangerous to turn on
the motor when the plunger is falling beyond this point, as the momentum of the plunger could
cause the motor to explode, exposing the operator to flying parts. Figure 9 shows the physical
architecture of the press. See [10] for more detail.
§ All details of this case study, as well as the full model, generated SAL code and counter-examples, are available via http://www.itee.uq.edu.au/∼dccs/FMEA.
[Figure 9: physical architecture of the press, with top, PoNR and bottom sensors and the motor connected to the PLC via serial and parallel interfaces, plus hydraulic clutches. Below it, part of the BT model of the press, with requirement tags R1–R6 covering the plunger, button, power, controller, motor and operator behaviors.]
sensor messages. Upon receiving a message from a sensor it sets an internal attribute that captures
the current status of the sensor. This internal representation of the sensors and the motor is
then responsible for triggering the controller’s action (e.g. sending a message to switch on the
motor) which is captured in the thread for the controller’s main behavior on the left in Figure 11.
This architecture closely models the communication between the sensors and the controller in the real system and will reproduce problems that can be encountered during the system’s runtime.
The top sensor (modeled in the thread shown on the right-hand side of Figure 11) gets its input from the plunger component being at the top. On receiving this message, the top sensor changes into state high and outputs a message to the controller. After receiving a message that the plunger is no longer at the top, the top sensor realizes the state low and again notifies the controller about this state change.
Figure 11. Left: part of the controller thread; right: the top sensor thread.
The plunger changes state in response to messages it receives from the motor and external
events it receives from the physical environment. For example, it transitions from state At bottom
to Rising belowPONR after receiving a message that the motor is on (via internal event motor On);
when the environment signals that the plunger has now passed the PONR (via external input event
risingAbovePONR), the plunger changes to its state Rising abovePONR; and finally after receiving
an external event indicating that the top has been reached, it changes to At top. Upon receiving
a message that the motor is off, it transitions to state Falling slow, and then to Falling fast upon
passing the PONR (indicated by means of external input event fallingFast), and so on.
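A simplified, hand-written sketch of this behavior in SAL is given below. The state and variable names are ours; in the actual model the messages and external events are handled through the translation scheme of Section 2.2 rather than as plain inputs:

  plunger_sketch: CONTEXT =
  BEGIN
    PlungerState: TYPE = {at_bottom, rising_belowPONR, rising_abovePONR,
                          at_top, falling_slow, falling_fast};

    plunger: MODULE =
    BEGIN
      INPUT risingAbovePONR, atTop, fallingFast, atBottom : BOOLEAN  % external input events
      INPUT motorOnMsg, motorOffMsg : BOOLEAN                        % internal events from the motor
      LOCAL state : PlungerState
      INITIALIZATION state = at_bottom
      TRANSITION
      [
        state = at_bottom        AND motorOnMsg      --> state' = rising_belowPONR
        []
        state = rising_belowPONR AND risingAbovePONR --> state' = rising_abovePONR
        []
        state = rising_abovePONR AND atTop           --> state' = at_top
        []
        state = at_top           AND motorOffMsg     --> state' = falling_slow
        []
        state = falling_slow     AND fallingFast     --> state' = falling_fast
        []
        state = falling_fast     AND atBottom        --> state' = at_bottom
        []
        ELSE -->
      ]
    END;
  END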
Further details of the model are available via http://www.itee.uq.edu.au/∼dccs/FMEA/.
Note that other designs are possible which satisfy the functional requirement given above.
For example, in the design studied here, the controller sends a message to switch the motor on
as soon as it receives a message from the bottom sensor that the plunger is at the bottom. In
an alternative design, the controller might instead be looking for a signal that the plunger has
passed the PONR before it checks the bottom sensor state. In the fault-free case the two system
designs will result in exactly the same press behavior, in the sense of how the plunger responds to
operator commands. But in the presence of component faults the two designs have quite different
behaviors.
Note that condition 3 goes beyond a simple static condition of the system, such as particular
component state combinations, and instead relates to a dynamic property, namely that one compo-
nent (the plunger) should not transition to another state before a certain event takes place (namely,
the motor comes on). This is an example where the richness of temporal logic is needed.
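For illustration only, a property of this shape can be written with the until operator; the following sketch is ours and may differ from the formalization of condition 3 actually used in the study:

G((plunger = AtBottom) ⇒ ((plunger = AtBottom) U (motor = On)))

that is, whenever the plunger is at the bottom, it remains there until the motor comes on.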
operator has pushed the button, the motor is switched off, but before the plunger starts to fall (and
drops below the top sensor position), the operator releases the button again. This situation violates
the formalized safety condition 1.
Although such behavior is strictly possible in a real system, its likelihood is low, and the system
would recover by aborting operation once the plunger starts to fall, so that the associated risk is
probably negligible. Since at this stage of the analysis we are more interested in checking for higher-likelihood, higher-risk outcomes, and less interested in detailed timing issues, we opt to use the
prioritization mechanism described in Sections 2.2 and 3.6 above. This has the effect of giving
priority to system-internal behavior over external input. External input events are used in our model
to represent (a) the physical movement of the plunger (i.e. after some time the model receives
the external events risingAbovePONR, atTop, fallingFast, and atBottom, respectively) and (b) the
arbitrary behavior of the operator (i.e. external input events are pushedButton and releasedButton).
These actions consume (some unspecified amount of) time. Internal events in contrast model the
communication between the system components which is almost instantaneous. Therefore, it seems
a reasonable assumption that internal events are faster than external events.
When the prioritization option is used, the model reacts to external input events only once all
internal events have been recognized and reacted to. This has the effect of ruling out behaviors
such as the plunger falling from the top to PONR faster than the controller can communicate with
the motor, or the user releasing the button before the plunger starts to fall (see counter-example
described above).
Technically speaking, the model with prioritization is an under-approximation of the original model, since some behavior is removed. This can of course be dangerous from a safety analysis viewpoint, since it may inadvertently dismiss unsafe behaviors such as race conditions, and it should therefore be used with a great degree of caution. On the other hand, it is a useful technique for checking issues other than timing-related ones, as will be demonstrated below.
When the prioritization feature is enabled within our translation tool and the checks of the
safety conditions are rerun on the fault-free model, this time all four conditions are satisfied. For
the following FMEA, we also use this technique: thus all the results have to be interpreted with
respect to this assumption of prioritized internal actions.
Further experiments were conducted applying our approach in the same fashion as outlined in
Section 4 to two medium-sized systems, namely the mine pump and the AIP. As a result, we
have gained further insight into the practicality and performance results of the proposed method.
As with the press system we have performed an FMEA investigating single-failure modes and
double-failure modes. We did not apply any heuristics to minimize the number of experiments
in double-failure mode; instead, in order to investigate scalability, we applied a brute-force approach that considers every possible combination. The aim was to show how the approach can be applied in
practice and to study its performance.
In the following subsections we briefly describe the case studies and report on the findings for
the FMEA process¶. The last subsection provides further discussion on the performance of our
approach for all three systems.
¶ Please note that the full description of the models used, the hazard conditions, as well as the results of the FMEA can be accessed via http://www.itee.uq.edu.au/∼dccs/FMEA/.
Our model contains internal actions (for which action = internal) which are not observable and thus not relevant in measuring the safety of the system.
To implement the required condition three steps are needed, each of which must occur
immediately.
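Schematically, such a condition takes the form of nested next-state operators, where φ and ψ1, ψ2, ψ3 are placeholders for the triggering condition and the three steps (the concrete conditions are given in the full model on the project website):

G(φ ⇒ X(ψ1 ∧ X(ψ2 ∧ X(ψ3))))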
We found violations of all three requirements when exercising the different single-failure views. In total, 4 of the 12 single-failure modes investigated led to a violation of at least one safety requirement. Interestingly, combining single-failure modes into double-failure modes only revealed violations that had already been detected in a corresponding single-failure mode.
3. As soon as the pump operation is interrupted the drug volume must be re-calculated.
To model the interruption of the pump we utilize the next state operator within the antecedent
of the implication. If the pump is pumping and in the next state it is not pumping, then it
has been interrupted and the drug volume must be calculated. This can be modeled using the
following formula:
G((pump = pumping) ∧ X(¬(pump = pumping)) ⇒ X(aipVol = calculateVol))
However, running the model checker showed us that this condition is violated in the case where the pump is stopped for other reasons. Obviously, in this case the volume cannot be re-calculated since the system as a whole is down. This result hints at a different safety condition, namely how the system can be shut down safely when the power supply breaks down, which is not investigated further here. To check, however, that the system operates
appropriately under normal conditions, we have to modify the formula by extending the
antecedent of the implication: if the pump is pumping and in the next state has stopped
pumping and the controller is still running ((aipPump = running)∧(aip = on)), then in the
next state the drug volume must be re-calculated. The resulting formula reads as follows:

G((pump = pumping) ∧ X(¬(pump = pumping) ∧ (aipPump = running) ∧ (aip = on)) ⇒ X(aipVol = calculateVol))
We have investigated eight single-failure modes, and could identify violations of two of the safety requirements. When checking the double-failure modes we did not find any new violations that were not already found in single-failure mode. However, two of the violations that were found in single-failure mode were eliminated when checked in combination with a second failure mode.
the number of nodes in the BTs is similar. However, the mine pump is a more complex model in that it contains more threads and also more transitions. Most of the behavior in the AIP’s controller thread is modeled atomically, which rules out interleaving behavior between the sensor and actuator threads, thus reducing the complexity of the model’s state space.
These interpretations are also reflected in Table II which compares the complexity of the model
checking experiments of the three models. The number of BDD variables reflects the complexity
of the internal representation of the models in the model checker. The number of visited states
shows the size of the state space that is investigated. This figure does not necessarily correlate
with the number of BDD variables since some state spaces can be represented more compactly
than others. This mismatch can be observed, for instance, between the mine pump and the AIP
model. The number of iterations indicates after how many steps (of applying the transition relation
to the states) a fixpoint could be found and all reachable states had been visited. Obviously, the more
iterations are necessary, the longer the computation takes. However, for some models one iteration
(or step) is a simpler and faster computation than for others. Although the AIP model requires
more iterations, the overall execution time is less than for the mine pump. These figures can vary
from experiment to experiment (i.e. between no-failure, single-point-of-failure, and double-point-
of-failure experiments). The numbers given here are taken from the no-failure experiments.
Columns four and five of Table II summarize the average total execution times for single-point-of-failure experiments and double-point-of-failure experiments, respectively. The experiments were executed on a Dell PowerEdge 2600 Server with a 2.4 GHz Intel Xeon processor and 1 GB of RAM, running RedHat Linux 8.0 as the operating system. The numbers in Table II are quite similar for the press model and the AIP model. That is, the failures have relatively little impact on the complexity of these models. This is different for the mine pump model, where the double-point-of-failure experiments ran longer than the single-point-of-failure experiments. This suggests that for the AIP model, as well as for the press model, the time to compute the satisfiability of the requirements is relatively small compared to the time for building the initial model, and therefore the different runtimes for computing the fixpoint do not become apparent in the same way as for the mine pump.
The last column of Table II shows the total time it took to run all double-failure mode experiments (figures are in minutes). It can be seen that for larger models it will be beneficial to apply heuristics to reduce the number of experiments run at higher failure-mode orders. For the mine pump, for example, it took over 4 days to finalize the checks in double-failure mode. While each experiment run is itself relatively fast, the sum of many runs can be quite large.
The total execution times for each of the model checking experiments (single-failure views vs double-failure views) are presented in Figure 13. (Please be aware that the diagrams have different timescales on the vertical axis.) The results are ordered according to the safety requirements being checked. An analysis of the performance statistics shows a fairly homogeneous distribution across all runs. Variation in the execution times results mainly from the complexity of the system model. In addition, the structure of the formalized safety requirements has an effect on the model-checking performance. For example, for the metal press the third safety requirement has a higher execution time than the other safety conditions; this theorem is defined by three nested temporal operators, one of which is the until operator, resulting in a more complex model checking computation. Finally, if a safety condition is violated, the execution time may increase if a long counter-example needs to be generated.
Figure 13. Total execution time of the case studies with single-failure views and double-failure views. [Six plots: metal press, mine pump and AIP, each for single and double failures; vertical axis: time in seconds, horizontal axis: experiment number.]
6. DISCUSSION
the functional requirements and thus simplifies the validation (since the model obviously satisfies
the requirements). A comparison with other formal system modeling techniques is given in [39].
On the other hand, the modeling notation cannot prevent errors in the model (and the require-
ments) or wrong assumptions which might undermine the results of the proposed FMEA process.
Obviously, modeling has to be done with care and assumptions made have to be documented when
presenting the findings. We find, however, that the use of a model checker is of great help in
debugging the model and validating that it does behave according to the modeler’s intention.
7. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an approach to automating FMEA based on the BT language. This
process uses fault injection experiments and model checking to explore whether component failures
can lead to violations of hazard conditions, formalized as temporal logical formulas. Technically,
we are able to provide tool support for the generation of fault injection experiments for multiple
runtime failures. We argue that the BT language makes fault injection relatively simple, in contrast
to other methods.
We applied our approach to three case studies: the metal press, the mine pump, and the AIP.
For each of the case studies we have formalized the hazard conditions and applied our process
of tool-supported FMEA to investigate single-failure and double-failure modes. As a result, we
could identify several relationships between failure modes and possible hazards, pointing to weaknesses in the design that should be resolved or mitigated. Furthermore, the performance
results for the fault injection experiments are presented and discussed.
In future work, we would like to extend our tool with a feature that animates the counter-
examples directly in the BT that is used for the fault injection experiment. This would help the user
to better visualize the series of steps between occurrence of the fault and occurrence of the hazard.
We also plan to investigate generation of multiple counterexamples [42, 43], to improve coverage
of the different ways that the cause–consequence relationship between a fault and a hazard can
manifest itself.
In this work we only consider un-timed BTs as the foundation for our fault injection experiments. To investigate the consequences of timing failures, timed BTs [44] and timed fault injection experiments [45] would be an alternative that allows for a more detailed FMEA. However,
although theoretically possible, the scalability of the approach remains an open problem, owing to the complexity of the underlying real-time model checking.
Another stream of research is to use the recently developed probabilistic extension of BTs [46] to perform a probabilistic FMEA [47, 48] using a stochastic model checker. This enables the analyst to quantify the cause–consequence relationships that are revealed, and allows the analyst to define tolerable hazard rates for each hazard condition and to experiment with different component failure
rates, for example, to derive quantified component safety requirements. Furthermore, based on a
probabilistic FMEA it is possible to include dependence on the environment and the deployment
of software components to hardware nodes.
REFERENCES
1. Leveson NG. Safeware: System Safety and Computers. Addison-Wesley: Reading, MA, 1995.
2. Lutz RR. Software engineering for safety: a roadmap. ICSE—Future of SE Track. ACM Press: New York, 2000;
213–226.
3. Lutz RR, Woodhouse RM. Requirements analysis using forward and backward search. Annals of Software
Engineering 1997; 3:459–475.
4. Vesely WE, Goldberg FF, Roberts NH, Haasl DF. Fault Tree Handbook. U.S. Nuclear Regulatory Commission,
1996.
5. IEC 61025. IEC (International Electrotechnical Commission) Fault-Tree-Analysis (FTA), 1990.
6. IEC 60812. IEC (International Electrotechnical Commission), Functional safety of electrical/electronical/
programmable electronic safety/related systems, Analysis Techniques for System Reliability—Procedure for
Failure Mode and Effect Analysis (FMEA), 1991.
7. Price CJ, Taylor N. Automated multiple failure FMEA. Reliability Engineering and System Safety 2002;
76(1):1–10.
8. Papadopoulos Y, Walker M, Parker D, Lonn H, Törngren M, Chen D, Johansson R, Sandberg A. Semi-automatic
FMEA supporting complex systems with combinations and sequences of failures. SAE Journal of Passenger
Cars-Mechanical Systems 2009; 2(1):791–802.
9. Walker M, Papadopoulos Y. Qualitative temporal analysis: Towards a full implementation of the Fault Tree
Handbook. Control Engineering Practice 2009; 17(10):1115–1125.
10. Atchison B, Lindsay P, Tombs D. A case study in software safety assurance using formal methods. Technical
Report, University of Queensland, SVRC 99-31, 1999.
11. Reese JD, Leveson NG. Software deviation analysis. Proceedings of the 19th International Conference on Software
Engineering. ACM Press: New York, 1997; 250–261.
12. Bieber P, Castel C, Seguin C. Combination of fault tree analysis and model checking for safety assessment of
complex system. Proceedings of the Fourth European Dependable Computing Conference (EDCC-4) (Lecture
Notes in Computer Science, vol. 2485), Grandoni F (ed.). Springer: Berlin, 2002; 19–31.
13. Schneider F, Easterbrook S, Callahan J, Holzmann G. Validating requirements for fault tolerant systems using
model checking. Proceedings of the Third International Conference on Requirements Engineering. IEEE Computer
Society: Colorado Springs, CO, U.S.A., 1998; 4–13.
14. Bozzano M, Villafiorita A. Improving system reliability via model checking: The FSAP/NuSMV-SA safety
analysis platform. International Conference on Computer Safety, Reliability, and Security (SAFECOMP 2003)
(Lecture Notes in Computer Science, vol. 2788). Springer-Verlag: Berlin, 2003; 49–62.
15. Cichocki T, Górski J. Failure mode and effect analysis for safety-critical systems with software components.
Proceedings of the 19th International Conference on Computer Safety, Reliability and Security, SAFECOMP
2000 (Lecture Notes in Computer Science, vol. 1943), Koornneef F, van der Meulen M (eds.). Springer: Berlin,
2000; 382–394.
16. Grunske L, Lindsay PA, Yatapanage N, Winter K. An automated failure mode and effect analysis based on high-
level design specification with Behavior Trees. Proceedings of the Fifth International Conference on Integrated
Formal Methods (IFM 2005) (Lecture Notes in Computer Science, vol. 3771), Romijn J, Smith G, van de Pol J
(eds.). Springer: Berlin, 2005; 129–149.
17. Heimdahl MPE, Choi Y, Whalen MW. Deviation analysis: A new use of model checking. Automated Software
Engineering 2005; 12(3):321–347.
18. Powell D. Requirements evaluation using Behavior Trees—Findings from industry. Industry Track of Australian
Conference on Software Engineering (ASWEC 2007). Available at: http://www.behaviorengineering.org/docs/
ASWEC07 Industry Powell.pdf [15 March 2010].
19. Boston J. Behavior trees—How they improve engineering behaviour. Sixth Annual Software & Systems Engineering
Process Group Conference (SEPG), Melbourne, Australia, 2008. Available at: http://www.behaviorengineering.org
[15 March 2010].
20. Dromey RG. From requirements to design: Formalizing the key steps. International Conference on Software
Engineering and Formal Methods. IEEE Computer Society: Washington, DC, U.S.A., 2003; 2–13.
21. Behavior Engineering website. Available at: http://www.behaviorengineering.org [15 March 2010].
22. Wen L, Dromey RG. From requirements change to design change: A formal path. International Conference on
Software Engineering and Formal Methods (SEFM 2004). IEEE Computer Society: Washington, DC, U.S.A.,
2004; 104–113.
23. Grunske L, Winter K, Yatapanage N. Defining the abstract syntax of visual languages with advanced graph
grammars—A case study based on Behavior Trees. Journal of Visual Language and Computing 2008;
19(3):343–379.
24. Papacostantinou P, Tran T, Lee P, Phillips V. Implementing a Behaviour Tree analysis tool using Eclipse
development frameworks. Experience Report, Proceedings of the Australian Software Engineering Conference
(ASWEC08). IEEE Computer Society: Washington, DC, U.S.A., 2008; 61–66.
25. Hall A. Seven myths of formal methods. IEEE Software 1990; 7(5):11–19.
26. Berry DM. Formal methods: The very idea—Some thoughts about why they work when they work. Science of
Computer Programming 2002; 42(1):11–27.
27. Colvin R, Hayes IJ. A semantics for Behavior Trees. ACCS Technical Report ACCS-TR-07-01, ARC Centre for
Complex Systems, April 2007.
28. de Moura L, Owre S, Rueß H, Rushby J, Shankar N, Sorea M, Tiwari A. SAL 2. International Conference
on Computer-Aided Verification, (CAV 2004) (Lecture Notes in Computer Science, vol. 3114), Alur R, Peled D
(eds.). Springer: Berlin, 2004; 496–500.
29. Emerson EA. Temporal and modal logic. Handbook of Theoretical Computer Science, vol. B, van Leeuwen J
(ed.). Elsevier Science Publishers: Amsterdam, 1990.
30. Back R, von Wright J. Trace refinement of action systems. Concurrency Theory (CONCUR ’94) (Lecture Notes
in Computer Science, vol. 836), Jonsson B, Parrow J (eds.). Springer: Berlin, 1994; 367–384.
31. Bryant RE. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers 1986;
C-35(8):677–691.
32. BTE. Genetic software engineering tools. Available at: http://www.sqi.gu.edu.au/gse/tools [15 March 2010].
33. Dwyer MB, Avrunin GS, Corbett JC. Patterns in property specifications for finite-state verification. Proceedings
of the 1999 International Conference on Software Engineering (ICSE’99). Association for Computing Machinery:
New York, 1999; 411–421.
34. Bitsch F. Safety patterns—The key to formal specification of safety requirements. International Conference on
Computer Safety, Reliability and Security (SAFECOMP 2001) (Lecture Notes in Computer Science, vol. 2187).
Springer: Berlin, 2001; 176–189.
35. Bondavalli A, Simoncini L. Failure Classification with respect to Detection. Esprit Project Nr 3092 (PDCS:
Predictably Dependable Computing Systems), 1990.
36. Winter K, Yatapanage N. The metal press case study. Available at: http://www.itee.uq.edu.au/∼dccs/FMEA [15
March 2010].
37. Burns A, Lister A. A framework for building dependable systems. The Computer Journal 1991; 34(2):173–181.
38. Winter K, Hayes IJ, Colvin R. Integrating requirements: The behavior tree philosophy. Proceedings of the Eighth
IEEE International Conference on Software Engineering and Formal Methods (SEFM). IEEE Computer Society:
Washington, DC, U.S.A., 2010.
39. Lindsay PA. Behavior trees: From systems engineering to software engineering. Proceedings of the Eighth
IEEE International Conference on Software Engineering and Formal Methods (SEFM). IEEE Computer Society:
Washington, DC, U.S.A., 2010.
40. Tang Z, Dugan JB. BDD-based reliability analysis of phased-mission systems with multimode failures. IEEE
Transactions on Reliability 2006; 55(2):350–360.
41. Yatapanage N, Winter K. Slicing Behavior Tree models for verification. Proceedings of the Sixth IFIP International
Conference on Theoretical Computer Science (TCS 2010), IFIP AICT 323, 2010.
42. Aljazzar H, Hermanns H, Leue S. Counterexamples for timed probabilistic reachability. Proceedings of the Third
International Conference on Formal Modeling and Analysis of Timed Systems (FORMATS 2005) (Lecture Notes
in Computer Science, vol. 3829), Pettersson P, Yi W (eds.). Springer: Berlin, 2005; 177–195.
43. Han T, Katoen JP, Damman B. Counterexample generation in probabilistic model checking. IEEE Transactions
on Software Engineering 2009; 35(2):241–257.
44. Grunske L, Winter K, Colvin R. Timed behavior trees and their application to verifying real-time systems. 18th
Australian Software Engineering Conference (ASWEC 2007). IEEE Computer Society: Washington, DC, U.S.A.,
2007; 211–222.
45. Colvin R, Grunske L, Winter K. Timed behavior trees for failure mode and effects analysis of time-critical
systems. Journal of Systems and Software 2008; 81(12):2163–2182.
46. Colvin R, Grunske L, Winter K. Probabilistic timed Behavior Trees. Proceedings of the Sixth International
Conference on Integrated Formal Methods (IFM 2007) (Lecture Notes in Computer Science, vol. 4591), Davies J,
Gibbons J (eds.). Springer: Berlin, 2007; 156–175.
47. Grunske L, Colvin R, Winter K. Probabilistic model-checking support for FMEA. Proceedings of the Fourth
International Conference on the Quantitative Evaluation of Systems (QEST 2007). IEEE Computer Society:
Washington, DC, U.S.A., 2007; 119–128.
48. Aljazzar H, Fischer M, Grunske L, Kuntz M, Leitner-Fischer F, Leue S. Safety analysis of an airbag system
using probabilistic FMEA and probabilistic counter examples. Sixth International Conference on the Quantitative
Evaluation of Systems. IEEE Computer Society: Washington, DC, U.S.A., 2009; 299–308.